Count lines wider than 80 columns, taking tabs correctly into account

To count lines wider than 80 columns I am currently using this command:

$ git grep -h -c -v '^.\{,80\}$' **/*.{c,h,p{l,y}} | awk 'BEGIN { i=0 } { i+=$1 } END { printf ("%d\n", i) }'
44984

Unfortunately, the repo uses tabs for indenting, so the grep pattern
is inaccurate. Is there any way to have the regex treat tabs at the
standard width of 8 chars, the way wc -L does?

For the purpose of this question, we may assume the contributors were disciplined enough to indent consistently, or that they have git commit hooks in lieu of discipline.

For reasons related to performance, I’d prefer a solution that works inside
git-grep(1) or maybe another grep tool, without preprocessing files.

4 Answers

If we can assume per your comment that tab characters will appear only at the beginning of lines, then we can count alternatives to a minimum of 80 characters.

The resulting mess is as follows, with your awk statement summing the individual line counts to provide a grand total:

git grep -hcP '^(.{81,}|\t.{73,}|\t{2}.{65,}|\t{3}.{57,}|\t{4}.{49,}|\t{5}.{41,}|\t{6}.{33,}|\t{7}.{25,}|\t{8}.{17,}|\t{9}.{9,}|\t{10}.)' **/*.{c,h,p{l,y}} | awk '{ i+=$1 } END { printf ("%d\n", i) }'
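
The alternatives in that pattern follow one rule: n leading tabs occupy 8n columns, so a line is too wide once at least 81 - 8n further characters follow them. As a rough sketch (the function name build_pattern and its two parameters are made up for illustration, and it still assumes tabs occur only at the start of a line), the pattern could be generated rather than typed by hand:

```shell
# Hypothetical helper: build the alternation for a column limit
# (default 80) and a tab width (default 8).
# Assumption: tabs appear only as leading indentation.
build_pattern() {
    limit=${1:-80}
    tabw=${2:-8}
    need=$((limit + 1))          # a line is too wide at limit+1 columns
    pat="^(.{$need,}"            # case with no leading tab
    n=1
    while [ $((need - n * tabw)) -gt 0 ]; do
        rest=$((need - n * tabw))    # characters still needed after n tabs
        pat="$pat|\\t{$n}.{$rest,}"
        n=$((n + 1))
    done
    pat="$pat|\\t{$n}"           # n tabs alone already exceed the limit
    printf '%s)\n' "$pat"
}
```

It would then be used as git grep -hcP "$(build_pattern)" in place of the literal pattern. The output is written slightly differently from the hand-made version (for example \t{1} instead of \t) but is meant to match the same lines.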

Note that git grep -P (at least with my 2.18.0 version on Debian here) doesn't work with multi-byte characters. For instance, it considers that © (common in source files) is 2 characters instead of one when encoded in UTF-8. It's OK with -E. You can work around it in UTF-8 locales by writing git grep -hcP '(*UTF8)...'

– Stéphane Chazelas
Sep 14 '18 at 15:01

@StéphaneChazelas if I change to -E then \t isn't recognised. The quick fix for that, I suppose, is to use an ANSI C-quoted string $'...' instead of a single-quoted one '...'.

– roaima
Sep 14 '18 at 16:34

Preprocess the files by piping them through expand. The expand utility will expand tabs appropriately (using the standard tab stops at every 8th character).

find . -type f \( -name '*.[ch]' -o -name '*.p[ly]' \) -exec expand {} + | awk 'length > 80 { n++ } END { print n }'

GNU wc -L doesn't treat TABs as 8 characters; it treats TABs as they would be displayed in a terminal with TAB stops every 8 columns, so a TAB has a "width" ranging from 1 to 8 characters depending on where it is found on the line. wc -L also considers the display width of other characters (whether they're 0, 1 or 2 columns wide) and also processes \f and \r "correctly".

$ printf 'abcde\t\n' | wc -L
8

Here, you could use expand (which by default also assumes tab stops every 8 columns though you can change it with options) to expand those TABs to spaces:

git grep -h '' ./**/*.{c,h,p{l,y}} | expand | tr '\f\r' '\n\n' | grep -cE '.{81}'

(converting the CRs (which when sent to a terminal move the cursor back to the beginning of the line) and FFs (which some display devices understand as a page-break) to LF to get the same behaviour as wc -L, but ignoring the other ones which anyway we can't tell what influence they will have on the display width).
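
To see that a TAB's width depends on where it sits on the line, here is a small sketch (line_widths is a made-up helper name; it assumes ASCII-only input and expand's default tab stops every 8 columns):

```shell
# Hypothetical helper: print the display width of each input line.
# Assumptions: ASCII text only, tab stops every 8 columns (expand default).
line_widths() {
    expand | awk '{ print length($0) }'
}

printf 'abcde\t\n'    | line_widths   # prints 8:  TAB fills columns 5-7
printf '\tabcde\n'    | line_widths   # prints 13: TAB fills columns 0-7
printf 'abcdefgh\t\n' | line_widths   # prints 16: TAB fills columns 8-15
```

The same TAB contributes 3, 8 and 8 columns respectively, which is why a regex that counts every TAB as a fixed number of characters is only accurate when TABs are confined to leading indentation.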

That covers TABs but not zero-width or double-width characters. Note that the GNU implementation of expand currently doesn't expand TABs properly if there are multi-byte characters (let alone zero-width or double-width ones).

$ printf 'ééééé\t\n' | wc -L
8
$ printf 'ééééé\t\n' | expand | wc -L
11

Also note that ./**/*.{c,h,p{l,y}} would by default skip hidden files or files in hidden directories. As the brace expansion expands to several globs, you would also get errors (fatal with zsh or bash -O failglob) if any of those globs fails to match.

With zsh, you'd use ./**/*.(c|h|p[ly])(D.) which is one glob, and where D includes hidden files and . restricts to regular files.

For a solution that takes into account the actual width of characters (assuming all the text files are encoded in the locale's character encoding) you could use:

git grep -h '' ./**/*.(c|h|p[ly])(.) | tr '\r\f' '\n\n' | perl -Mopen=locale -MText::Tabs -MText::CharWidth=mbswidth -lne '
  $n++ if mbswidth(expand($_)) > 80; END{print 0+$n}'

Note that, at least on GNU systems, mbswidth() considers control characters as having a width of -1, whereas expand() counts them as 1. We assume that no control characters other than CR, NL, TAB and FF are found in the files.

“GNU wc -L doesn't treat TABs as 8 characters, it treats TABs as it would be displayed in a terminal with TAB stops every 8 columns.” You’re correct, of course, but when tabs are used only for indenting that boils down to the same thing.

– phg
Sep 14 '18 at 8:22

For the purpose of this question we may assume the contributors were disciplined enough to indent consistently (or that they have git commit hooks in lieu of discipline).

– phg
Sep 14 '18 at 8:57

@phg, looking at the Linux kernel source tree (an old checkout from May I had lying about), @roaima's approach finds 32408 too many lines. Not so much about mixed tab+spc indenting, but because tab is also used for column alignments for table-like sequences of #define symbol value or declarations.

– Stéphane Chazelas
Sep 14 '18 at 13:47

Interesting result, but we’re not dealing with the kernel tree here.

– phg
Sep 14 '18 at 14:16

@jamesqf tab is 8 characters. If you want to change that it's your issue. Almost all utilities that understand tab as a tab action assume 8 character tab stops. Personally I use softtabs at 2 or 4 character indents; as indents build up to 8 my editor replaces them with a tab (unless I'm writing python).

– roaima
Sep 14 '18 at 18:42

A solution with ex (from vi). Albeit slow.

As vi is able to correctly process UTF-8 data:

It can expand tabs to spaces, count control characters as 1, process \r, \t, \f and \v correctly, and also handle most valid Unicode values, including both composed (NFC) and decomposed (NFD) accents, and characters from Cyrillic, Arabic, Greek, Chinese, and many other scripts.

$ cat script.sh
#!/bin/bash --

declare -i count=0

for i do
    # Set ex script in one variable
    a='set expandtab       " Expand tabs to spaces
       r '"$i"'            " Read original file
       g/^.\{,80\}$/d      " Remove all lines shorter than the value used
       wq                  " Quit
      '
    o=outfile; :>"$o"      # Clean output file
    ex -s "$o" <<<"$a"     # process lines in $i file
    count+=$(wc -l <"$o")  # count and accumulate number of lines.
done
echo "$count"

Call script as:

$ script.sh **/*.{c,h,p{l,y}}
44984
