Count lines wider than 80 columns, taking tabs correctly into account




To count lines wider than 80 columns, I am currently using this command:


$ git grep -h -c -v '^.\{,80\}$' **/*.{c,h,pl,y} \
    | awk 'BEGIN { i=0 } { i+=$1 } END { printf ("%d\n", i) }'
44984



Unfortunately, the repo uses tabs for indenting, so the grep pattern
is inaccurate. Is there any way to have the regex treat tabs as the
standard width of 8 chars, the way wc -L does?





For the purpose of this question, we may assume the contributors were disciplined enough to indent consistently, or that they have git commit hooks in lieu of discipline.





For reasons related to performance, I’d prefer a solution that works inside
git-grep(1) or maybe another grep tool, without preprocessing files.






4 Answers



If we can assume, per your comment, that tab characters will appear only at the beginning of lines, then we can match a set of alternatives, one per number of leading tabs, each requiring enough remaining characters to push the line past 80 columns.



The resulting mess is as follows, with your awk statement summing the per-file counts to provide a grand total:




git grep -hcP '^(.{81,}|\t.{73,}|\t{2}.{65,}|\t{3}.{57,}|\t{4}.{49,}|\t{5}.{41,}|\t{6}.{33,}|\t{7}.{25,}|\t{8}.{17,}|\t{9}.{9,}|\t{10}.)' **/*.{c,h,pl,y} |
    awk '{ i+=$1 } END { printf ("%d\n", i) }'
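
The alternation follows one rule: N leading tabs occupy 8*N columns, so the rest of the line needs at least 81 - 8*N characters to pass column 80. A minimal sketch that generates the pattern instead of typing it by hand, assuming a hypothetical helper called build_pattern (the name and its two parameters are illustrative, not part of the answer) and the same assumption that tabs appear only at the start of a line:

```shell
# Print a PCRE alternation matching lines wider than LIMIT columns,
# assuming tabs occur only as leading indentation.
# Usage: build_pattern [LIMIT] [TABWIDTH]   (defaults: 80 and 8)
build_pattern() {
    limit=${1:-80}
    tab=${2:-8}
    pat=''
    n=0
    # One alternative per possible number of leading tabs.
    while [ $((n * tab)) -le "$limit" ]; do
        rest=$((limit + 1 - n * tab))          # characters still needed after N tabs
        pat="${pat:+$pat|}\\t{$n}.{$rest,}"    # append, separated by "|"
        n=$((n + 1))
    done
    printf '^(%s)\n' "$pat"
}

build_pattern 80 8
```

The generated `\t{0}` and `.{1,}` forms are only the uniform spelling of the hand-written `.{81,}` and `\t{10}.` alternatives. The result could be passed along as git grep -hcP "$(build_pattern)", with the same file arguments as above.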






Note that git grep -P (at least with my 2.18.0 version on Debian here) doesn't work with multi-byte characters. For instance, it considers that © (common in source files) is 2 characters instead of one when encoded in UTF-8. It's OK with -E. You can work around it in UTF-8 locales by writing git grep -hcP '(*UTF8)...'

– Stéphane Chazelas
Sep 14 '18 at 15:01









@StéphaneChazelas if I change to -E then \t isn't recognised. The quick fix for that, I suppose, is to use an ANSI C-quoted string $'...' instead of a single-quoted one '...'.

– roaima
Sep 14 '18 at 16:34






Preprocess the files by piping them through expand. The expand utility will expand tabs appropriately (using the standard tab stops at every 8th character).




find . -type f \( -name '*.[ch]' -o -name '*.p[ly]' \) -exec expand {} + |
    awk 'length > 80 { n++ } END { print n }'
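
To see what expanding first buys you, here is a small sketch of the same pipeline wrapped in a hypothetical function, count_wide, that reads standard input (the function name and the sample lines are made up for illustration). It assumes ASCII input, since awk's length counts characters or bytes rather than display columns. The third sample has a tab in the middle of the line, which a pattern that only looks at leading tabs would not measure correctly:

```shell
# Count input lines wider than MAX columns (default 80) after expanding tabs.
count_wide() {
    max=${1:-80}
    expand | awk -v max="$max" 'length($0) > max { n++ } END { print n + 0 }'
}

# Tab plus 73 digits: 8 + 73 = 81 columns, so this line is too wide.
printf '\t%073d\n' 0 | count_wide
# Tab plus 72 digits: exactly 80 columns, so this line is fine.
printf '\t%072d\n' 0 | count_wide
# 72 digits, a tab that advances to column 80, then one more character: 81 columns.
printf '%072d\t1\n' 0 | count_wide
```

Printing n + 0 rather than n makes the function output 0 instead of an empty line when nothing is too wide.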



GNU wc -L doesn't treat TABs as 8 characters; it treats them as they would be displayed in a terminal with TAB stops every 8 columns, so a TAB has a "width" ranging from 1 to 8 columns depending on where it is found on the line. wc -L also considers the display width of other characters (whether they're 0, 1 or 2 columns wide) and also processes \f and \r "correctly".




$ printf 'abcde\t\n' | wc -L
8
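
The position-dependent width of a TAB can be checked without wc -L. A rough sketch, assuming ASCII-only input and a hypothetical helper named display_width (not something from the answer), which expands tabs and prints the resulting length of each line:

```shell
# Print the column width of each input line after tab expansion (ASCII only).
display_width() {
    expand | awk '{ print length($0) }'
}

printf 'abcde\t\n'     | display_width   # the tab fills columns 6 to 8
printf 'a\tb\n'        | display_width   # the tab is 7 columns wide here
printf 'abcdefg\tb\n'  | display_width   # the tab is 1 column wide here
printf 'abcdefgh\tb\n' | display_width   # the tab is a full 8 columns wide here
```

The second and third lines end up with the same width even though one has six more characters than the other, which is exactly what a plain character count gets wrong.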



Here, you could use expand (which by default also assumes tab stops every 8 columns though you can change it with options) to expand those TABs to spaces:




git grep -h '' ./**/*.{c,h,pl,y} | expand | tr '\f\r' '\n\n' | grep -cE '.{81}'



(This converts the CRs, which move the cursor back to the beginning of the line when sent to a terminal, and the FFs, which some display devices understand as a page break, to LFs to get the same behaviour as wc -L. Other control characters are ignored, since we can't tell what influence they would have on the display width anyway.)
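
The effect of the tr step can be tried on synthetic input. A sketch under the same ASCII-only assumption, with a hypothetical wrapper named count_wide_wc_style standing in for the tail of the pipeline (the name and the sample lines are illustrative):

```shell
# Count lines wider than MAX columns (default 80), treating CR and FF as line breaks.
count_wide_wc_style() {
    max=${1:-80}
    # grep -c exits non-zero when the count is 0; that is not an error here.
    expand | tr '\f\r' '\n\n' | grep -cE ".{$((max + 1))}" || true
}

# 50 digits, a CR, then 50 more digits: two 50-column pieces, none too wide.
printf '%050d\r%050d\n' 0 0 | count_wide_wc_style
# 81 digits on one line: too wide.
printf '%081d\n' 0 | count_wide_wc_style
```

Without the tr step, the first sample would be counted as a single 101-character line.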





That covers TABs but not zero-width or double-width characters. Note that the GNU implementation of expand currently doesn't expand TABs properly if there are multi-byte characters (let alone zero-width or double-width ones).




$ printf 'ééééé\t\n' | wc -L
8
$ printf 'ééééé\t\n' | expand | wc -L
11



Also note that ./**/*.{c,h,pl,y} would by default skip hidden files or files in hidden directories. As the brace expansion expands to several globs, you would also get errors (fatal with zsh or bash -O failglob) if any of those globs doesn't match.





With zsh, you'd use ./**/*.(c|h|p[ly])(D.) which is one glob, and where D includes hidden files and . restricts to regular files.





For a solution that takes into account the actual width of characters (assuming all the text files are encoded in the locale's character encoding) you could use:


git grep -h '' ./**/*.(c|h|p[ly])(.) | tr '\r\f' '\n\n' |
  perl -Mopen=locale -MText::Tabs -MText::CharWidth=mbswidth -lne '
    $n++ if mbswidth(expand($_)) > 80;
    END{print 0+$n}'



Note that, at least on GNU systems, mbswidth() gives control characters a width of -1, while expand() counts them as 1. We assume that no control characters other than CR, NL, TAB and FF are found in the files.








“GNU wc -L doesn't treat TABs as 8 characters, it treats TABs as it would be displayed in a terminal with TAB stops every 8 columns.” You’re correct, of course, but when tabs are used only for indenting that boils down to the same thing.

– phg
Sep 14 '18 at 8:22






For the purpose of this question we may assume the contributors were disciplined enough to indent consistently (or that they have git commit hooks in lieu of discipline).

– phg
Sep 14 '18 at 8:57






@phg, looking at the Linux kernel source tree (an old checkout from May I had lying about), @roaima's approach finds 32408 too many lines. Not so much about mixed tab+spc indenting, but because tab is also used for column alignments for table-like sequences of #define symbol value or declarations.

– Stéphane Chazelas
Sep 14 '18 at 13:47








Interesting result, but we’re not dealing with the kernel tree here.

– phg
Sep 14 '18 at 14:16






@jamesqf tab is 8 characters. If you want to change that it's your issue. Almost all utilities that understand tab as a tab action assume 8 character tab stops. Personally I use softtabs at 2 or 4 character indents; as indents build up to 8 my editor replaces them with a tab (unless I'm writing python).

– roaima
Sep 14 '18 at 18:42



A solution with ex (from vi), albeit a slow one.



As vi is able to correctly process UTF-8 data:



It can expand tabs to spaces, count control characters as 1, process \r \t \f \v correctly and also process most valid Unicode values, including both composed (NFC) and decomposed (NFD) accents, and characters from Cyrillic, Arabic, Greek, Chinese, and many other scripts.




$ cat script.sh
#!/bin/bash --

declare -i count=0

for i do
    # Set ex script in one variable
    a='set expandtab        " Expand tabs to spaces
       r '"$i"'             " Read original file
       g/^.\{,80\}$/d       " Remove all lines shorter than the value used
       wq                   " Quit '

    o=outfile; :>"$o"       # Clean output file
    ex -s "$o" <<<"$a"      # process lines in $i file
    count+=$(wc -l <"$o")   # count and accumulate number of lines.
done

echo "$count"



Call script as:


$ script.sh **/*.{c,h,pl,y}
44984


