Sort chunks of rows from multiple input files into sequential order using a pattern match

I have chunks of data spread across 100 files that, when re-sorted, follow a numerical sequence. For instance, if I have 100 chunks of data, chunks #1, 3, and 5 could be in one file and chunks #2, 4, and 6 could be in another. I need to create 1 output file with all the chunks in sequential order: #1, 2, 3, 4, 5, 6.



Below are shortened versions of 2 (of the 100) input files. Each chunk begins with "ITEM: TIMESTEP" and needs to be ordered by the number on the following line (here that's 1000, 2000, 3000, 4000).



INPUT FILE 1



ITEM: TIMESTEP

1000

ITEM: NUMBER OF ATOMS

50 2 H 0.4 0.3 0.006

10214 2 H 0.5 0.4 0.002

......#12,000 lines later#...

ITEM: TIMESTEP

3000

ITEM: NUMBER OF ATOMS

50 2 H 2.3 1.4 0.3

10214 2 H 2.5 1.3 0.6

......#12,000 lines later#...



INPUT FILE 2



ITEM: TIMESTEP

2000

ITEM: NUMBER OF ATOMS

50 2 H 0.4 0.3 0.006

10214 2 H 0.5 0.4 0.002

......#12,000 lines later#...

ITEM: TIMESTEP

4000

ITEM: NUMBER OF ATOMS

50 2 H 2.3 1.4 0.3

10214 2 H 2.5 1.3 0.6

......#12,000 lines later#...



The final output file would look like this:



ITEM: TIMESTEP

1000

....#rest of chunk#...

ITEM: TIMESTEP

2000

....#rest of chunk#...

ITEM: TIMESTEP

3000

....#rest of chunk#...

ITEM: TIMESTEP

4000

....#rest of chunk#...



So far, I've inserted an identifier string called "IDENTIFIER" before the start of each chunk:


awk -v n=12000 '1; NR%n==0 {print "IDENTIFIER"}' in.txt >> out1.txt
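

Assuming each chunk is exactly 12,000 lines (which that command relies on), each out file then looks like the original input with a marker line immediately before every chunk after the first:

ITEM: TIMESTEP
1000
ITEM: NUMBER OF ATOMS
......#12,000 lines later#...
IDENTIFIER
ITEM: TIMESTEP
3000
......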



And I can print the N rows needed for each chunk that follows each identifier string, looping through multiple files:


for i in $(seq 1000 1000 10000); do
    awk 'c && c--; /IDENTIFIER/ {c=12000}' out$i.txt >> out-final.txt
done



I used this method to specifically identify the 2nd row of each chunk because those numbers could be repeated within the chunk itself. However, I don't know how to modify the 2nd command so that it only prints to out-final.txt when the value after IDENTIFIER is the next number in the sequence.
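

For what it's worth, that check can also be brute-forced by keying off the "ITEM: TIMESTEP" header directly instead of the inserted IDENTIFIER. A minimal sketch (the file*.txt glob and the 1000..100000 timestep range are placeholders, and it rescans every file once per chunk, so a single-pass approach like those in the answers below scales far better):

for t in $(seq 1000 1000 100000); do
    awk -v want="$t" '
        /^ITEM: TIMESTEP/ { hdr = $0; grab = 0; next }  # hold the header; decide on the next line
        hdr != "" { grab = ($1+0 == want+0)             # the line after the header is the timestep
                    if (grab) print hdr
                    hdr = "" }
        grab                                            # print the timestep line and the chunk body
    ' file*.txt >> out-final.txt
done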




3 Answers



I suggest another approach: first split the files so that each item is in its own file, then merge the files back in the desired order. For example, for the given two files


$ awk '/^ITEM: TIMESTEP/ {h=$0; next}
       h {f="item_" $0; print h > f; h=""}
       {print > f}' file1 file2



will create the four extracts, which can then be merged back simply:


$ cat item_{1..4}000 > merged_items
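

For the two sample files, the split step leaves one extract per timestep in the working directory (the names follow the item_ prefix built in the script above), so the brace expansion picks them up in order:

$ ls item_*
item_1000  item_2000  item_3000  item_4000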





Maybe add close(f) before redefining f. – kvantour Aug 29 at 7:51
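
With that suggestion applied, the split step might look like this (a sketch; closing the previous extract before opening a new one keeps a run with many chunks from exhausting the open-file limit):

$ awk '/^ITEM: TIMESTEP/ {h=$0; next}
       h {if (f != "") close(f); f="item_" $0; print h > f; h=""}
       {print > f}' file1 file2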





I'd use perl for this


cat file{1,2} | perl -0777 -ne '
    @records = split /^(?=ITEM: TIMESTEP)/m;
    print join "",
        map  { $_->[1] }
        sort { $a->[0] <=> $b->[0] }
        map  { ($n) = /\n(\d+)\n/; [$n, $_] }
        @records;
'



The -0777 option forces perl to slurp the entire input into a single string. We use the header to split the input into records, then use a Schwartzian transform to sort them, and finally join the records together again and print.





If you enjoy pain, here's the line-noisy one-liner version:


cat file{1,2} | perl -0777 -pe'$_=join"",map{$_->[1]}sort{$a->[0]<=>$b->[0]}map{[/\n(\d+)\n/,$_]}split/^(?=ITEM: TIMESTEP)/m'



Prefix each record with the record ID from line 2 of each record and the line number since the start of that record, sort on that record ID and line number, then remove them again after the sort:


$ cat tst.sh
awk '
BEGIN { OFS="\t" }
/^ITEM: TIMESTEP/ { head=$0; lineNr=1; next }
lineNr == 1 { recId=$0; print recId, lineNr, head }
{ print recId, ++lineNr, $0 }
' "$@" |
sort -k1,1n -k2,2n |
cut -f3-

$ ./tst.sh file1 file2
ITEM: TIMESTEP
1000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMESTEP
2000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMESTEP
3000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...
ITEM: TIMESTEP
4000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...



Since the only command above that processes all of the input "at once" (as opposed to line by line) is sort, this will work for large numbers of large files, as sort is designed to use paging etc. to handle large input (see https://unix.stackexchange.com/a/279099/133219).
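

If sort itself ever becomes the bottleneck, GNU sort exposes a few tunables (a sketch; these flags are GNU coreutils specific and the values are only illustrative):

# 2G in-memory buffer (-S), 4 sorting threads (--parallel),
# and temp spill files on a filesystem with room (-T)
sort -S 2G --parallel=4 -T /tmp -k1,1n -k2,2n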







