Sort chunks of rows from multiple input files into sequential order using a pattern match

I have chunks of data spread across 100 files that, when re-sorted, follow a numerical sequence. For instance, if I have 100 chunks of data, chunks #1, 3, and 5 could be in one file and chunks #2, 4, and 6 could be in another. I need to create 1 output file with all the chunks in sequential order: #1, 2, 3, 4, 5, 6.



Below are shortened versions of 2 (of the 100) input files. Each chunk begins with "ITEM: TIMESTEP" and needs to be ordered by the number on the following line (here that's 1000, 2000, 3000, 4000).



INPUT FILE 1



ITEM: TIMESTEP

1000

ITEM: NUMBER OF ATOMS

50 2 H 0.4 0.3 0.006

10214 2 H 0.5 0.4 0.002

......#12,000 lines later#...

ITEM: TIMESTEP

3000

ITEM: NUMBER OF ATOMS

50 2 H 2.3 1.4 0.3

10214 2 H 2.5 1.3 0.6

......#12,000 lines later#...



INPUT FILE 2



ITEM: TIMESTEP

2000

ITEM: NUMBER OF ATOMS

50 2 H 0.4 0.3 0.006

10214 2 H 0.5 0.4 0.002

......#12,000 lines later#...

ITEM: TIMESTEP

4000

ITEM: NUMBER OF ATOMS

50 2 H 2.3 1.4 0.3

10214 2 H 2.5 1.3 0.6

......#12,000 lines later#...



The final output file would look like this:



ITEM: TIMESTEP

1000

....#rest of chunk#...

ITEM: TIMESTEP

2000

....#rest of chunk#...

ITEM: TIMESTEP

3000

....#rest of chunk#...

ITEM: TIMESTEP

4000

....#rest of chunk#...



So far, I've inserted an identifier string called "IDENTIFIER" before the start of each chunk:


awk -v n=12000 '1; NR%n==0 {print "IDENTIFIER"}' in.txt >> out1.txt
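

Assuming each chunk is exactly 12,000 lines (which that command relies on), each out file then looks like the original input with a marker line immediately before every chunk after the first:

ITEM: TIMESTEP
1000
ITEM: NUMBER OF ATOMS
......#12,000 lines later#...
IDENTIFIER
ITEM: TIMESTEP
3000
......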



And I can print the N rows needed for each chunk that follows each identifier string, looping through multiple files:


for i in $(seq 1000 1000 10000); do
    awk 'c && c--; /IDENTIFIER/ {c=12000}' out$i.txt >> out-final.txt
done



I used this method to specifically identify the 2nd row of each chunk because those numbers could be repeated within the chunk itself. However, I don't know how to modify the 2nd command so that it only prints to out-final.txt when the value after IDENTIFIER is the next number in the sequence.
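

For what it's worth, that check can also be brute-forced by keying off the "ITEM: TIMESTEP" header directly instead of the inserted IDENTIFIER. A minimal sketch (the file*.txt glob and the 1000..100000 timestep range are placeholders, and it rescans every file once per chunk, so a single-pass approach like those in the answers below scales far better):

for t in $(seq 1000 1000 100000); do
    awk -v want="$t" '
        /^ITEM: TIMESTEP/ { hdr = $0; grab = 0; next }  # hold the header; decide on the next line
        hdr != "" { grab = ($1+0 == want+0)             # the line after the header is the timestep
                    if (grab) print hdr
                    hdr = "" }
        grab                                            # print the timestep line and the chunk body
    ' file*.txt >> out-final.txt
done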




3 Answers



I suggest another approach: first split the files so that each item is in its own file, then merge the files back in the desired order. For example, for the given two files


$ awk '/^ITEM: TIMESTEP/ {h=$0; next}
       h {f="item_" $0; print h > f; h=""}
       {print > f}' file1 file2



will create the four extracts, which can then be merged back simply:


$ cat item_{1..4}000 > merged_items
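

For the two sample files, the split step leaves one extract per timestep in the working directory (the names follow the item_ prefix built in the script above), so the brace expansion picks them up in order:

$ ls item_*
item_1000  item_2000  item_3000  item_4000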





Maybe add close(f) before redefining f. – kvantour Aug 29 at 7:51
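
With that suggestion applied, the split step might look like this (a sketch; closing the previous extract before opening a new one keeps a run with many chunks from exhausting the open-file limit):

$ awk '/^ITEM: TIMESTEP/ {h=$0; next}
       h {if (f != "") close(f); f="item_" $0; print h > f; h=""}
       {print > f}' file1 file2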





I'd use perl for this


cat file{1,2} | perl -0777 -ne '
    @records = split /^(?=ITEM: TIMESTEP)/m;
    print join "",
        map  { $_->[1] }
        sort { $a->[0] <=> $b->[0] }
        map  { ($n) = /\n(\d+)\n/; [$n, $_] }
        @records;
'



The -0777 option forces perl to slurp the entire input into a single string. We use the header to split the input into records, then use a Schwartzian transform to sort them, and finally join the records together again and print.





If you enjoy pain, here's the line-noisy one-liner version:


cat file{1,2} | perl -0777 -pe'$_=join"",map{$_->[1]}sort{$a->[0]<=>$b->[0]}map{[/\n(\d+)\n/,$_]}split/^(?=ITEM: TIMESTEP)/m'



Prefix each record with the record ID from line 2 of each record and the line number since the start of that record, sort on that record ID and line number, then remove them again after the sort:


$ cat tst.sh
awk '
BEGIN { OFS="\t" }
/^ITEM: TIMESTEP/ { head=$0; lineNr=1; next }
lineNr == 1 { recId=$0; print recId, lineNr, head }
{ print recId, ++lineNr, $0 }
' "$@" |
sort -k1,1n -k2,2n |
cut -f3-

$ ./tst.sh file1 file2
ITEM: TIMESTEP
1000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMESTEP
2000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMESTEP
3000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...
ITEM: TIMESTEP
4000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...



Since the only command above that processes all of the input "at once" (as opposed to line by line) is sort, this will work for large numbers of large files, as sort is designed to use paging etc. to handle large input (see https://unix.stackexchange.com/a/279099/133219).
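

If sort itself ever becomes the bottleneck, GNU sort exposes a few tunables (a sketch; these flags are GNU coreutils specific and the values are only illustrative):

# 2G in-memory buffer (-S), 4 sorting threads (--parallel),
# and temp spill files on a filesystem with room (-T)
sort -S 2G --parallel=4 -T /tmp -k1,1n -k2,2n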







