Sort chunks of rows sequentially of multiple input files using a pattern match
I have chunks of data spread across 100 files that, when re-sorted, follow a numerical sequence. For instance, if I have 100 chunks of data, chunks #1, 3, 5 could be in one file and chunks #2, 4, 6 could be in another. I need to create 1 output file with all the chunks in sequential order: #1,2,3,4,5,6.
Below is a shortened version of 2 (of the 100) input files. Each chunk begins with "ITEM: TIMESTEP" and needs to be organized by the number in the following line (here that's 1000, 2000, 3000, 4000).
INPUT FILE 1
ITEM: TIMETEP
1000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMETEP
3000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...
INPUT FILE 2
ITEM: TIMETEP
2000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMETEP
4000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...
The final output file would look like this
ITEM: TIMETEP
1000
....#rest of chunk#...
ITEM: TIMETEP
2000
....#rest of chunk#...
ITEM: TIMETEP
3000
....#rest of chunk#...
ITEM: TIMETEP
4000
....#rest of chunk#...
So far, I've inserted an identifier string called "IDENTIFIER" before the start of each chunk:
awk -v n=12000 '1; NR%n==0 {print "IDENTIFIER"}' in.txt >> out1.txt
And I can print the N rows needed for each chunk that follows each identifier string, looping through multiple files:
for i in $(seq 1000 1000 10000); do
    awk 'c&&c--; /IDENTIFIER/ {c=12000}' out$i.txt >> out-final.txt
done
I used this method to specifically identify the 2nd row of each chunk because those numbers could be repeated within the chunk itself. However, I don't know how to modify the 2nd command line so that it only prints to out-final.txt when the value after IDENTIFIER is the next number in the sequence.
3 Answers
I suggest another approach: first split the files so that each item is in its own file, then merge the files back in the desired order. For example, for the given two files
$ awk '/^ITEM: TIMETEP/ {h=$0; next}
       h {f="item_"$0; print h > f; h=""}
       {print > f}' file1 file2
will create the four extracts, which can be merged back, simply
$ cat item_{1..4}000 > merged_items
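For the full set of 100 files, the same split-then-merge idea could be sketched as below. This is only an illustration under some assumptions: the two tiny input files and the `workdir` variable are hypothetical stand-ins, timestep values are assumed to be unique across all files, and the `close(f)` call is added so awk holds only one output file open at a time. The merge sorts the extract names numerically instead of listing them by hand.

```shell
#!/bin/sh
# Hypothetical demo data: two tiny stand-ins for the real input files.
workdir=$(mktemp -d) || exit 1
(
  cd "$workdir" || exit 1
  printf '%s\n' 'ITEM: TIMETEP' 1000 'row a' 'ITEM: TIMETEP' 3000 'row c' > file1
  printf '%s\n' 'ITEM: TIMETEP' 2000 'row b' 'ITEM: TIMETEP' 4000 'row d' > file2

  # Split: write each chunk to item_<timestep>. The previous output file is
  # closed before the next one is opened, so open files never pile up.
  awk '/^ITEM: TIMETEP/ { h = $0; next }
       h { if (f != "") close(f); f = "item_" $0; print h > f; h = "" }
       { print > f }' file1 file2

  # Merge: order the extracts numerically by the part after "item_".
  # Assumes the generated file names contain no whitespace.
  printf '%s\n' item_* | sort -t_ -k2,2n | xargs cat > merged_items
)
```

After this runs, `$workdir/merged_items` holds the chunks in the order 1000, 2000, 3000, 4000. The numeric sort matters once the timestep values have different widths, because a plain glob would list `item_10000` before `item_2000`.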
I'd use perl for this
cat file{1,2} | perl -0777 -ne '
    @records = split /^(?=ITEM: TIMETEP)/m;
    print join "",
        map  {$_->[1]}
        sort {$a->[0] <=> $b->[0]}
        map  {($n) = /\n(\d+)\n/; [$n, $_]}
        @records;
'
The -0777 option forces perl to slurp the entire input into a single string. We split that string into records at each header line, use a Schwartzian transform to sort the records by the number on the line after the header, then join the records together again and print.
If you enjoy pain, here's the line-noisy one-liner version:
cat file{1,2} | perl -0777 -pe'$_=join"",map{$_->[1]}sort{$a->[0]<=>$b->[0]}map{[/\n(\d+)\n/,$_]}split/^(?=ITEM: TIMETEP)/m'
Prefix each record with the record ID from line 2 of each record and the line number since the start of that record, sort on that record ID and line number, then remove them again after the sort:
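$ cat tst.sh
awk '
    BEGIN { OFS="\t" }
    # Header line: remember it and restart the per-record line counter.
    /^ITEM: TIMETEP/ { head=$0; lineNr=1; next }
    # Line 2 of the record holds the record ID; emit the saved header first.
    lineNr == 1 { recId=$0; print recId, lineNr, head }
    { print recId, ++lineNr, $0 }
' "$@" |
sort -k1,1n -k2,2n |
cut -f3-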
$ ./tst.sh file1 file2
ITEM: TIMETEP
1000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMETEP
2000
ITEM: NUMBER OF ATOMS
50 2 H 0.4 0.3 0.006
10214 2 H 0.5 0.4 0.002
......#12,000 lines later#...
ITEM: TIMETEP
3000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...
ITEM: TIMETEP
4000
ITEM: NUMBER OF ATOMS
50 2 H 2.3 1.4 0.3
10214 2 H 2.5 1.3 0.6
......#12,000 lines later#...
Since the only command above that processes all of the input "at once" (as opposed to line by line) is sort, it'll work for large numbers of large files, since sort is designed to do paging, etc. to handle large input (see https://unix.stackexchange.com/a/279099/133219).
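The prefix, sort, strip idea can be tried on a small input before running it on the real files. The sketch below is only an illustration: the `demo` directory and its contents are hypothetical, and the first chunk is given 11 data rows on purpose so that the per-record line counter reaches two digits. It sorts on two separate numeric keys, because a single combined key such as `-k1,2n` would compare only the leading number and leave the order of lines within a record to sort's fallback comparison.

```shell
#!/bin/sh
# Hypothetical demo input: chunks stored out of order in one file.
demo=$(mktemp -d) || exit 1
{
  printf '%s\n' 'ITEM: TIMETEP' 2000 'only row'
  printf '%s\n' 'ITEM: TIMETEP' 1000
  i=1
  while [ "$i" -le 11 ]; do
    echo "row $i"
    i=$((i + 1))
  done
} > "$demo/in.txt"

# Decorate each line with <record ID> TAB <line number>, sort numerically on
# the record ID and then on the line number, and strip the two prefix fields.
awk '
    BEGIN { OFS = "\t" }
    /^ITEM: TIMETEP/ { head = $0; lineNr = 1; next }
    lineNr == 1 { recId = $0; print recId, lineNr, head }
    { print recId, ++lineNr, $0 }
' "$demo/in.txt" |
sort -k1,1n -k2,2n |
cut -f3- > "$demo/sorted.txt"
```

In `$demo/sorted.txt` the 1000 chunk comes first with its rows still in the order `row 1` through `row 11`, followed by the 2000 chunk.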
Maybe add close(f) before redefining f. – kvantour, Aug 29 at 7:51