Matching two files and printing lines that appear for the first time

I have two files that look like this:

file1 (unique IDs):

C84610112
C96209347
C84774620
C84774691
C85594749
C89372772
C89651687
C89845500
C89914896
C91269765
C91526663
C92210411
C92254517
C93709504
C94303303
C95100561
C95100609
C95417520
C95696352
C96045246
C96045496
C96060727
C96076986

and file2:

1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
2 C98230482 score: -57.431 nathvy = 47 nconfs = 575
3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
4 C36510773 score: -56.502 nathvy = 38 nconfs = 7595
5 C04355288 score: -56.400 nathvy = 41 nconfs = 50502
6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
7 C96209347 score: -54.713 nathvy = 24 nconfs = 162
8 C96209347 score: -53.901 nathvy = 24 nconfs = 159
9 C06169346 score: -53.438 nathvy = 22 nconfs = 105
10 C95696352 score: -52.848 nathvy = 38 nconfs = 878
11 C98216318 score: -52.061 nathvy = 52 nconfs = 1092
12 C04285713 score: -52.009 nathvy = 38 nconfs = 1355
13 C96209347 score: -51.477 nathvy = 24 nconfs = 1375
14 C98222837 score: -50.730 nathvy = 34 nconfs = 588
15 C98216318 score: -50.694 nathvy = 52 nconfs = 1136
16 C32832068 score: -50.546 nathvy = 22 nconfs = 548
17 C95696352 score: -50.475 nathvy = 38 nconfs = 3220
18 C32832068 score: -50.457 nathvy = 22 nconfs = 16235
19 C95696352 score: -50.234 nathvy = 38 nconfs = 3048
20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
21 C72332782 score: -49.676 nathvy = 41 nconfs = 3942
22 C97970648 score: -49.616 nathvy = 45 nconfs = 17640
23 C04285713 score: -49.594 nathvy = 38 nconfs = 14038
24 C98043133 score: -49.370 nathvy = 43 nconfs = 1236
25 C89372772 score: -49.308 nathvy = 22 nconfs = 471
26 C97970648 score: -49.297 nathvy = 45 nconfs = 17850
27 C85594749 score: -49.122 nathvy = 44 nconfs = 4158
28 C70006381 score: -49.092 nathvy = 24 nconfs = 880

I would like to match the IDs in file1 against the IDs in file2 (second column) and print the matching lines from file2. Some IDs in file2 repeat, such as C96209347 (although the whole lines are not identical). For those, I would like to print only the line where the ID appears for the first time and skip the rest. So in this example, for C96209347 only the third line of file2 should be printed. Can anybody help?

2 Answers

Try this,

$ grep -f file1 file2 | awk '!_[$2]++'
1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536

Explanation

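grep -f file1 file2 reads the patterns (here, the IDs) from file1, one per line, and prints every line of file2 that matches any of them.

awk '!_[$2]++' then keeps only the first line for each ID:

$2 is the second column of the line, i.e. the ID.

_ is the name of an awk array; any valid variable name would do.

_[$2]++ yields the current count stored for that ID and then increments it. The first time a given $2 is seen, _[$2] is unset and evaluates to 0; on every later occurrence it is nonzero.

! negates that value, so the condition is true only on the first occurrence of each ID.

Since no action is given, awk applies its default action, print, to the lines for which the condition is true.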
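To see the two stages working on something small, here is a minimal sketch using made-up sample data. The file names ids.txt and scores.txt and the function name first_hits are hypothetical, and the array is called seen instead of _ for readability. As a variation on the answer (not what was posted), it uses grep -wFf so that each ID is treated as a fixed string and must match a whole word, rather than being interpreted as a regular expression that could also match inside a longer ID. The match can still occur anywhere on the line, not only in the second column.

```shell
#!/bin/sh
# Hypothetical sample data, loosely modeled on the files in the question.
workdir=$(mktemp -d)
printf '%s\n' C96209347 C89372772 > "$workdir/ids.txt"
cat > "$workdir/scores.txt" <<'EOF'
1 C98230482 score: -57.431
2 C96209347 score: -57.128
3 C89372772 score: -55.728
4 C96209347 score: -54.713
EOF

# first_hits IDS DATA: print the first DATA line for each ID listed in IDS.
first_hits() {
    # -F: patterns are fixed strings, -w: match whole words only,
    # -f: read the patterns from a file.
    # awk then keeps a line only if its second column was not seen before.
    grep -wFf "$1" "$2" | awk '!seen[$2]++'
}

first_hits "$workdir/ids.txt" "$workdir/scores.txt"
```

On this sample, line 4 is dropped because C96209347 was already printed from line 2, and line 1 is dropped because its ID is not in ids.txt.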

This works. Thank you very much! All the best
– sergio
Aug 31 at 7:45

Wow, nice and simple solution.
– abu_bua
Sep 24 at 12:35

With awk alone:

$ awk 'NR==FNR {a[$1]=1; next} $2 in a {print; delete a[$2]}' file1 file2
1 C95696352 score: -69.785 nathvy = 38 nconfs = 888
3 C96209347 score: -57.128 nathvy = 24 nconfs = 1188
6 C89372772 score: -55.728 nathvy = 22 nconfs = 3228
20 C85594749 score: -49.780 nathvy = 44 nconfs = 4536
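The awk-only approach can be spelled out with its braces and a comment per rule, which makes the logic easier to follow. The sketch below is one reading of it, run on made-up sample data; the function name match_first and the array name wanted are hypothetical, and the sample files are shortened versions of the ones in the question.

```shell
#!/bin/sh
# Hypothetical sample data, shortened from the files in the question.
workdir=$(mktemp -d)
printf '%s\n' C95696352 C85594749 > "$workdir/file1"
cat > "$workdir/file2" <<'EOF'
1 C95696352 score: -69.785
2 C98230482 score: -57.431
3 C95696352 score: -52.848
4 C85594749 score: -49.780
EOF

# match_first IDS DATA: print the first DATA line for each ID listed in IDS.
match_first() {
    awk '
        # NR == FNR holds only while the first file is being read:
        # remember each ID, then skip to the next input line.
        NR == FNR { wanted[$1] = 1; next }
        # Second file: print a line whose ID is still wanted, then delete
        # the ID so that later lines with the same ID are skipped.
        $2 in wanted { print; delete wanted[$2] }
    ' "$1" "$2"
}

match_first "$workdir/file1" "$workdir/file2"
```

Because the comparison is an exact lookup on the second column, this version cannot match an ID that merely appears elsewhere on a line. One caveat: the NR == FNR test assumes the first file is not empty; with an empty first file, the lines of the second file would be treated as IDs.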
