Extract lines matching a specific column with a pattern file
5.1 years ago
flogin ▴ 280

I have a file with two columns: the first column holds a read ID from a FASTQ file, and the second holds an NCBI taxonomy ID (taxid), like this:

ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067

and I have a second file with the list of taxids for a specific taxon, like this:

2485233
2485231
2059665
2029516
2022430
2022429
1987726
1980986
1738445
1737346

I want to extract the lines whose taxid is present in the second file. However, some of the taxids also occur as numbers inside the read IDs, so I tried to extract only the correct lines with fgrep, but fgrep cannot restrict matching to a column (here, column 2).

Can anyone help me? In short, I need to extract whole lines by matching column 2 against a file of patterns.

Thanks !

grep awk • 1.0k views

output:

$ join -1 1 -2 2 <(sort -k1 test1.txt) <(sort -k2 test.txt) -o 2.1,2.2
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468 512
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468 9606

input:

$ cat test1.txt 
2485233
2485231
2059665
2029516
512
2022430
2022429
9606
1987726
1980986
1738445
1737346

$ cat test.txt 
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067
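
A note on why the sorting matters: join expects both inputs to be sorted on the join field, in the same collation order. Below is a minimal sketch of the same idea with the sort key limited to the join column. The demo files and their names are made up for illustration, and the read file is assumed to be tab-delimited.

```shell
# Hypothetical demo files (not the poster's real data), tab-delimited reads.
tmpdir=$(mktemp -d)
tab=$(printf '\t')
printf '%s\n' 2485233 512 9606 > "$tmpdir/taxids.txt"
printf '%s\t%s\n' \
  'ST-E00243:601:HWJFHCCXY:8:1101:19431:10468' 9606 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19451:10468' 14230 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19512:10468' 512 > "$tmpdir/reads.txt"

# Sort each file on its join field only, using one fixed collation order.
LC_ALL=C sort -k1,1 "$tmpdir/taxids.txt" > "$tmpdir/taxids.sorted"
LC_ALL=C sort -t "$tab" -k2,2 "$tmpdir/reads.txt" > "$tmpdir/reads.sorted"

# Join column 1 of the taxid list with column 2 of the reads,
# printing both columns of the reads file (space-separated by default).
joined=$(LC_ALL=C join -1 1 -2 2 -o 2.1,2.2 \
  "$tmpdir/taxids.sorted" "$tmpdir/reads.sorted")

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

The output comes back in sorted taxid order rather than in the original read order, which may or may not matter for downstream steps.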
5.1 years ago

Assuming tab-delimited files:

cat file1.txt 
2485233
2485231
2059665
2029516
512
2022430
2022429
9606
1987726
1980986
1738445
1737346

cat file2.txt 
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067


awk 'FNR==NR {a[$1]=$1; next} {if ($2 in a) print $0}' file1.txt file2.txt 
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
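
For anyone wondering how this two-file awk idiom works, here is a commented sketch. The file names and contents are invented for the demo, the array name `wanted` is arbitrary, and `-F'\t'` is added only because tab-delimited input is assumed.

```shell
# Hypothetical demo files (not the poster's real data), tab-delimited reads.
tmpdir=$(mktemp -d)
printf '%s\n' 512 9606 2485233 > "$tmpdir/taxids.txt"
printf '%s\t%s\n' \
  'ST-E00243:601:HWJFHCCXY:8:1101:19431:10468' 9606 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19471:10468' 10468 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19492:10468' 1512 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19512:10468' 512 > "$tmpdir/reads.txt"

# FNR restarts at 1 for every input file, NR never restarts,
# so FNR == NR is true only while awk is reading the first file.
matches=$(awk -F'\t' '
  FNR == NR { wanted[$1]; next }   # first file: store each taxid as an array key
  $2 in wanted                     # second file: print lines whose column 2 is a key
' "$tmpdir/taxids.txt" "$tmpdir/reads.txt")

printf '%s\n' "$matches"
rm -rf "$tmpdir"
```

Because the lookup is an exact match on the whole of column 2, the 1512 line is not picked up by the 512 entry, and numbers inside the read ID are never considered.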

This worked, thanks a lot, Kevin!


You are welcome. Goodnight / Buenas noches (if you want an explanation of what the command is doing, then let me know)

$ awk 'FNR==NR {a[$0]++; next} {if ($2 in a) print $0}' file1.txt file2.txt
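
Since the question was originally about fgrep, a grep-only variant is also possible, as a rough sketch: turn each taxid into a regular expression anchored to the end of the line. This assumes the taxid is the last column, is preceded by whitespace, and that the taxid file has no empty lines. The demo files are made up, and for long taxid lists the awk approach is likely to be faster.

```shell
# Hypothetical demo files (not the poster's real data), tab-delimited reads.
tmpdir=$(mktemp -d)
printf '%s\n' 512 10468 > "$tmpdir/taxids.txt"
printf '%s\t%s\n' \
  'ST-E00243:601:HWJFHCCXY:8:1101:19431:10468' 9606 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19471:10468' 10468 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19492:10468' 1512 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19512:10468' 512 > "$tmpdir/reads.txt"

# Plain fixed-string matching also hits the 10468 inside every read ID.
naive_count=$(grep -c -F -f "$tmpdir/taxids.txt" "$tmpdir/reads.txt")

# Rewrite each taxid N as the pattern "[[:space:]]N$" so that it can only
# match a whole last column, then use those patterns with grep -E.
sed 's/.*/[[:space:]]&$/' "$tmpdir/taxids.txt" > "$tmpdir/patterns.txt"
anchored=$(grep -E -f "$tmpdir/patterns.txt" "$tmpdir/reads.txt")

printf 'naive matches: %s\n' "$naive_count"
printf '%s\n' "$anchored"
rm -rf "$tmpdir"
```

In this demo the unanchored search matches all four lines, while the anchored patterns keep only the two lines whose second column is 10468 or 512.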