Question

Extract lines by matching a specific column against a pattern file

5.1 years ago

flogin ▴ 280

I have a file with 2 columns: the first column holds a read ID from a fastq file, and the second column holds an NCBI taxid, like this:

ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067

and I have a second file with the list of taxids for a specific taxon, like this:

I want to extract the lines whose taxid is present in file 2, but some of the taxids are numbers that also occur inside the read ID. So I tried to extract only the correct lines with fgrep, but fgrep can't restrict the match to one column (in this case, column 2).
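To make the problem concrete, here is a minimal sketch of the false positives. The file names `taxids.txt` and `reads.txt` are hypothetical, and the two rows are copied from the sample above; `grep -F` is used as the modern spelling of `fgrep`.

```shell
# Hypothetical scratch files; the rows are taken from the sample in the post.
tmpdir=$(mktemp -d)
printf '%s\n' 10468 > "$tmpdir/taxids.txt"
printf '%s\t%s\n' \
  'ST-E00243:601:HWJFHCCXY:8:1101:19431:10468' 9606 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19471:10468' 10468 > "$tmpdir/reads.txt"

# grep -F searches the whole line, so the 10468 at the end of each
# read ID counts as a hit even when column 2 holds a different taxid.
fgrep_hits=$(grep -c -F -f "$tmpdir/taxids.txt" "$tmpdir/reads.txt")

# Comparing column 2 only gives the number of lines that are really wanted.
col2_hits=$(awk '$2 == "10468" { n++ } END { print n + 0 }' "$tmpdir/reads.txt")

echo "whole-line matches: $fgrep_hits, column-2 matches: $col2_hits"
rm -rf "$tmpdir"
```

Adding `-w` would not help here, because the number inside the read ID is delimited by a colon and a tab and therefore still counts as a whole word.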

Can anyone help me? In short, I need to extract whole lines by matching column 2 against a file of patterns.

Thanks !

grep awk • 1.0k views

ADD COMMENT • link updated 5.1 years ago by Kevin Blighe 87k • written 5.1 years ago by flogin ▴ 280

output:

$ join -1 1 -2 2 <(sort -k1 test1.txt) <(sort -k2 test.txt) -o 2.1,2.2
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468 512
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468 9606

input:

$ cat test1.txt 
2485233
2485231
2059665
2029516
512
2022430
2022429
9606
1987726
1980986
1738445
1737346

$ cat test.txt 
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067

ADD REPLY • link 5.1 years ago by cpad0112 21k

score 4 · Accepted Answer · 2019-03-14

5.1 years ago

Kevin Blighe 87k

Assuming tab-delimited files:

cat file1.txt 
2485233
2485231
2059665
2029516
512
2022430
2022429
9606
1987726
1980986
1738445
1737346

cat file2.txt 
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067


awk 'FNR==NR {a[$1]=$1; next} {if ($2 in a) print $0}' file1.txt file2.txt
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
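For anyone wondering how the command works, it can be read as two passes: one over the taxid list, one over the read table. Below is a commented sketch of the same idea, assuming whitespace-separated columns. The array name `wanted` and the temporary directory are illustrative choices, and the sample rows are a subset of the files shown above.

```shell
# Recreate small versions of the two example files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 2485233 512 9606 > "$tmpdir/file1.txt"
printf '%s\t%s\n' \
  'ST-E00243:601:HWJFHCCXY:8:1101:19431:10468' 9606 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19471:10468' 10468 \
  'ST-E00243:601:HWJFHCCXY:8:1101:19512:10468' 512 > "$tmpdir/file2.txt"

result=$(awk '
  # FNR restarts at 1 for every input file, NR never restarts, so
  # FNR == NR holds only while the first file (the taxid list) is read.
  # Referencing wanted[$1] creates the key; next skips the second rule.
  FNR == NR { wanted[$1]; next }
  # Second file: print the whole line if column 2 is one of the stored keys.
  $2 in wanted
' "$tmpdir/file1.txt" "$tmpdir/file2.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the lookup is an exact match on column 2, the `10468` inside the read IDs is never considered. One caveat: if the first file is empty, `FNR == NR` stays true while the second file is read, so nothing is printed.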

This worked, thanks a lot Kevin !!

ADD REPLY • link 5.1 years ago by flogin ▴ 280

You are welcome. Goodnight / Buenas noches (if you want an explanation of what the command is doing, then let me know)

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

$ awk 'FNR==NR {a[$0]++; next} {if ($2 in a) print $0}' file1.txt file2.txt

ADD REPLY • link 5.1 years ago by cpad0112 21k