Biostar Beta. Not for public use.
Question: Extract lines by matching a specific column against a pattern file

I have a file with 2 columns: the first column holds a read ID from a FASTQ file and the second column holds an NCBI taxonomy ID (taxid), like this:

ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067

and I have a second file with the list of taxids for a specific taxon, like this:

2485233
2485231
2059665
2029516
2022430
2022429
1987726
1980986
1738445
1737346

I want to extract the lines of file 1 whose taxid is present in file 2. However, some of the taxids are numbers that also occur inside the read IDs. I tried to extract only the correct lines with fgrep, but fgrep cannot restrict the match to a specific column (here, column 2).

Can anyone help me? In short, I need to extract whole lines by matching column 2 against a file of patterns.

Thanks!

Posted 11 months ago by flogin (110) • updated 11 months ago by Kevin Blighe (43k)

output:

$ join -1 1 -2 2 <(sort -k1 test1.txt) <(sort -k2 test.txt) -o 2.1,2.2
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468 512
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468 9606

input:

$ cat test1.txt 
2485233
2485231
2059665
2029516
512
2022430
2022429
9606
1987726
1980986
1738445
1737346

$ cat test.txt 
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067
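
One caveat with join is that both inputs must be sorted on the join field, in the same collation that join itself uses. The sketch below makes the sort keys explicit and pins the locale with LC_ALL=C. The scratch directory, the shortened sample data, and the intermediate file names (ids.sorted, reads.sorted) are assumptions for illustration and are not part of the reply above.

```shell
# Use byte-wise collation so that sort and join agree on ordering.
export LC_ALL=C

# Scratch directory and shortened sample data (assumptions for illustration).
workdir=$(mktemp -d)
printf '%s\n' 2485233 512 9606 > "$workdir/test1.txt"
printf '%s\t%s\n' \
  ST-E00243:601:HWJFHCCXY:8:1101:19431:10468 9606 \
  ST-E00243:601:HWJFHCCXY:8:1101:19471:10468 10468 \
  ST-E00243:601:HWJFHCCXY:8:1101:19512:10468 512 > "$workdir/test.txt"

# Sort the taxid list on its only column and the read table on column 2.
sort "$workdir/test1.txt" > "$workdir/ids.sorted"
sort -k2,2 "$workdir/test.txt" > "$workdir/reads.sorted"

# Join field 1 of the taxid list with field 2 of the read table,
# printing only the two columns of the read table.
result=$(join -1 1 -2 2 -o 2.1,2.2 "$workdir/ids.sorted" "$workdir/reads.sorted")
printf '%s\n' "$result"
```

The matching lines come out in sorted taxid order rather than in the original file order, and the columns are separated by a single space.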
Reply posted 11 months ago by cpad0112 (11k)

Assuming tab-delimited files:

cat file1.txt 
2485233
2485231
2059665
2029516
512
2022430
2022429
9606
1987726
1980986
1738445
1737346

cat file2.txt 
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19451:10468  14230
ST-E00243:601:HWJFHCCXY:8:1101:19471:10468  10468
ST-E00243:601:HWJFHCCXY:8:1101:19492:10468  1512
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
ST-E00243:601:HWJFHCCXY:8:1101:19532:10468  1421067


awk 'FNR==NR {a[$1]=$1; next} {if ($2 in a) print $0}' file1.txt file2.txt
ST-E00243:601:HWJFHCCXY:8:1101:19431:10468  9606
ST-E00243:601:HWJFHCCXY:8:1101:19512:10468  512
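
For readers who want to see what the command does step by step, here is a commented sketch of the same two-file awk idiom. The scratch directory, the shortened sample data, and the array name wanted are assumptions for illustration; the file names follow the answer above.

```shell
# Scratch directory so the sketch does not touch real data (assumption).
workdir=$(mktemp -d)

# file1.txt: one taxid per line (shortened sample).
printf '%s\n' 2485233 512 9606 > "$workdir/file1.txt"

# file2.txt: read ID, tab, taxid. Every read ID ends in 10468,
# which is why the match has to be restricted to column 2.
printf '%s\t%s\n' \
  ST-E00243:601:HWJFHCCXY:8:1101:19431:10468 9606 \
  ST-E00243:601:HWJFHCCXY:8:1101:19471:10468 10468 \
  ST-E00243:601:HWJFHCCXY:8:1101:19512:10468 512 > "$workdir/file2.txt"

# FNR==NR is true only while awk reads the first file: merely referencing
# wanted[$1] creates that key, then "next" skips to the following line.
# For the second file, a line is printed only if column 2 is a stored key.
result=$(awk 'FNR==NR { wanted[$1]; next } $2 in wanted' \
  "$workdir/file1.txt" "$workdir/file2.txt")
printf '%s\n' "$result"
```

Because the lookup is an exact comparison on the whole of column 2, a taxid such as 512 does not match 1512, and numbers inside the read ID are never considered. The output keeps the line order of file2.txt.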
Answer posted 11 months ago by Kevin Blighe (43k)

This worked, thanks a lot Kevin !!

Reply posted 11 months ago by flogin (110)

You are welcome. Goodnight / Buenas noches (if you want an explanation of what the command is doing, then let me know)

Reply posted 11 months ago by Kevin Blighe (43k)
$ awk 'FNR==NR {a[$0]++; next} {if ($2 in a) print $0}' file1.txt file2.txt
Reply posted 11 months ago by cpad0112 (11k)
