Search a large dbSNP file using another file most efficiently
6.8 years ago
bioguy24 ▴ 230

I am trying to find the fastest (most efficient) tool for searching a very large (~100 GB) file. The query, file1, is a list of rs numbers in a single column (one per line); there may be several hundred. File2 is a sorted list of hg19 HGVS nomenclature for each SNP from dbSNP. The desired result is simply the lines in file2 that match an ID in file1. I have tried grep, awk, and ack, and they all seem to work, but maybe there is a better approach. I read that parallel may help, so I tried that as well, but maybe there is another solution. The awk version is closest to what I need, as it searches for all of the lines in file1. Thank you :).

ack

ack '\t(2307492|7349185)$' file2

grep

grep -P '\t(7349185|2307492)$' file2.txt

awk

awk -F'\t' 'NR == FNR { sub(/^rs/, "", $1); query[$1] = 1; next } $2 in query' file1 file2

awk

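# Requires GNU awk (arrays of arrays).
BEGIN { FS = OFS = "\t" }
NR == FNR {                       # file1: index each rs number
    sub(/^rs/, "", $1)            # file2 stores the number without "rs"
    c = ++num[$1]
    beg[$1][c] = $1
    val[$1][c] = $NF
    next
}
$2 in val {                       # file2: column 2 is the lookup key
    for (c = 1; c <= num[$2]; c++) {
        if (beg[$2][c] == $2) {   # compare (==), do not assign (=)
            print $0, val[$2][c]
            break
        }
    }
}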

awk -f script.awk file1 file2

file1

rs7349185
rs2307492

file2

NC_000001.10:g.26131654G>A  7349185
NC_000001.11:g.25805163G>A  7349185
NG_009930.1:g.9988G>A   7349185
NM_020451.2:c.425G>A    7349185
NM_206926.1:c.323G>A    7349185
NP_065184.2:p.Cys142Tyr 7349185
NP_996809.1:p.Cys108Tyr 7349185
NC_000001.10:g.171168545T>C 2307492
NC_000001.11:g.171199406T>C 2307492
NM_001301347.1:c.-34+2595T>C    2307492
NM_001460.4:c.545T>C    2307492
NP_001451.2:p.Phe182Ser 2307492

parallel

parallel --pipe --block 2M "grep -E '2307492|7349185'" < file2

You can also use grep -f file1 file2.
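One caveat with grep -f here: the IDs in file1 carry an "rs" prefix while the second column of file2 does not, so the prefix has to be stripped before the IDs can serve as patterns. A minimal sketch, assuming file2 is tab-separated as in the example; the temporary directory, the ids.txt name, and the third record in the sample file2 are made up for illustration. -F treats the patterns as fixed strings and -w requires whole-word matches, so 7349185 does not also hit 17349185.

```shell
# Hypothetical sample data mirroring the question (tab-separated file2).
workdir=$(mktemp -d)
printf 'rs7349185\nrs2307492\n' > "$workdir/file1"
printf 'NC_000001.10:g.26131654G>A\t7349185\nNM_001460.4:c.545T>C\t2307492\nNM_000000.1:c.1A>G\t17349185\n' > "$workdir/file2"

# Strip the "rs" prefix so the patterns can match column 2 of file2.
sed 's/^rs//' "$workdir/file1" > "$workdir/ids.txt"

# -F: fixed strings instead of regexes, -w: whole-word matches only.
matches=$(grep -F -w -f "$workdir/ids.txt" "$workdir/file2")
printf '%s\n' "$matches"
```

Note that -w matches anywhere on the line, so in principle a number inside the HGVS string could also match; an exact test on column 2 avoids that.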

grep -f file1 file2 and awk -F'\t' 'NR==FNR{A[$1];next}$2 in A' also work but are pretty slow on these large files. Thank you :).
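If speed is the main concern, one commonly suggested tweak is to run a cheap fixed-string prefilter in the C locale and only then check the column exactly. The sketch below is untested on a 100 GB file, so any gain is an assumption; the sample data, the temporary directory, and the ids.txt name are hypothetical.

```shell
# Hypothetical sample data mirroring the question (tab-separated file2).
workdir=$(mktemp -d)
printf 'rs7349185\nrs2307492\n' > "$workdir/file1"
printf 'NC_000001.10:g.26131654G>A\t7349185\nNM_001460.4:c.545T>C\t2307492\nNM_000000.1:c.1A>G\t17349185\n' > "$workdir/file2"

# file2 stores the rs number without its prefix.
sed 's/^rs//' "$workdir/file1" > "$workdir/ids.txt"

# Stage 1: fixed-string prefilter (substring match, so it may over-select).
# Stage 2: awk keeps only lines whose second column is exactly a listed ID.
hits=$(LC_ALL=C grep -F -f "$workdir/ids.txt" "$workdir/file2" |
    awk -F'\t' 'NR == FNR { ids[$1]; next } $2 in ids' "$workdir/ids.txt" -)
printf '%s\n' "$hits"
```

The prefilter lets through the 17349185 record because it contains 7349185 as a substring; the awk stage then drops it.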

Strictly speaking, this isn't a bioinformatics question, though it uses bioinformatics data. You may have better luck searching or asking on Stack Overflow/Stack Exchange. Here are example1, example2 of similar questions.

Please sort both files on the relevant columns and add a test with join.
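A join test could look like the sketch below. It assumes file2 is tab-separated with the rs number in column 2, and that both inputs are sorted with the same collation (LC_ALL=C here). The sample data and file names are hypothetical. Sorting a 100 GB file is itself expensive, so this probably only pays off if the sorted copy is reused for many queries.

```shell
export LC_ALL=C                 # same byte-wise ordering for sort and join
tab=$(printf '\t')
workdir=$(mktemp -d)
printf 'rs7349185\nrs2307492\n' > "$workdir/file1"
printf 'NC_000001.10:g.26131654G>A\t7349185\nNM_001460.4:c.545T>C\t2307492\nNM_000000.1:c.1A>G\t17349185\n' > "$workdir/file2"

# Strip the "rs" prefix and sort the IDs; sort file2 on its second column.
sed 's/^rs//' "$workdir/file1" | sort -u > "$workdir/ids.sorted"
sort -t "$tab" -k2,2 "$workdir/file2" > "$workdir/file2.sorted"

# Join ID list (field 1) against file2 (field 2); -o keeps file2's layout.
joined=$(join -t "$tab" -1 1 -2 2 -o 2.1,2.2 "$workdir/ids.sorted" "$workdir/file2.sorted")
printf '%s\n' "$joined"
```

Because join compares the whole field, 7349185 does not match the 17349185 record.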
