Question

how can I keep only those SNP in file "a" which match to SNPs in file "b"?

0

Entering edit mode

6.8 years ago

Ana ▴ 200

I have a file that I have filtered my SNPs for LD (in the example below;my.filtered.snp.id). I want to keep only these SNPs in my genotype matrix (geno_snp), I figured how to do it in R but the file is so big (~7GB) can not load it in R. I want to keep those lines (the whole line including snp.id and genotype information) in the genotype matrix where snp.id matches with snp.id in my my.filtered.snp.id and delete those that are not match. Can anyone help me to know how can I do it in Perl or python?

head my.filtered.snp.id)
 Chr10_31458
 Chr10_31524
 Chr10_45901
 Chr10_102754
 Chr10_102828
 Chr10_103480

head (geno_snp)
XRQChr10_103805 NA NA NA 0 NA 0 NA NA NA NA NA 0 0
XRQChr10_103937 NA NA NA 0 NA 1 NA NA NA NA NA 0 2
XRQChr10_103990 NA NA NA 0 NA 0 NA NA NA NA NA 0 NA

row genotype file • 1.7k views

ADD COMMENT • link updated 6.8 years ago by WouterDeCoster 47k • written 6.8 years ago by Ana ▴ 200

0

Entering edit mode

6.8 years ago

DVA ▴ 630

You could use dictionary in python.

#create dictionary using your filtered.id file
idDict={}
with open("my.filtered.snp.id", "r") as input:
    for line in input:
        #Assume your line ends with "\n" - if it ends with "\r\n" please change the code below accordingly.
        key = line.split("\n")[0]
        if key not in idDict:
            idDict[key]=0

#for your matrix, compare against the dictionary
with open("geno_snp","r") as matrix, open ("output","w") as matrix_output:
    for line in matrix:
        #assume your deliminator is "\t"
        SNPID = line.split("\t")[0]
        #to exclude "XRQ" --- extract the string from the 4th character
        SNPID = SNPID[3:]

        if SNPID in idDict:
            matrix_output.write(line)

ADD COMMENT • link 6.8 years ago by DVA ▴ 630

1

Entering edit mode

Good to see some Python code here :) Thanks for contributing!
But what's the added value of using a dictionary? Why not just a list?
Also notice that you don't have to check if key in idDdict: duplicates don't make a difference.

My adaptation of your code:

#create list using your filtered.id file
with open("my.filtered.snp.id", "r") as input:
    idlist = [line.strip() for line in input]

#for your matrix, compare against the list
with open("geno_snp","r") as matrix, open ("output","w") as matrix_output:
    for line in matrix:
        if line.split(" ")[0][3:] in idlist:
            matrix_output.write(line)

Also, this: SNPID = line.split("\n")[0] is a bit odd. Why would you split on newlines? I don't think this line works as you think it does. You need to split on the field separator (which is looks like a space in OPs example).

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

You are right about the line. I edited my answer. As for dictionary, I'm used to it to avoid any duplicated record in filtered.id file. That said, I guess even there are duplicated records, it does not matter in this case. Thank you for your reply.

ADD REPLY • link 6.8 years ago by DVA ▴ 630

score 4 · Accepted Answer · 2017-07-11

4

Entering edit mode

6.8 years ago

WouterDeCoster 47k

I would suggest using

grep -f my.filtered.snp.id geno_snp >tadaam.filtered.txt

ADD COMMENT • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Yes, grep can easily do the job for me. I did not try Python scripts though .... thanks guys

ADD REPLY • link 6.8 years ago by Ana ▴ 200