Allelic Distance calculation using PED format data
1
0
Entering edit mode
8.7 years ago
aritra90 ▴ 70

Hi,

I need to find the pairwise allelic distance between individuals from a PLINK (PED/MAP) file. I have written a Python code for it and it's taking a long time to compute the similarity distance matrix. My input file is huge (~25 GB) with 5000 individuals and ~2.6 million SNPs. I couldn't find any software package to compute the allelic distance and hence I decided to write my own code. This should be pretty straight forward in Matlab, but, because of the sheer volume of the data, I decided to go with Python (Though, I'm very new to Python and learning my trade around it). I am stating the problem here:

I have a file as such:

FAM001  1  0 0  1  2  A A  G G  A C
FAM002  2  0 0  1  2  A A  A G  0 0

And the allelic distance metric is:

For each SNP count how many alleles differ between two samples at the same locus. If the first sample is AA and the second is AG, then the allelic distance is 1, if the first is AA and the second is GG, then it is two, if they are both the same, it is zero, etc. After finding this for all the locus which is about (~1.3 mil) Sum up over all SNPs and this should be the allelic distance. If there are missing entries in either sample, ignore that SNP. To normalize, we can divide the sum by the number of SNPS taken to compute this, to maintain generality.

My code is this:

import numpy as np
import difflib

lines = np.genfromtxt('data_ped.txt')
s = (len(lines),len(lines))
mat = np.zeros(s)

for I in np.arange(len(lines)):
    val = np.split(lines[i],2)
    j = i+1
    for j in (np.arange(len(lines)-1)):
        c_val = []
        newval = np.split(lines[j],2)
        for k in np.arange(len(val)):
            matching = difflib.SequenceMatcher(None,a=val[k], b=newval[k])
            match = matching.find_longest_match(0,len(matching.a),0,len(matching.b))
            count = len(matching.a[match.a:match.a+match.size])
            c_val.append(count)

        (mat[i])[j] = sum(c_val)/len(val)

np.savetxt('opma.txt',mat)

This is taking awfully long to compute and I am not too sure whether this works as well. It'd be great if someone can interpret the problem and help me with this. Or, if anyone has any knowledge about a software that I can use to compute this, that'd be great as well.

Thanks!

python Allele plink SNP genome • 3.7k views
ADD COMMENT
0
Entering edit mode

Hello aritra90!

It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/31767307

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
0
Entering edit mode

Thanks for bringing my attention to this. I wasn't sure where to post this question as I haven't used the forums extensively before. Sorry about that. I'll remove the other post.

ADD REPLY
3
Entering edit mode
8.7 years ago

With PLINK 1.9:

plink --file [.ped/.map prefix] --distance 1-ibs flat-missing

(I suggest converting to PLINK binary format first, then using --bfile; otherwise PLINK will create binary-format temporary files anyway and throw them away at the end of the computation.) See https://www.cog-genomics.org/plink2/distance#distance for more options.

ADD COMMENT
0
Entering edit mode

I checked this earlier, but was skeptical. I see you use Hamming Distance, which is kind of what I'm looking at as well. You think this will do the exact same thing, what I intend for? Thanks for the link, btw :)

P.S: Are you the developer of PLINK 1.9? (Just guessing, seeing your username)

ADD REPLY
0
Entering edit mode

PLINK 1.9 is insanely faster than the original PLINK, its still running the IBS distance and it has been about 14 hours. But, this took approx 7-8 mins to compute the distance matrix. Thank you for the link again.

ADD REPLY
0
Entering edit mode

Yes, I am the developer.

1.9 actually started out as a pure distance matrix calculator, so that part of the program is especially well-optimized.

ADD REPLY

Login before adding your answer.

Traffic: 2932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6