Obtain the LD matrix from a list of SNP
4
1
Entering edit mode
6.7 years ago
janhuang.cn ▴ 210

I have a list of SNPs, and I would like to generate an LD matrix for it.

Below is an example. I obtained the R2 from SNAP [http://archive.broadinstitute.org/mpg/snap/ldsearch.php]. But this will take a lot of time since I have more than a couple of lists of SNPs.

I have also searched on Ensembl, but it seems that I can only search for the LD between two SNPs each time. [Example : http://www.ensembl.org/Homo_sapiens/Variation/HighLD?db=core;r=7:127081784-127082784;v=rs2283094;vdb=variation;vf=1684883;second_variant_name=rs2283095]

I have also looked at the LDheatmap package in R. But it seems the calculation of LD is based on a SNP information from a study, i.e. one need to provide the genotype of each participants. The snpStats package in Bioconductor seems to serve the same purpose. So these two methods do not seem to fit my situation.

Is there any way I could extract the LD matrix (R2) for CEU population, based on HapMap 22 dataset or 1000G dataset? (Considering I have a lot of lists of SNPs)

Example:

SNP
rs2283094
rs2283095
rs6467111

LD matrix (R2)
                     rs2283094    rs2283095    rs6467111
    rs2283094            1          0.143         0.042 
    rs2283095          0.143          1           0.297 
    rs6467111          0.042        0.297           1
SNP LD • 14k views
ADD COMMENT
2
Entering edit mode
6.7 years ago
aays ▴ 180

If you'd like to use data from the 1000 Genomes project, it's possible to download VCF files and use PLINK 1.9, an extremely fast command line tool for working with genomic data. Briefly, the workflow would be something along the lines of:

  1. Use --make-bed to create binary fileset from the VCF;
  2. Create a file listing all the variants of interest;
  3. Use PLINK's LD functions (namely --r2) on the binary fileset, providing your variant list as input to the --ld-snp-list argument.

PLINK primarily allows for calculation of the r2 statistic, but can can also return pairwise D or D' if specified. These can be returned in either table or matrix format. Also - I see that you're also looking for SNPs where r2 > 0.8. In this case, you can use the --ld-window-r2 argument in the --r2 command to filter out any SNP pairs with r2 below that value.

ADD COMMENT
0
Entering edit mode

I downloaded the 1000 Genomes phase 3 VCF files from this FTP folder [ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/]. I wonder if the one named " " covers chr 1-22? I suspect "wgs" means whole genome, but it seems to me the file size not that big, comparing to the separated files of chr1 to chr22.

I also try plink on my PC, I already put the plink file in the same folder where I store the VCF file. Then I use this command line

plink --vcf ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf --make-bed --out binary_fileset

It returns error: Failed to open ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf

Any idea what is wrong?

ADD REPLY
0
Entering edit mode
6.7 years ago
Emily 23k

You can get linkage across a region in Ensembl.

ADD COMMENT
0
Entering edit mode

But the SNPs I have may not be on the same chromosome.

ADD REPLY
0
Entering edit mode

Then they won't be in LD

ADD REPLY
0
Entering edit mode

Can I get the linkage across the entire chromosome in Ensembl? It seems that it can not display since "The region you have selected is too large to display linkage data; a maximum region of 75kb is allowed. Please change the region using the navigation controls above."

ADD REPLY
0
Entering edit mode

The web-page has a limit of 75kb and the REST-API endpoint. In principle, there is no size limit on getting LD using the Perl API, but in practice the load would be so much that your computer would just explode.

The limits match up to what we actually think will be in LD. Beyond that, variants are so unlikely to be in LD then there's no point in pretending to calculate it.

ADD REPLY
0
Entering edit mode

Thank you. It is good to know this. But in my case, I have a list of SNPs (may be from different chromosomes), and I want to obtain the LD between them (those in the same chromosomes). So I think I have to obtain this from a comprehensive source, since I have too many SNPs, I do not want to manually search it on any website.

ADD REPLY
0
Entering edit mode

How big is your list? Do you just want to compare them to each other? You could use the Perl or REST APIs to get pairwise LD for each (perhaps with a chromosome filter) and print to a table of your own design.

ADD REPLY
0
Entering edit mode

If I have only two lists, each with less than 10 SNPs, I can manually search on Ensemble. But I have more than a hundred lists, and each list may have different number of SNPs. I have two purposes: 1) To generate the LD matrix for each list of SNPs; 2) To identify SNPs with R2>= 0.8 within each list.

ADD REPLY
0
Entering edit mode
5.9 years ago
zx8754 11k

We can use LDlink, either use their website LDlink.

Or use commandline, example below. More info about programmatic access see here:

To get D' prime:

$ curl -k -X GET 'https://analysistools.nci.nih.gov/LDlink/LDlinkRest/ldmatrix?snps=rs2283094%0Ars2283095%0Ars6467111&pop=CEU&r2_d=d'
RS_number       rs2283094       rs2283095       rs6467111
rs2283094       1.0             1.0             1.0
rs2283095       1.0             1.0             1.0
rs6467111       1.0             1.0             1.0

And for R2:

$ curl -k -X GET 'https://analysistools.nci.nih.gov/LDlink/LDlinkRest/ldmatrix?snps=rs2283094%0Ars2283095%0Ars6467111&pop=CEU&r2_d=r'
RS_number       rs2283094       rs2283095       rs6467111
rs2283094       1.0             0.188           0.329
rs2283095       0.188           1.0             0.062
rs6467111       0.329           0.062           1.0
ADD COMMENT
0
Entering edit mode

Hi,

how I would perform the above curl command on a list of snp's say it's called: rs.txt to get R^2 ? (there is 383 snps in that list)

Thanks Ana

ADD REPLY
0
Entering edit mode

You could loop through the list, or when you have a list of SNPs, better just to download the 1000 genome region, and calculate LD locally. This solution is good when we have small number of SNPs.

ADD REPLY

Login before adding your answer.

Traffic: 2188 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6