Help needed parsing VCF files
9.9 years ago
devenvyas ▴ 740

Hello, I need to extract a set of ~330,000 SNPs from a set of extended-VCF files (available here and here). The only information I have on these SNPs is their dbSNP rs#s.

I've tried to do this using vcftools, but vcftools mangled the extended-format aspects of the VCF files, so the output was useless.

I have been told by one of the people who generated the original files that I should NOT use vcftools or GATK, and SHOULD instead use a simple VCF parser (the SAMtools library/tabix implementation, e.g. pysam). I'm fairly new to this, and I don't know how to do that or what most of it means. I know some Perl, Python, and bash, in case that's relevant. Can anyone help me figure out what I should be doing here? Thanks!

-Deven


By "collect a set of SNPs", do you mean grep out the matching lines? And do you have genomic positions corresponding to the SNPs in your list, or just the rsIDs (which could be problematic depending on your dbSNP build)?

If this is sequencing data, I would avoid plink. It just breaks too much...

Pretty sure I can help once I know this. Good luck.


I just have the rsIDs. The VCF files fortunately do have the rsIDs listed for most or all variants that have one.

In regard to "By collect a set of SNPs do you mean grep the line?" — essentially, yes.

Step 1 is to cull the lines I don't need, keep the lines I do need, and leave the kept lines unmodified (i.e., I need the VCF header lines plus the lines for the 330k SNPs).

Step 2 is to filter the SNPs down further, removing any possible strand mis-identifications or post-mortem modifications due to ancient DNA damage.

Steps 3 and 4 and beyond, I don't want to start thinking about yet.
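Step 1 doesn't actually need a full VCF library: header lines all start with # and the rsID sits in the third tab-separated column, so a short stdlib-only script can pass matching lines through untouched. A minimal sketch (file and function names are illustrative, not from the thread):

```python
import gzip

def filter_vcf_by_rsid(vcf_path, rsid_path, out_path):
    """Copy a VCF, keeping every header line ('#...') and only those
    records whose ID column (3rd field) appears in the rsID list.
    Kept record lines are written out completely unmodified."""
    with open(rsid_path) as fh:
        wanted = {line.strip() for line in fh if line.strip()}
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                out.write(line)                       # meta-info / column header
            elif line.split("\t", 4)[2] in wanted:    # ID is the 3rd column
                out.write(line)
```

Because the lines are copied verbatim, none of the non-standard extended fields are disturbed — which is exactly what vcftools failed to guarantee here. With a tabix-indexed file, pysam could do the same by region, but that requires genomic positions rather than rsIDs.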

ADD REPLY
0
Entering edit mode

OK - Just a few more things.

Are you certain the rsIDs are on the same build? The answer to that is almost certainly no, but I have to ask.

If you can be sure the rsIDs will correspond, the problem is simple. If not, we need to do it by genomic position.


All the rsIDs in my list come from the same SNP array (Illumina HumanCNV370-Quad). The VCF file sets use the hg19 genome build as a reference. I don't know which dbSNP build the SNP chip's list comes from, and I don't know which build the rsID labels in the VCF files come from either.


How often do rsIDs move? The notation for genomic position might change, but my understanding (which could very well be wrong) is that once a polymorphism at a given nucleotide is given an rsID, that rsID cannot change unless it turns out one site has been given more than one rsID.


I don't know whether it'll prove to be simple enough, but you can try PLINK 1.9's --vcf.


I don't know how to use PLINK; if I were to use it, how would I filter SNPs by rs# with it?

Anyway, from what I see at https://www.cog-genomics.org/plink2/input#vcf, I doubt it will work.


If you need to retain information from the VCF file beyond just the SNP calls, then you're right, it won't work.

But if all you need are SNPs:

```
plink --vcf [name of vcf file, could be gzipped] --extract [name of text file with rs# list] --make-bed
```

will create a PLINK-format fileset which stores the SNP calls far more efficiently than VCF, and many programs you'd want to use to handle pure SNP data are capable of reading this format.


Thanks for the reply! I need to retain a lot of information from the VCF files, so while I will probably need to turn them into a BED file eventually, this stage is too soon for that.

The VCFs are ancient DNA, so after filtering down to the rs#s I have, I need to filter out any possible strand mis-identification SNPs and possible deamination sites (which is also something I don't currently know how to do). And given the way the modern-human reference data is stored in these files, it would be lost with PLINK.

Any other ideas?


Hmm, this does sound like an annoying situation, since my usual suggestion at this point would be "use vcftools/GATK"... you might need to write a specialized VCF parser. Fortunately, while the full VCF specification is pretty complicated, it sounds like the parts that are actually relevant to your work should be manageable with a few hundred lines of Perl or Python.

As for deamination, see http://en.wikipedia.org/wiki/Transition_%28genetics%29 and http://en.wikipedia.org/wiki/Deamination .
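Deamination of cytosine shows up in ancient DNA as C→T substitutions (and G→A on the complementary strand), which are transitions. One common, conservative approach — not something prescribed in this thread, just a standard trick — is to drop all transition SNPs and keep only transversions, since transversions cannot be deamination artifacts. A tiny sketch:

```python
# Transition pairs: purines A<->G and pyrimidines C<->T.
# Deamination damage (C->T, G->A) is always a transition, so dropping
# all transitions conservatively removes every possible damage site,
# at the cost of also discarding genuine transition polymorphisms.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def is_transition(ref, alt):
    """True if the REF/ALT pair is a transition SNP."""
    return (ref.upper(), alt.upper()) in TRANSITIONS
```

A record-level filter would then keep a VCF line only when is_transition(ref, alt) is False. Conveniently, transversion-only filtering also sidesteps strand mis-identification, because a strand flip turns a transversion into a different transversion, never into its own complement.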

For the record, I probably would use PLINK for most of this workflow. --a2-allele is usually a (clunky, but workable) solution to the reference sequence issue, --flip-scan can generally take care of strand issues, and --recode [text format of choice] + a short script should identify possible deaminations.


The deal is, the folks at the Max Planck made the source files using custom Python scripts. The reference and alternate columns refer to the one or two alleles found in the extinct hominid in question, and the human reference genotype was redirected into non-standard INFO fields, which makes the reference issue very difficult.

For example, if at a given site the Neanderthal had two Gs (regardless of what the polymorphism is in modern humans), they would report G as the reference and . as the alternate, and the human alleles would be included somewhere in the INFO field for that line.
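Even with that non-standard encoding, the INFO column itself still follows the normal VCF key=value;key=value grammar, so a generic parser can recover whatever field the Max Planck scripts stashed the human alleles in. The key name HumanAlleles below is purely hypothetical — the real name has to come from the ##INFO header lines in the actual files:

```python
def parse_info(info_field):
    """Split a VCF INFO string (semicolon-separated key=value pairs)
    into a dict; flag-style keys with no '=' map to True."""
    out = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, _, value = item.partition("=")
            out[key] = value
        elif item:
            out[item] = True
    return out

# 'HumanAlleles' is an invented key for illustration only; check the
# files' ##INFO headers for the real field name.
info = parse_info("DP=22;HumanAlleles=A,G;Indel")
```

Combined with the ID-column filter from earlier in the thread, this is most of what the suggested "simple VCF parser" needs to do.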

While I know some Python and Perl, writing a few hundred lines of code is beyond my current capabilities. I think I am going to email my contact back and try to get clearer instructions on how I am supposed to do this...
