Question

Does vcftools or GATK have an option to skip indels when making alternate sequence?

0

Entering edit mode

7.7 years ago

severalorks ▴ 110

In my previous question, I asked: I would like to take a vcf file and a reference genome from the 1000Genomes project, and obtain a fasta file that lists the genomes for each individual in the vcf, according to the SNPs each individual has in the vcf file. Answers showed that GATK (FastaAlternativeReferenceMarker) and vcftools (vcf-consensus, http://www.1000genomes.org/faq/are-there-any-fasta-files-containing-1000-genomes-variants-or-haplotypes/) were able to do something similar to this. However, I'd like to skip indels when creating these sequences. Do existing tools have options to do this?

vcf genome fasta alignment sequencing • 4.0k views

ADD COMMENT • link updated 7.7 years ago by Matt Shirley 10k • written 7.7 years ago by severalorks ▴ 110

score 1 · Answer 1 · 2016-07-22

1

Entering edit mode

7.7 years ago

Matt Shirley 10k

You might try just removing INDELs from your VCF file before passing to these tools, but you could also consider looking at the FastaVariant class in pyfaidx. See A: Make fasta file from SNPs in two vcf files for an example.

EDIT: The FastaVariant approach is currently too slow for whole chromosome access.

ADD COMMENT • link 7.7 years ago by Matt Shirley 10k

0

Entering edit mode

I've tried parsing through the vcf files using python to remove indels but I've run it for days and it hasn't finished, and I'm not sure if it's a bug in my script or because there's 84.4 million SNPs to go through. I've looked at FastaVariant; does it create an alternate sequence like FastaAlternateReferenceMaker in GATK?

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

Yes, it creates a variant-substituted FASTA sequence that you can work with using the Fasta interface. It's only designed to handle SNPs, so in fact it does ignore INDELs (I thought this was a design flaw and was thinking about correcting it but maybe not...).

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

0

Entering edit mode

I ran the script and created a FastaVariant object. It works well but it outputs 1 alternative genome. I was wondering if I can use it to get an alternative sequence for each individual (so if I want the genomes of 100 individuals, including HG00097, from the vcf file, can pyfaidx do this?)

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

I think this script would do the trick:

I just realized I hadn't documented the use of the sample argument, so thanks for pointing this out!

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

0

Entering edit mode

Thank you! I will try it

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

I got this error:

Traceback (most recent call last):
  File "pyfaidx_test.py", line 11, in <module>
    sample_fasta.write(record.seq)
AttributeError: 'FastaRecord' object has no attribute 'seq'

What does it mean?

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

Oops. I didn't test before submitting. I just edited the Gist, so please try again.

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

0

Entering edit mode

What's the estimated run time for one individual? I've ran it for around 30 min and it hasn't given output yet for the first individual, HG00096. It's created the file but it's empty so far. I can continue running it for several more hours to see what happens, though I was wondering if I might be doing something if it's supposed to be faster.

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

0

Entering edit mode

Also, just to understand how the code works better, does this give the same DNA sequence for each individual as what you wrote above? It hasn't finished running yet so I don't know:

from pyfaidx import FastaVariant
import vcf

samples = vcf.Reader(open('calls.vcf.gz', 'r')).samples
for sample in samples:
  with FastaVariant('reference.fa', 'calls.vcf.gz', sample=sample, het=True, hom=True) as consensus:
    with open(sample + '.fasta', 'w') as sample_fasta:
      sample_fasta.write('>' + sample)
      sample_fasta.write(str(consensus['20']))

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

That should give you chromosome 20 for each of the individuals, yes.

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

1

Entering edit mode

Thanks! It took 33 seconds for str(consensus['20'][0:100000]) to finish running, so I estimate it'll take 6 hours for str(consensus['20']) to finish running.

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

0

Entering edit mode

LOL. I'm glad the code actually works, and granted the module wasn't purpose built for this, but it's still a bit embarrassing...

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

1

Entering edit mode

I hadn't ever tested fetching entire (large) chromosomes using FastaVariant, and it turns out that I really designed the class to support efficient random access of smaller subsequences. I'm testing your use case right now and you're correct - it's excruciatingly slow. The reason for this is that pysam is used to query every variant position on every chromosome for every individual. It will be much better to use a tool that is designed for this purpose. You might take a look at the recommendation from the 1000 Genomes team: http://www.1000genomes.org/faq/are-there-any-fasta-files-containing-1000-genomes-variants-or-haplotypes/

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

1

Entering edit mode

I've tried the option for 1000genomes, along with many other options (at least 6, spending hours on each option) but the vcf-subset doesn't work for me as it gives me 'Broken VCF header- no columns names?', and Data Slicer crashes (it says page taking too long to respond) when I try to use it to get an entire chromosome for 1 individual. So I think pyfaidx is the best option i have right now, until I figure out why vcf-subset doesn't work.

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

0

Entering edit mode

Actually 100000 takes 30 seconds, while 500000 takes 7 minutes, so 6 hrs is far off. I've tracked how much the time increases for each increment i of 10000 for consensus[0:i]. It might take a day to run, I will see.

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

It's going to depend on SNP density.

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

1

Entering edit mode

Maybe it would be faster if I was to take chunks of it, says str(consensus[0:10000]) + str(consensus[10000:20000]), and add them together. That way it sort of 'parallelizes' it.

EDIT: len(str(consensus['20'][100000:110000])) takes around 12 seconds len(str(consensus['20'][200000:210000])) takes around 12 seconds.

So maybe it could work? I'll test it out.

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

Awesome!! I just ran it in parallel on the sun grid engine from 0 to a million and it all finished in less than 2 minutes!!

I ran the whole thing for 1 individual... finished in around 3-5 minutes. There were around 600 jobs. Next time I can use only 60 jobs and it'll probably finish in 30 min, and 6 jobs (10 mil each) for a couple of hours.

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

Yes, the access times will be linear with the size of the query, independent of the position of the query in the genome. Glad you figured out a way to parallelize your job!

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

0

Entering edit mode

In the phased vcf file it gives the information for each diploid chromosome for each individual. For example, for column HG00096, it has 0|0, or 0|1, etc., where 1 indicates the chromosome has the alternative, while 0 means it has the reference SNP. Does the pyfaidx FastaVariant object only create it for 1 of the 2 chromosomes in a set? Is it possible to use it to get both chromosomes in a set (say, both chromosome 1s?)

I guess it may have something to do with line 790 in the __init__.py script?

if sample.gt_type in self.gt_type and eval(self.filter):

gt_type depends on 'het' and 'hom', which seem to have something to do with heterozygous/homozygous? I'm not sure how to interpret them.

ADD REPLY • link 7.7 years ago by severalorks ▴ 110

1

Entering edit mode

I think what you're asking for is full respect of phasing information, and currently pyfaidx ignores phasing completely. You only get one sequence out, and it's SNP-substituted using either hetero/homozygous combination of SNPs. Really you seem to want haplotypes output separately, but unless your entire call set is completely phased you won't get "just" 2 versions of every chromosome. In fact, for unphased calls you'll get an exponential growth in the number of possible chromosomes. That's why I ignore phrasing currently...

ADD REPLY • link 7.7 years ago by Matt Shirley 10k

0

Entering edit mode

In this thread: VCF to FASTA

You linked to a program called vcf2fasta.cpp. Would that do what I need?

EDIT: I wrote a simple python script that gets phased fasta from vcf for 1 individual, and it takes 5 min to run, so if I run it in parallel for all individuals I should get fastas

ADD REPLY • link 7.7 years ago by severalorks ▴ 110