VCF Format - how to create the underlying sequence
5.9 years ago
BioBaby • 0

Hi,

I'm starting a project where I need the sequence for a number of individuals, so I have been given a number of large VCF files, which I understand contain the differences from the reference genome. What I want to do is recreate the sequence for each individual for particular genes. Is there already software available that will do this?

In any case, I believe I understand the initial portion, which describes what's present in the reference genome (e.g. A), its position, and the potential alternatives (e.g. G,T). My confusion comes in when I look at the data describing the individuals.

If we take the example on line 3 here http://www.internationalgenome.org/wiki/Analysis/vcf4.0/ (chrom 20, position 1110696, reference A, alternatives G,T), it describes the per-individual information in the format GT:GQ:DP:HQ, so for individual NA00001 this is 1|2:21:6:23,27. When I looked up the 1|2 portion, my understanding is that it represents the two alleles at this position, but which should I be using to reconstruct the sequence? For example, if the reference genome looks like this

"...ATGT A CTGA..." where the A here is at position 1110696, how would the genome for an individual look?

Is it the case that it would look like both "...ATGT G CTGA..." and "...ATGT T CTGA..." ? and is there a way in which I can determine which is the "correct" sequence to use?

I really appreciate any help people can offer me on this. Thanks.


Hello BioBaby,

What I want to do is to recreate the sequence for each individual for particular genes

Could you please describe why you would like to do this?

Is it the case that it would look like both "...ATGT G CTGA..." and "...ATGT T CTGA..." ? and is there a way in which I can determine which is the "correct" sequence to use?

Both are correct. Or better: you need both to describe the sequences that are present in the individual, as it has a diploid genome.
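To make that concrete, here is a minimal Python sketch of how the 1|2 genotype from the example record maps onto two sequences. The function names (`genotype_alleles`, `apply_snv`) are made up for illustration, and it only handles a single SNV with the flanking bases given by hand; it is not a general VCF parser.

```python
def genotype_alleles(ref, alts, sample_field, fmt="GT:GQ:DP:HQ"):
    """Return (called alleles, phased flag) for one sample at one site.

    Hypothetical helper: assumes the FORMAT string and the sample field
    have the same number of colon-separated entries.
    """
    values = dict(zip(fmt.split(":"), sample_field.split(":")))
    gt = values["GT"]
    phased = "|" in gt               # "|" means phased, "/" means unphased
    sep = "|" if phased else "/"
    alleles = [ref] + alts           # index 0 = REF, 1.. = ALT alleles in order
    called = [None if i == "." else alleles[int(i)] for i in gt.split(sep)]
    return called, phased


def apply_snv(left_flank, right_flank, allele):
    """Place a single-base allele between the two flanking sequences."""
    return left_flank + allele + right_flank


# Example record from the question: REF=A, ALT=G,T, NA00001 = 1|2:21:6:23,27
calls, phased = genotype_alleles("A", ["G", "T"], "1|2:21:6:23,27")
haplotypes = [apply_snv("ATGT", "CTGA", a) for a in calls]
print(calls, phased)   # ['G', 'T'] True
print(haplotypes)      # ['ATGTGCTGA', 'ATGTTCTGA']
```

So the individual carries one copy with G and one copy with T at that position, and neither copy carries the reference A.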

To create a new reference sequence taking variants into account, one can use bcftools consensus. But again, please first clarify what your goal is.
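As a rough sketch of what that could look like for one gene region, the snippet below only builds the command lines (it does not run them), one per haplotype. The file names, sample name and region are placeholders, not taken from your data. As far as I know, `-s` selects the sample and `-H 1` / `-H 2` pick the first or second allele of the GT field, which only corresponds to a real haplotype if the calls are phased, and the VCF needs to be bgzip-compressed and indexed.

```python
import shlex


def consensus_commands(ref_fasta, vcf_gz, sample, region):
    """Build one shell pipeline per haplotype (1 and 2) for a single region.

    Hypothetical helper: samtools faidx extracts the reference region and
    bcftools consensus applies the sample's variants to it.
    """
    commands = {}
    for hap in (1, 2):
        extract = ["samtools", "faidx", ref_fasta, region]
        apply_variants = ["bcftools", "consensus", "-s", sample,
                          "-H", str(hap), vcf_gz]
        commands[hap] = shlex.join(extract) + " | " + shlex.join(apply_variants)
    return commands


# Placeholder inputs, assumed for illustration only.
cmds = consensus_commands("ref.fa", "calls.vcf.gz", "NA00001",
                          "20:1110000-1111000")
for hap, cmd in cmds.items():
    print(hap, cmd)
```

Please check the bcftools manual for your installed version before relying on these options.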

fin swimmer


Thanks for getting back to me - the aim is to use these sequences as training data for a machine learning model where I only have the one reading of what I'm hoping to predict from each individual for each gene.

Thanks for linking that, I'll have a look at it now.

5.9 years ago
d-cameron ★ 2.9k

Is it the case that it would look like both "...ATGT G CTGA..." and "...ATGT T CTGA..." ? and is there a way in which I can determine which is the "correct" sequence to use?

For germline human sequencing, both are correct because each person has two copies of each autosomal gene.

It's even more complicated because, unless your input files are fully phased, you don't know which variants are on which copy of the chromosome.

For example, if a person has A/T in position 2 and G/T in position 3, the two copies of the gene could have the sequences nAGn and nTTn, or nTGn and nATn.
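A small Python sketch of that ambiguity, assuming every site passed in is an unphased heterozygous SNV (the function name is made up for illustration). It enumerates every pair of haplotypes consistent with the genotypes; with n such sites there are 2^(n-1) distinct pairs.

```python
from itertools import product


def possible_haplotype_pairs(sites):
    """Enumerate haplotype pairs consistent with unphased het genotypes.

    sites: non-empty list of (allele_a, allele_b) tuples, one per
    heterozygous site, in genomic order. Returns a set of frozensets,
    each holding the two haplotype strings of one possible phasing.
    """
    first, rest = sites[0], sites[1:]
    pairs = set()
    # Fix the orientation of the first site; flip each later site or not.
    for flips in product([False, True], repeat=len(rest)):
        hap1, hap2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        pairs.add(frozenset(("".join(hap1), "".join(hap2))))
    return pairs


# A/T at position 2 and G/T at position 3, as in the example above.
pairs = possible_haplotype_pairs([("A", "T"), ("G", "T")])
for pair in sorted(sorted(p) for p in pairs):
    print(pair)   # ['AG', 'TT'] then ['AT', 'TG']
```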

How do you plan to handle this? The two options can have very different phenotypes even though they have the same set of variants.

Edit: your example indicates that the variants you have are indeed at least partially phased. You will need fully phased variants if you are to resolve the above ambiguity.

the aim is to use these sequences as training data for a machine learning model where I only have the one reading of what I'm hoping to predict from each individual for each gene.

It sounds very much like you need to use a different model, as your current model is a poor reflection of the reality of human genetics and is likely to perform very poorly.
