Hi,
I'm starting a project where I need the sequences of a number of individuals. I have been given several large VCF files, which I understand contain each individual's differences from the reference genome. What I want to do is recreate the sequence of each individual for particular genes. Is there already software available that will do this?
In any case, I believe I understand the initial columns, which describe what is present in the reference genome (e.g. A), its position, and the potential alternatives (e.g. G,T). My confusion comes in when I look at the data describing the individuals.
If we take the example on line 3 here http://www.internationalgenome.org/wiki/Analysis/vcf4.0/ (chrom 20, position 1110696, reference A, alternatives G,T), the per-individual information is given in the format GT:GQ:DP:HQ, so for individual NA00001 it is 1|2:21:6:23,27. When I looked up the 1|2 portion, my understanding was that it represents the two alleles at this position, but which one should I use when reconstructing the sequence? For example, if the reference genome looks like this
"...ATGT A CTGA..." where the A here is at position 1110696, how would the genome for an individual look?
Is it the case that it would look like both "...ATGT G CTGA..." and "...ATGT T CTGA..."? And is there a way to determine which is the "correct" sequence to use?
I really appreciate any help people can offer me on this. Thanks.
Hello BioBaby,
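could you please describe why you would like to do this?
Both are correct. Or better: you need both to have the correct sequences that are present in the individual, as it has a diploid genome. The genotype 1|2 means that one chromosome copy carries the first alternative allele (G) and the other copy carries the second (T).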
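To make the 1|2 example concrete, here is a minimal Python sketch of how the genotype field maps to two sequences. The function names `parse_sample` and `haplotypes` are made up for illustration and are not part of any library, and the flanking bases are the toy "ATGT ... CTGA" context from the question. It assumes a single SNP, a diploid call, and no missing alleles ("."). Note that "|" marks a phased genotype and "/" an unphased one; with unphased calls, which allele goes on which copy is arbitrary once several sites are combined.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's values (e.g. GT:GQ:DP:HQ)."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def haplotypes(left, ref, alts, gt, right):
    """Return one sequence per allele listed in the GT field.

    Assumes a single-base site and no missing alleles ('.').
    """
    alleles = [ref] + alts               # index 0 = REF, 1.. = ALT in order
    sep = "|" if "|" in gt else "/"      # phased vs. unphased separator
    return [left + alleles[int(i)] + right for i in gt.split(sep)]


# Values taken from the example record discussed in the question.
sample = parse_sample("GT:GQ:DP:HQ", "1|2:21:6:23,27")
haps = haplotypes("ATGT", "A", ["G", "T"], sample["GT"], "CTGA")
print(haps)  # ['ATGTGCTGA', 'ATGTTCTGA']
```

For real data with many sites, indels and whole genes, an existing tool is a safer choice than hand-rolled code like this.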
To create a new reference sequence that takes the variants into account, one can use bcftools consensus. But again, please first clarify what your goal is.
fin swimmer
Thanks for getting back to me. The aim is to use these sequences as training data for a machine learning model, where I have only one reading of what I'm hoping to predict from each individual for each gene.
Thanks for linking that, I'll have a look at it now.