Biostar Beta. Not for public use.
How to find the sites/lines in vcf that are fixed for alleles in two different populations?
0
Entering edit mode
15 months ago
kirannbishwa01 • 1000
United States

Say I have two multi-sample vcf files from two different populations. I want to find the sites (CHROM, POS and alleles) in these vcf files that are fixed for different alleles, which could be REF vs. ALT or ALT1 vs. ALT2.

I was thinking about using pyVCF and comparing all samples to all samples (in Pop1 vs. 2), but it would turn out to be a quite a lengthy and frustrating task.

Any suggestions. Any one done this before or have any proposed solution.

Thanks,

Post Edit:

Lets say we have,

vcf_pop1 = 

#CHROM    POS    FORMAT    pop1_A    pop1_B    pop1_C        (I am skipping other headers)
2    26       GT             0/0           0/1           1/2
2    37       GT             1/1           1/1           1/1
2    105       GT             0/0           0/1           0/2
2    206       GT             0/0           0/0           0/0
2    305       GT             0/0           0/1           1/2
2    517       GT             2/2           2/2           2/2

you can see that alleles at position 37, 206 and 517 are fixed in this population


vcf_pop2 = 

#CHROM    POS    FORMAT    pop1_A    pop1_B    pop1_C        (I am skipping other headers)
2      29       GT             0/1           1/1           1/2
2      37       GT             0/0           0/0           0/0
2      105       GT             0/0           0/1           0/2
2      119       GT             0/1           1/2           1/2
2      218       GT             2/2           2/2           2/
2      305       GT             0/0           0/1           1/2
2      517       GT             1/1           1/1           1/1

Here alleles at position 37, 218 and 517 are fixed.

But, the two population are alternately fixed at position 37 and 517.

ADD COMMENTlink
0
Entering edit mode

I want to find the sites (CHROM, POS and alleles) in these vcf files that are fixed for different alleles, which could be REF vs. ALT or ALT1 vs. ALT2.

This is not clear to me, can you give an example?

ADD REPLYlink
0
Entering edit mode

Lets say we have,

vcf_pop1 = 

#CHROM    POS    FORMAT    pop1_A    pop1_B    pop1_C        (I am skipping other headers)
2                 26       GT             0/0           0/1           1/2
2                 37       GT             1/1           1/1           1/1
2                 105       GT             0/0           0/1           0/2
2                 206       GT             0/0           0/0           0/0
2                 305       GT             0/0           0/1           1/2
2                 517       GT             2/2           2/2           2/2

you can see that alleles at position 37, 206 and 517 are fixed in this population


vcf_pop2 = 

#CHROM    POS    FORMAT    pop1_A    pop1_B    pop1_C        (I am skipping other headers)
2                 29       GT             0/1           1/1           1/2
2                 37       GT             0/0           0/0           0/0
2                 105       GT             0/0           0/1           0/2
2                 119       GT             0/1           1/2           1/2
2                 218       GT             2/2           2/2           2/
2                 305       GT             0/0           0/1           1/2
2                 517       GT             1/1           1/1           1/1

here alleles at position 37, 218 and 517 are fixed.

But, the two population are alternately fixed at position 37 and 517.

Is it clear now ?

ADD REPLYlink
0
Entering edit mode

I see.

But aren't the populations also "alternatively fixed" for position 218? It's reference in pop1 (but not in the vcf) and variant in pop2.

ADD REPLYlink
0
Entering edit mode

@WouterD:

Yes, the allele at position 218 for pop1 could be reference (0/0) or no call (./.). One funny thing about how several VCFcallers work is that: the reference 0/0 is only emited if at least one other sample (in the pool of samples that were called together) had alternate allele/genotype.

Since, the original vcf was called using all the samples of both populations (supplying multiple bam files), and then later separated into different vcf by population. Then nocall GT = ./. were dropped in each population vcf. This suggests that 218 in pop1 was nocall (./.). But, without explaining what I really did, I know its hard to tell what was what.

But, is there a method to called the alternately fixed vcf given this data?

I was thinking if there was a tool to work out this problem (like inside GATK, FreeBayes, SAMTOOLs, etc. etc.) and if the community know about it, before I try to reinvent the wheel.

Btw, did you get the other comment I posted in your other question/answer.

Thanks,

ADD REPLYlink
1
Entering edit mode

I'm not aware of tool which does this directly (although it's definitely possible it exists).

However, I think writing your own python version wouldn't be too bad. If you would

  1. Loop over each file, only keeping lines which are fixed in that population. Create a dictionary in which key=position and value=allele (0, 1 or 2)
  2. Afterwards, look for duplicate positions between both dictionaries (keys). If they have the same allele (=value) you need to discard them from both dictionaries
  3. All position which you have left in both dictionaries are both fixed and unique to that population
ADD REPLYlink
0
Entering edit mode

I had this same thing my mind. But, I am too tired of writing small program for each small problem. I was hoping that such tool already existed, all I need is to find them and put it in my pipeline. I got some response from Perrie, hope it works out.

Thanks again,

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1