Question

Export Allele Frequency Of All Alleles In The Whole Data For A Subsets Of Individuals

0

Entering edit mode

13.0 years ago

Jianfengmao ▴ 310

Background: I am new to genomics and have few programing skills. I depends on VCF and VCFTools for my genomic analysis.

I have my population genomic data in VCF format which is popular in human 1000 genome project. By using VCFTools, I can get allele frequency statistics for each locus and this allele frequency is summarized by all the individuals in the VCF file.

my question:
I want to prepare an input file for some population genetic software, like BayeScan (http://cmpg.unibe.ch/software/bayescan/index.html) to predicting selection along the genome. I need to export an allele frequency statistics for each populations (a group of individuals), and the allele frequency for the same loci (genomic position) should have the same alleles and in the same order.

But, the allele frequency from VCFTools do not satisfy my need. For some ZERO count allele, it outputs nothing. I need the allele frequency was output for all alleles in the locus for all the populations.

I would like to hearing your directions on my problem. All who depends on VCF will benefit from your suggestions.

Thanks in advance.

Example:

This is an example for my data, here euro, ice, norA are three populations. For the position, 1, 3, 32 there are frequency for two allele each in population euro, but only one for the other two populations. I need the Zero count alleles be output as 0 not nothing.

abt6-mao-mbp:BayeScan jianfengmao$ head euro_chr1.frq.count

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 2 9 T:8 A:1

1 3 2 9 A:8 G:1

1 32 2 9 C:7 G:2

1 69 1 9 A:9

1 75 1 9 G:9

1 87 1 9 C:9

1 88 1 9 T:9

1 116 1 9 A:9

1 141 2 9 C:0 T:9

abt6-mao-mbp:BayeScan jianfengmao$ head ice_chr1.frq.count

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 1 3 T:3

1 3 1 3 A:3

1 32 1 3 C:3

1 69 1 3 A:3

1 75 1 3 G:3

1 87 1 3 C:3

1 88 1 3 T:3

1 116 1 3 A:3

1 141 2 3 C:0 T:3

abt6-mao-mbp:BayeScan jianfengmao$ head norA_chr1.frq.count

CHROM POS NALLELES NCHR {ALLELE:COUNT}

1 1 1 6 T:6

1 3 1 6 A:6

1 32 1 7 C:7

1 69 1 6 A:6

1 75 2 6 G:5 A:1

1 87 1 6 C:6

1 88 1 6 T:6

1 116 1 6 A:6

1 141 2 7 C:0 T:7

vcf vcftools • 6.4k views

ADD COMMENT • link updated 13.0 years ago by lh3 33k • written 13.0 years ago by Jianfengmao ▴ 310

3

Entering edit mode

Hi jianfengmao, this is your 9th questions on Biostar. Could you please validate (or comment) the answers you received for your previous questions ?

ADD REPLY • link 13.0 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Also, several of your questions, including this one, can be easily solved by writing a script. As you have been in this field for at least several months, I would not call you a newcomer. You should really learn a whatever scripting language and solve the problem by yourself. This is your work.

ADD REPLY • link 13.0 years ago by lh3 33k

0

Entering edit mode

Ih3, Your suggestion is just what I want and what I am doing now. I intend to do that by myself and I used all my time on that. And, I have gotten much till now. But, I have not learn everything I need. I have none training on computer and programming, I am sorry for my so many questions here, really sorry. I will try my best to do that by myself. I will learn programming in perl and python, like I started to learn R four years ago in an isolated environment in China. Thanks for what I have gotten from you all.

ADD REPLY • link 13.0 years ago by Jianfengmao ▴ 310

0

Entering edit mode

Pierre Lindenbaum, I will do what you have mentioned. Thanks for your kindness.

ADD REPLY • link 13.0 years ago by Jianfengmao ▴ 310

score 2 · Answer 1 · 2011-04-11

Have a look at a scripting language that allows you to easily read your data files line-by-line. Then split the lines into columns and evaluate the data in those columns respectively. You can use a hash-map to keep track of the counts for each allele (allele names become keys, their accumulated frequency is the respective value).

For Python, it would look a bit like this:

read the file using a for-loop, see "An alternative approach to reading lines is to loop over the file object.", http://docs.python.org/tutorial/inputoutput.html
split the string that represents the line's contents using "split", http://docs.python.org/library/string.html
keep track of allele-to-frequency mappings using a dictionary, see "5.5. Dictionaries", http://docs.python.org/tutorial/datastructures.html

Regarding the frequency count, you would use the hash-map/dictionary as follows (I call the Python dictionary frequencies):

you can check whether you have stored a frequency for an allele x already using if _x_ in frequencies:
if you have a frequency for that allele stored already, do accumulated_count = frequencies[_x_] followed by frequencies[_x_] = accumulated_count + count_of_this_line
if there is no frequency stored for that allele yet, then do frequency[_x_] = count_of_this_line

score 1 · Answer 2 · 2011-04-11

1

Entering edit mode

13.0 years ago

lh3 33k

This Perl script takes a VCF and a sample-population pair list as input and outputs the allele count at each site for each population (you need the -s option). The Lua version has a few comments. Example:

zcat ALL.2of4intersection.20100804.genotypes.vcf.gz | vcf2afs.pl -s 20100804.ALL.panel -

This DOESN'T answer your question, but you can learn from the script.

ADD COMMENT • link 13.0 years ago by lh3 33k

0

Entering edit mode

Dear Ih3, Thanks a lot for your kindness. I was encouraged by your former comments.

ADD REPLY • link 13.0 years ago by Jianfengmao ▴ 310