Biostar Beta. Not for public use.
how to count variants par sample per chromosome in a vcf file?
0
Entering edit mode
15 months ago
nagarsaggi • 0

I want to count number of the variants called on each chromosome for each sample from a multi sample vcf file. Any help would be really appropriated. Thanks Ram

snp • 1.2k views
ADD COMMENTlink
1
Entering edit mode

Use vcfstats function from Rtgtools per sample stats

ADD REPLYlink
1
Entering edit mode
12 months ago
sacha ♦ 1.7k
France

Split your vcf file by sample and count how many times chromosom appear in each file .

FILE=yourfile.vcf
for sample in `bcftools query -l $FILE`
do 
      bcftools view -c1 -H -s $sample -o ${sample}.vcf $FILE
      cat ${sample}.vcf |cut -f1|uniq -c > ${sample}.count
done
ADD COMMENTlink
1
Entering edit mode
11 months ago
JC 7.9k
Mexico

You can use a bash command like:

gunzip -c myfile.vcf.gz | grep -v "#" | cut -f1 | uniq -c

Explanation:

  • gunzip -c myfile.vcf.gz -> decompress the VCF
  • grep -v "#" -> removes all header lines
  • cut -f1 -> keeps only the first column (which is the chromosome id)
  • uniq -c -> counts unique elements in a list
ADD COMMENTlink
0
Entering edit mode

Thank JC for the prompt reply, but with this commend, I get same number of the variants on each chromosome across the samples. The commend counts all the chromosome sites where a variants is called across the samples which would be same for all the samples. What I want is the number of the variants called on each chromosome in a individual sample.

ADD REPLYlink
0
Entering edit mode

you can extract samples per individual with VCFtools, then use the same strategy to count called variants

ADD REPLYlink
0
Entering edit mode

Yes, I tried this but still get same number of variant per site for all individual samples.

ADD REPLYlink
0
Entering edit mode

across the samples.

this was not specified in your original question. How are we supposed to know it ?

I updated the question.

ADD REPLYlink
1
Entering edit mode
13 months ago
France/Nantes/Institut du Thorax - INSE…

A one liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar bioalcidaejdk.jar  -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));'  otavirus_rf.vcf.gz

RF07    S4:1
RF07    S3:2
RF05    S2:4
RF07    S2:2
RF11    S4:1
RF03    S5:2
RF09    S1:1
RF09    S2:1
RF01    S5:1
RF09    S3:1
RF03    S2:4
RF03    S1:1
RF09    S5:1
RF05    S3:4
RF03    S4:2
RF07    S5:1
RF05    S4:2
RF03    S3:4
RF06    S3:1
RF06    S2:1
RF04    S1:3
RF06    S1:1
RF10    S4:2
RF10    S1:1
RF02    S4:1

details:

stream(). /* get a stream of variant */
    flatMap( 
        V->V.getGenotypes(). /* get the genotypes */
            stream(). /* convert to a stream of genotypes */
            filter(G->G.isCalled() && !G.isHomRef()). /* discard NO_CALL && HOM_REF */
            map(G->V.getContig()+"\t"+G.getSampleName()) /* convert to key=contig+sample */
        ).
    collect( 
        Collectors.groupingBy(Function.identity(), Collectors.counting()) /* count each key */
        ).forEach((K,C)->println(K+" : "+C)); /* print */
ADD COMMENTlink
0
Entering edit mode
12 months ago
prasundutta87 • 330

BCFtools stats with any one of the below two options is what you need..

 -s, --samples <list>               list of samples for sample stats, "-" to include all samples
 -S, --samples-file <file>          file of samples to include

Example:

bcftools stats -s - <multisample VCF file>
ADD COMMENTlink
0
Entering edit mode
13 months ago
United States

(deleted, initially missed the fact that hom-ref calls are not wanted here)

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1