This is a beta test.
Question: how to count variants par sample per chromosome in a vcf file?
0
Entering edit mode

I want to count number of the variants called on each chromosome for each sample from a multi sample vcf file. Any help would be really appropriated. Thanks Ram

ADD COMMENTlink 15 months ago nagarsaggi • 0 • updated 15 months ago chrchang523 5.2k
Entering edit mode
1

Use vcfstats function from Rtgtools per sample stats

ADD REPLYlink 15 months ago
cpad0112
11k
1
Entering edit mode

Split your vcf file by sample and count how many times chromosom appear in each file .

FILE=yourfile.vcf
for sample in `bcftools query -l $FILE`
do 
      bcftools view -c1 -H -s $sample -o ${sample}.vcf $FILE
      cat ${sample}.vcf |cut -f1|uniq -c > ${sample}.count
done
ADD COMMENTlink 15 months ago sacha ♦ 1.7k
1
Entering edit mode

You can use a bash command like:

gunzip -c myfile.vcf.gz | grep -v "#" | cut -f1 | uniq -c

Explanation:

  • gunzip -c myfile.vcf.gz -> decompress the VCF
  • grep -v "#" -> removes all header lines
  • cut -f1 -> keeps only the first column (which is the chromosome id)
  • uniq -c -> counts unique elements in a list
ADD COMMENTlink 15 months ago JC 7.9k
Entering edit mode
0

Thank JC for the prompt reply, but with this commend, I get same number of the variants on each chromosome across the samples. The commend counts all the chromosome sites where a variants is called across the samples which would be same for all the samples. What I want is the number of the variants called on each chromosome in a individual sample.

ADD REPLYlink 15 months ago
nagarsaggi
• 0
Entering edit mode
0

you can extract samples per individual with VCFtools, then use the same strategy to count called variants

ADD REPLYlink 15 months ago
JC
7.9k
Entering edit mode
0

Yes, I tried this but still get same number of variant per site for all individual samples.

ADD REPLYlink 15 months ago
nagarsaggi
• 0
Entering edit mode
0

across the samples.

this was not specified in your original question. How are we supposed to know it ?

I updated the question.

ADD REPLYlink 15 months ago
Pierre Lindenbaum
120k
1
Entering edit mode

A one liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar bioalcidaejdk.jar  -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));'  otavirus_rf.vcf.gz

RF07    S4:1
RF07    S3:2
RF05    S2:4
RF07    S2:2
RF11    S4:1
RF03    S5:2
RF09    S1:1
RF09    S2:1
RF01    S5:1
RF09    S3:1
RF03    S2:4
RF03    S1:1
RF09    S5:1
RF05    S3:4
RF03    S4:2
RF07    S5:1
RF05    S4:2
RF03    S3:4
RF06    S3:1
RF06    S2:1
RF04    S1:3
RF06    S1:1
RF10    S4:2
RF10    S1:1
RF02    S4:1

details:

stream(). /* get a stream of variant */
    flatMap( 
        V->V.getGenotypes(). /* get the genotypes */
            stream(). /* convert to a stream of genotypes */
            filter(G->G.isCalled() && !G.isHomRef()). /* discard NO_CALL && HOM_REF */
            map(G->V.getContig()+"\t"+G.getSampleName()) /* convert to key=contig+sample */
        ).
    collect( 
        Collectors.groupingBy(Function.identity(), Collectors.counting()) /* count each key */
        ).forEach((K,C)->println(K+" : "+C)); /* print */
ADD COMMENTlink 15 months ago Pierre Lindenbaum 120k
0
Entering edit mode

BCFtools stats with any one of the below two options is what you need..

 -s, --samples <list>               list of samples for sample stats, "-" to include all samples
 -S, --samples-file <file>          file of samples to include

Example:

bcftools stats -s - <multisample VCF file>
ADD COMMENTlink 15 months ago prasundutta87 • 330
0
Entering edit mode

(deleted, initially missed the fact that hom-ref calls are not wanted here)

ADD COMMENTlink 15 months ago chrchang523 5.2k

Login before adding your answer.

Powered by the version 1.6