How to check which samples have the most uncalled genotypes in a multi-sample VCF
3.0 years ago
BAGeno • 130

Hi,

I have a multi-sample VCF in which many sites have uncalled (missing) genotypes. Is there a way to check which samples have the largest number of uncalled genotypes, so that I can exclude those samples from further analysis?

Hello BAGeno,

See my answer in this thread. You just have to adapt the genotype in the awk script or, if it's a small file and speed doesn't matter, use this simpler one.

fin swimmer

17 months ago
France/Nantes/Institut du Thorax - INSE…

A one-liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().flatMap(G->G.getGenotypes().stream()).filter(G->!G.isCalled()).map(G->G.getSampleName()).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->println(K+"\t"+V));' src/test/resources/test_vcf01.vcf  | sort -t $'\t' -k2,2n



S3  8
S4  9
S5  14
S6  18
S2  23
S1  73
  • stream(): get a stream of variants
  • flatMap(G->G.getGenotypes().stream()): map to a stream of genotypes
  • filter(G->!G.isCalled()): keep the uncalled genotypes
  • map(G->G.getSampleName()): map to the sample name
  • collect(Collectors.groupingBy(Function.identity(), Collectors.counting())): convert to an associative array of sample/count
  • forEach((K,V)->println(K+"\t"+V)): print the results
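
If jvarkit is not at hand, the same steps can be sketched in plain Python. This is only a rough illustration, not part of the one-liner above. The function name count_uncalled and the demo records are made up for the example. It reads uncompressed VCF text, skips records without a GT field, and treats a genotype as uncalled when every allele in GT is ".", which may not match htsjdk's isCalled() in every edge case.

```python
from collections import Counter


def count_uncalled(vcf_lines):
    """Return a Counter mapping sample name -> number of uncalled genotypes.

    Assumption: a genotype is "uncalled" when every allele in GT is "."
    (for example "./.", ".|." or "."). Records without GT are skipped.
    """
    samples = []
    counts = Counter()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if len(fields) < 10:
            continue  # no genotype columns in this record
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        for sample, value in zip(samples, fields[9:]):
            parts = value.split(":")
            gt = parts[gt_idx] if gt_idx < len(parts) else "."
            alleles = gt.replace("|", "/").split("/")
            if all(allele == "." for allele in alleles):
                counts[sample] += 1
    return counts


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
        "1\t100\t.\tA\tG\t.\t.\t.\tGT:DP\t./.:0\t0/1:12\t1/1:9",
        "1\t200\t.\tC\tT\t.\t.\t.\tGT:DP\t./.:0\t./.:0\t0/0:7",
        "1\t300\t.\tG\tA\t.\t.\t.\tGT\t.\t0|1\t./1",
    ]
    # Sort by count, ascending, like the sort -k2,2n step in the one-liner.
    for sample, n in sorted(count_uncalled(demo).items(), key=lambda kv: kv[1]):
        print(f"{sample}\t{n}")
```

On a real file, pass an open text handle instead of the list, e.g. count_uncalled(open("input.vcf")), where input.vcf is a placeholder name. As with the one-liner, samples with no uncalled genotypes do not appear in the result.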