Estimating cross-contamination in a set of BAMs
5.8 years ago

Hi all,

I've received a set of BAM files. The variants were called with bcftools:

    ${bcftools_exe} mpileup -Ou -f "${REF}" \
            --bam-list "${bam_list}" \
            --regions-file "${bedfile}" \
            --annotate 'FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/ADR'  \
            --redo-BAQ --adjust-MQ 50  --min-MQ 30  |\
    ${bcftools_exe} call \
            --ploidy GRCh37 \
            --multiallelic-caller \
            --variants-only -O z -o "output.vcf.gz"

I suspect cross-contamination between the samples, because many of the genotypes called as HOM_REF contain a few ALT reads:

 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | Sample  | Type    | AD     | ADF   | ADR   | DP  | GT  | PL        | SP |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
 | 28D0609 | HOM_REF | 206,15 | 97,9  | 109,6 | 221 | 0/0 | 0,255,255 | 4  |
 | 37D1676 | HOM_REF | 154,10 | 89,5  | 65,5  | 164 | 0/0 | 0,229,255 | 1  |
 | 13D0720 | HET     | 170,59 | 92,27 | 78,32 | 229 | 0/1 | 134,0,255 | 5  |
 | 37D1631 | HOM_REF | 155,16 | 73,8  | 82,8  | 171 | 0/0 | 0,76,255  | 0  |
 | 57D1188 | HOM_REF | 85,0   | 39,0  | 46,0  | 85  | 0/0 | 0,255,255 | 0  |
 | 14D2313 | HOM_REF | 101,0  | 50,0  | 51,0  | 101 | 0/0 | 0,255,255 | 0  |
 | 24D2314 | HOM_REF | 48,0   | 18,0  | 30,0  | 48  | 0/0 | 0,144,255 | 0  |
 | 24D0430 | HOM_REF | 64,0   | 31,0  | 33,0  | 64  | 0/0 | 0,193,255 | 0  |
 | 18D0610 | HOM_REF | 55,0   | 29,0  | 26,0  | 55  | 0/0 | 0,166,255 | 0  |
 +---------+---------+--------+-------+-------+-----+-----+-----------+----+
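To put numbers on the table, the ALT read fraction of each sample can be computed from its AD field. This is only a minimal sketch using a few of the AD values above; the dictionary and function names are made up for illustration.

```python
# AD values (REF reads, ALT reads) copied from the table above.
ad_by_sample = {
    "28D0609": (206, 15),  # called HOM_REF
    "37D1676": (154, 10),  # called HOM_REF
    "13D0720": (170, 59),  # called HET
    "37D1631": (155, 16),  # called HOM_REF
    "57D1188": (85, 0),    # called HOM_REF
}


def alt_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting ALT; 0.0 when there is no coverage."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


alt_fractions = {s: alt_fraction(*ad) for s, ad in ad_by_sample.items()}
for sample, frac in alt_fractions.items():
    print(f"{sample}\t{frac:.3f}")
```

The three HOM_REF samples with ALT reads come out at roughly 6 to 9% ALT, which is more than I would expect from sequencing error alone, while the other HOM_REF samples have none.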

Some samples were sequenced in the same flowcell/lane.

How can I validate the hypothesis of cross-contamination?

It was suggested that I use verifyBamID but, as far as I understand, it needs another VCF called with another method (?)

I also tried GATK ContEst, but I have no idea what I'm doing...

    java -jar GenomeAnalysisTK.jar -T ContEst -I bam.list -R human_g1k_v37.fasta -o out.metrics --genotypes my.vcf.gz -pf 1000G_phase1.snps.high_confidence.b37.vcf --min_genotype_depth 20 -L 22


INFO  10:17:00,850 ContEst - Total sites:  31803838 
INFO  10:17:00,860 ContEst - Population informed sites:  310728 
INFO  10:17:00,861 ContEst - Non homozygous variant sites: 310728 
INFO  10:17:00,861 ContEst - Homozygous variant sites: 0 
INFO  10:17:00,861 ContEst - Passed coverage: 0 
INFO  10:17:00,861 ContEst - Results: 0

Any suggestions?


It was also suggested that I look at rare variants: they should not be shared between unrelated samples.
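A rough sketch of how that check could look, assuming the genotypes and AD counts have already been extracted from the VCF into plain Python dicts (the function name, the data layout and the `min_alt_reads` threshold are all hypothetical, and only biallelic sites are handled):

```python
from collections import Counter


def private_variant_leakage(sites, min_alt_reads=2):
    """Count ALT reads leaking from the single carrier of a variant.

    sites: list of dicts mapping sample -> (GT, ref_reads, alt_reads).
    Returns (number of private sites checked,
             Counter keyed by (carrier, other_sample)).
    """
    leaks = Counter()
    checked = 0
    for site in sites:
        # Samples whose called genotype contains the ALT allele.
        carriers = [s for s, (gt, _, _) in site.items() if "1" in gt]
        if len(carriers) != 1:
            continue  # variant is not private to one sample
        checked += 1
        for sample, (_gt, _ref, alt_reads) in site.items():
            if sample != carriers[0] and alt_reads >= min_alt_reads:
                leaks[(carriers[0], sample)] += 1
    return checked, leaks


# One site built from three rows of the table in the question.
example_sites = [{
    "28D0609": ("0/0", 206, 15),
    "13D0720": ("0/1", 170, 59),
    "57D1188": ("0/0", 85, 0),
}]
checked, leaks = private_variant_leakage(example_sites)
print(checked, dict(leaks))
```

If one (carrier, other sample) pair keeps showing up across many private sites, that pair would be the first place to look for contamination.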


Ideally, if the original samples are available, independent SNP genotyping would be the way to verify the identity of the samples.


verifyBamID does need a VCF, but it is a population reference VCF (e.g. 1000 Genomes), not another call set from your own samples.

I've used it to detect contamination in a targeted panel with reasonable results; see my question on their user group page.

    reference_vcf=/media/sf_BigShare/SCID/180213_TSCA_r1_sop_test/work/reference/180124-1000G_phase1.snps.high_confidence.hg19.intersected_w_scid.vcf
    ./verifyBamID --vcf "$reference_vcf" --bam "$bam" --out "$out" --maxDepth 1000 --precise --ignoreRG



5.8 years ago
igor 13k

A really nice method is GATK CalculateContamination, which gives you a numeric contamination estimate (the fraction of reads coming from cross-sample contamination). It works if you have WGS/WES data (to provide sufficient coverage over enough SNPs). They provide a reference VCF for the human genome. The VCF needs to be in a specific format, so it can be tricky to generate for other species, especially since population allele frequencies may not be known.
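For reference, the GATK4 workflow is two steps: GetPileupSummaries on the BAM, then CalculateContamination on the resulting table. Below is a small sketch that only builds and prints the two command lines. The file names are placeholders, and the flags are written as I remember them from the GATK4 docs, so check `gatk GetPileupSummaries --help` before relying on them.

```python
import shlex


def contamination_commands(bam, common_variants_vcf, prefix):
    """Build the two GATK4 command lines for one sample.

    common_variants_vcf is assumed to be a VCF of common biallelic
    SNPs with population allele frequencies (placeholder name below).
    """
    pileups = f"{prefix}.pileups.table"
    get_pileups = [
        "gatk", "GetPileupSummaries",
        "-I", bam,
        "-V", common_variants_vcf,
        "-L", common_variants_vcf,  # restrict to the same sites
        "-O", pileups,
    ]
    calculate = [
        "gatk", "CalculateContamination",
        "-I", pileups,
        "-O", f"{prefix}.contamination.table",
    ]
    return [get_pileups, calculate]


cmds = contamination_commands("sample.bam", "common_biallelic.vcf.gz", "sample")
for cmd in cmds:
    print(shlex.join(cmd))
```

To actually run them, each list can be passed to `subprocess.run(cmd, check=True)`, one sample at a time.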

I've been using Bamkin, which is fairly simple and crude, but seems to work sufficiently well to detect sample mixups. I processed hundreds of samples and it's always been clear when some of them are problematic, at least when you have multiple samples from supposedly the same individual. The nice thing is it will work with smaller targeted panels or ChIP-seq or RNA-seq. You can also tell if the contamination is coming from other samples in the same batch of samples.


I was going to suggest CalculateContamination. For more detail about the method see also section VI of mutect.pdf.


I didn't realize there is a manual for CalculateContamination. That's helpful.


I ran into an error while getting my ExAC variant VCF ready for GATK4 CalculateContamination: "Key AC_Adj0_Filter found in VariantContext field FILTER at chr"

If someone is using ExAC VCFs for common variants, check Sheila's last post at: https://gatkforums.broadinstitute.org/gatk/discussion/8181/gatk-selectvariants-on-vcf

The program works fine: I spiked 2.6% of reads from another sample into my FASTQ and GATK detected 3.5% contamination. Thanks for mentioning BamKin, it looks really straightforward!

5.8 years ago

In my limited experience, an easy way to investigate cross-contamination is to look for unexpected deviations in the proportion of reads aligning to the sex chromosomes. This only works if the contaminating sample comes from the opposite sex.
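One way to get those proportions is from `samtools idxstats`, which prints one tab-separated line per contig (name, length, mapped reads, unmapped reads). A minimal sketch that parses that output follows; the function name is hypothetical, the contig names assume a GRCh37-style reference ("X", "Y"), and the read counts in the example are invented.

```python
def sex_chrom_fractions(idxstats_text, x_name="X", y_name="Y"):
    """Return (X fraction, Y fraction) of all mapped reads."""
    mapped = {}
    for line in idxstats_text.strip().splitlines():
        name, _length, n_mapped, _n_unmapped = line.split("\t")
        mapped[name] = int(n_mapped)
    total = sum(mapped.values())
    if total == 0:
        return 0.0, 0.0
    return mapped.get(x_name, 0) / total, mapped.get(y_name, 0) / total


# Invented idxstats output, for illustration only.
example = (
    "1\t249250621\t900000\t0\n"
    "X\t155270560\t95000\t0\n"
    "Y\t59373566\t5000\t0\n"
    "*\t0\t0\t1200\n"
)
x_frac, y_frac = sex_chrom_fractions(example)
print(f"X={x_frac:.4f} Y={y_frac:.4f}")
```

A supposedly female sample with a clearly non-zero Y fraction, compared with the other females in the batch, would be a candidate for contamination by a male sample.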
