Question

Mismatch between exome capture design and mapping reference assembly in TCGA WXS data?


7.8 years ago

subhajit06 ▴ 110

Hi all,

I have one question.

We want to call copy number for some TCGA WXS samples, but we noticed that the exome capture kit (target BED file) that was used was based on hg18 (more specifically, it was the hg18 NimbleGen exome version 2). We verified that the coordinates in this BED file are indeed from hg18.

Now, from the BAM file header, it seems the reads were mapped against GRCh37-lite, which is a version of hg19.

Most exome CNV callers need the capture target BED file as one of their inputs, but in this case it is not consistent with the reference genome version used for mapping. I think that would be a problem.

So what would be the best way to call copy number in this situation? Should we lift over the coordinates in the target BED file to GRCh37-lite?

thanks,

--subhajit

TCGA Whole Exome data CNV call exome capture file • 3.2k views

updated 7.8 years ago by Cyriac Kandoth 6.0k • written 7.8 years ago by subhajit06 ▴ 110

score 1 · Answer 1 · 2016-06-26

Your proposed solution is appropriate - use the GRCh37 BAM files, and a corresponding GRCh37 target BED file. GRCh38 would be even better, but let's not get too carried away. ;)

At least for important cancer genes, the exon sequences haven't changed much between hg18 and hg19. Moreover, you are likely to find more differences between an actual tumor sample's exome and the hg18 exome than between the hg18 and hg19 exomes. So the hybridization efficiency of the capture kit is not severely affected by its having been designed on the hg18 exome. But re-aligning those captured reads to GRCh37 (hg19) brings them closer to the haplotypes of the average human than hg18 does.

Note that hg19 is a version of GRCh37. Here's the quick history: "Human dna reference file with no prefix 'chr'"

"hg18 nimblegen exome version 2" sounds like what WashU used for TCGA LAML, OV, BRCA, and UCEC. Their official hg18 BED file lives here, but you can fetch Nimblegen's official hg19 BED and GFF from here. Their file named SeqCap_EZ_Exome_v2.bed will contain 2 tracks named "Target Regions" and "NimbleGen Tiled Regions". For copy-number calling, you'll need only the "Target Regions", and also remove regions in chrX, chrY, chrM, and any unaligned contigs... because CN calling on those is tricky. Plus, removing those makes it easier to convert hg19 to GRCh37.

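The BED clean-up described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the function name `extract_autosomal_targets` is made up here, and it assumes each track in the file starts with a `track` header line that contains the track name, followed by tab-separated BED records. Check both against the file you actually download.

```python
# Autosomes only: chrX, chrY, chrM and unplaced/random contigs are dropped.
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def extract_autosomal_targets(bed_lines, track_name="Target Regions"):
    """Keep autosomal records from one named track of a multi-track BED.

    Assumes (not verified against the vendor file) that every track begins
    with a 'track ...' header line containing its name.
    """
    in_wanted_track = False
    kept = []
    for raw in bed_lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith(("#", "browser")):
            continue
        if line.startswith("track"):
            # Switch on only while inside the requested track.
            in_wanted_track = track_name in line
            continue
        if not in_wanted_track:
            continue
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] not in AUTOSOMES:
            continue
        # For autosomes, hg19 and GRCh37 differ only in naming: chr1 -> 1.
        fields[0] = fields[0][len("chr"):]
        kept.append("\t".join(fields))
    return kept
```

To use it, read the lines of SeqCap_EZ_Exome_v2.bed, pass them in, and write the returned records to a new BED file for the CNV caller. Note that a file with no `track` lines yields nothing in this sketch.
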
score 0 · Answer 2 · 2016-06-24

Yes, the fact that your targets were designed against hg18 might be a problem, so you could also try mapping the reads against hg18 just for comparison.

You could also try to lift over the exome targets and stick to their new positions in the hg19 assembly.

You should stick to the alignment that has the biggest number of reads aligned against the reference genome (hg18 or hg19). My bet is that you will have a better alignment against hg18.
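
One way to make that comparison concrete is to run `samtools flagstat` on each alignment and compare the mapped fractions. The Python sketch below only parses the text that flagstat prints. The helper name `mapped_fraction` is hypothetical, and the parsing assumes the usual `N + M description` line layout with an "in total" line and a "mapped (" line. Keep in mind that flagstat totals also count secondary and supplementary alignments.

```python
import re


def mapped_fraction(flagstat_text):
    """Return mapped / total QC-passed reads from `samtools flagstat` text.

    Assumes lines of the form '<passed> + <failed> <description>'.
    """
    total = mapped = None
    for line in flagstat_text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.*)", line.strip())
        if not match:
            continue
        passed, description = int(match.group(1)), match.group(3)
        if description.startswith("in total"):
            total = passed
        elif description.startswith("mapped ("):
            # 'primary mapped (' lines do not match this prefix.
            mapped = passed
    if not total or mapped is None:
        raise ValueError("could not find total and mapped counts")
    return mapped / total


def better_reference(flagstat_by_reference):
    """Pick the reference name whose alignment has the highest mapped fraction."""
    return max(flagstat_by_reference,
               key=lambda name: mapped_fraction(flagstat_by_reference[name]))
```

For example, `better_reference({"hg18": hg18_text, "hg19": hg19_text})` returns the name of the reference with the larger mapped fraction, where the two strings hold the flagstat output captured for each BAM.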