Question

Regions of the genome that may include CNVs

1

8.8 years ago

alesssia ▴ 580

Dear all,

I am preparing a list of regions of the genome that are likely to include CNVs. To do so, I am excluding assembly gaps, regions with poor mappability, and repeat regions as reported in UCSC. I know from the literature that regions near centromeres/telomeres, and regions with very low or very high GC content, are also worth excluding. My questions are: a) what does "near" a centromere/telomere mean? b) what are meaningful thresholds for GC content? and c) is there any other feature I should be aware of?

Thank you very much!

UCSC CNV • 2.5k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.8 years ago by alesssia ▴ 580

score 0 · Answer 1 · 2015-06-17

0

8.8 years ago

mehran.karimzade ▴ 220

If you did not align against a repeat-masked genome, you may want to exclude repeat regions as well (available through the UCSC tracks). Doing so, together with excluding gaps, will automatically take care of the centromere/telomere regions.
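As a rough illustration of the exclusion step, here is a minimal sketch that subtracts gap and repeat intervals from one chromosome. It assumes half-open, 0-based (start, end) coordinates as in BED files; the function names and the toy coordinates are made up for the example and do not come from any UCSC tool.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(chrom_length, excluded):
    """Return the parts of [0, chrom_length) not covered by `excluded`."""
    kept = []
    cursor = 0
    for start, end in merge_intervals(excluded):
        # Clamp to the chromosome so stray coordinates cannot leak out.
        start = min(max(start, 0), chrom_length)
        end = min(max(end, 0), chrom_length)
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        kept.append((cursor, chrom_length))
    return kept


# Toy example (hypothetical coordinates on a 100 bp "chromosome").
gaps = [(0, 10), (50, 60)]
repeats = [(8, 20), (90, 100)]
usable = subtract_intervals(100, gaps + repeats)
print(usable)
```

In practice the same operation is usually done per chromosome on the downloaded gap and repeat tables, but the interval logic is the same.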

Most CNV callers handle GC content automatically by normalizing read counts against it, so you don't need to filter out high-GC regions of the genome.
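For anyone who wants to inspect the quantity being normalized, a minimal sketch of computing the GC fraction in fixed-size windows follows. The window size and the helper names are assumptions chosen for illustration, and no particular GC threshold is implied.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases; None if there are none."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        # Window is entirely N (or other ambiguity codes).
        return None
    return (seq.count("G") + seq.count("C")) / counted


def gc_by_window(seq, window):
    """Return (start, end, gc_fraction) for consecutive windows of `seq`."""
    return [
        (start, min(start + window, len(seq)), gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq), window)
    ]


# Toy sequence; real input would be a chromosome read from a FASTA file.
windows = gc_by_window("GGCCATATNNNN", 4)
print(windows)
```

Plotting the distribution of these per-window values is one way to choose cutoffs from the data rather than from a fixed rule.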

ADD COMMENT • link 8.8 years ago by mehran.karimzade ▴ 220

0

I am not selecting regions to run a CNV caller, but to create a null distribution for a set of statistical analyses. Therefore, knowing how to deal with GC content is important.

Thanks for the answer about the telomeres/centromeres. I am indeed removing all the gaps (http://genome.ucsc.edu/cgi-bin/hgTrackUi?&c=chr17&g=gap) as well as the repeat regions. Should I also remove the "Regions of Exceptionally High Depth of Aligned Short Reads" (http://genome.ucsc.edu/cgi-bin/hgTrackUi?&c=chr17&g=hiSeqDepth)? What is a good threshold in this case?

ADD REPLY • link 8.8 years ago by alesssia ▴ 580

0

Most of the regions with exceptionally high depth of aligned short reads should already be excluded once you remove the RepeatMasker and gap regions. For those that are not, the coverage will depend on your library size. The best approach would be to plot several coverage histograms and decide on a cutoff based on them.

However, I would not worry much about those regions for the purpose of copy number variation calling, because they would have similarly high coverage in all your cases and controls.

If you insist on removing those regions, do not do it based on the definitions in the UCSC track; use a coverage filter derived from your own data instead.
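One possible way to derive such a filter, sketched under assumptions: take the per-window read depth from your own data and flag windows far above the median. The median plus a multiple of the median absolute deviation (MAD) used here, and the multiplier of 5, are illustrative choices only; the depth values are invented.

```python
import statistics


def high_depth_cutoff(depths, n_mads=5.0):
    """Cutoff = median + n_mads * MAD of the per-window depths."""
    med = statistics.median(depths)
    mad = statistics.median([abs(d - med) for d in depths])
    return med + n_mads * mad


def flag_high_depth(depths, n_mads=5.0):
    """Return indices of windows whose depth exceeds the cutoff."""
    cutoff = high_depth_cutoff(depths, n_mads)
    return [i for i, d in enumerate(depths) if d > cutoff]


# Hypothetical mean depth per window; window 6 is a pile-up region.
depths = [30, 28, 32, 31, 29, 30, 400, 27]
cutoff = high_depth_cutoff(depths)
flagged = flag_high_depth(depths)
print(cutoff, flagged)
```

Looking at a histogram of the depths first, as suggested above, helps check whether a cutoff of this kind separates the pile-up regions cleanly.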

ADD REPLY • link 8.8 years ago by mehran.karimzade ▴ 220

0

I am not doing copy number variation calling: I am just selecting genome regions to create a null distribution for a set of statistical analyses. That is, I have the whole genome (as in UCSC), but I need to extract only those regions that may include a CNV, so as not to bias the analyses in my favour. But, yeah, I get your point: regions with exceptionally high depth of aligned short reads are specific to the UCSC track (or to a particular experiment), not a general problem!

Thanks.

ADD REPLY • link 8.8 years ago by alesssia ▴ 580