Question

Problems with exome coverage at HRNR gene

6.5 years ago

ognjen011 ▴ 250

Hi!

After some exploratory analysis of whole exome sequencing experiments and the coverage of particular regions, I ran into some unusual but consistent places where coverage is low or practically non-existent even though they overlap a bait region. Most of them can be explained by extreme GC content, but some are problematic for reasons unclear to me. Especially puzzling is the HRNR gene, which has a zero-coverage stretch at 152190000-152190100 (b37 build) in almost every BAM file I visualized. This includes both Agilent SureSelect v5 and SeqCap 3.0 capture kits, each of which has a different bait region over it. GC content is around 60% in that stretch, and although the surrounding regions contain many multi-mapping reads, I do not see the reason for this particular hole.
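A hole like this can be pinned down from per-base depth rather than by eye. Below is a minimal sketch, assuming tab-separated `chrom`, `pos`, `depth` lines such as those written by `samtools depth -a` (which also reports zero-depth positions). The function name `low_coverage_runs` and the depth values in the example are made up for illustration.

```python
from typing import Iterable, List, Tuple


def low_coverage_runs(
    depth_lines: Iterable[str], max_depth: int = 0
) -> List[Tuple[str, int, int]]:
    """Return (chrom, start, end) runs, 1-based inclusive, with depth <= max_depth."""
    runs = []
    current = None  # [chrom, start, end] of the run currently being extended
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        chrom, pos, depth = line.split("\t")[:3]
        pos, depth = int(pos), int(depth)
        if depth <= max_depth:
            # Extend the open run only if this position is directly adjacent to it.
            if current and current[0] == chrom and current[2] + 1 == pos:
                current[2] = pos
            else:
                if current:
                    runs.append(tuple(current))
                current = [chrom, pos, pos]
        elif current:
            # A covered position closes the open run.
            runs.append(tuple(current))
            current = None
    if current:
        runs.append(tuple(current))
    return runs


# Toy depth lines (invented values, not real data from any sample).
example = [
    "1\t152189999\t35",
    "1\t152190000\t0",
    "1\t152190001\t0",
    "1\t152190002\t0",
    "1\t152190003\t12",
]
print(low_coverage_runs(example))  # [('1', 152190000, 152190002)]
```

Running the same function over several BAMs and intersecting the reported runs would show whether the hole has the same boundaries in every sample, which is what one would expect from a systematic cause rather than random dropout.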

Any thoughts on the cause would be appreciated.

Thanks in advance, people!

sequence alignment wes hybrid capture • 1.3k views

ADD COMMENT • link updated 6.5 years ago by Kevin Blighe 87k • written 6.5 years ago by ognjen011 ▴ 250

score 3 · Answer 1 · 2017-10-10

6.5 years ago

Kevin Blighe 87k

Hey, my first thought was that the region may share sequence similarity with, or even have a paralogue, somewhere else in the human genome. So, I obtained the sequence for your region here (subtracting 1 base due to co-ordinate issues at the UCSC) and then ran nucleotide BLAST against human transcript sequences. Sure enough, the region exhibits 83% similarity with an exon of a nearby gene, FLG2. Looking at these on the UCSC browser, I think it's possible that they did indeed arise from a gene duplication event - who knows. These phenomena are way more common than people think.
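The "subtracting 1 base" step presumably reflects the difference between 1-based inclusive coordinates (as typed into a genome browser) and the 0-based, half-open convention used by BED-style tools. A small sketch of that conversion follows; the helper names are hypothetical, and the `chr1` label is an assumption based on HRNR lying on chromosome 1, since the original post gives only positions.

```python
from typing import Tuple


def one_based_to_bed(chrom: str, start: int, end: int) -> Tuple[str, int, int]:
    """Convert a 1-based inclusive interval to 0-based half-open (BED-style)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the end is unchanged because it becomes exclusive.
    return chrom, start - 1, end


def bed_to_one_based(chrom: str, start: int, end: int) -> Tuple[str, int, int]:
    """Convert a 0-based half-open interval back to 1-based inclusive."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return chrom, start + 1, end


# The region from the question, written with an assumed "chr1" prefix.
print(one_based_to_bed("chr1", 152190000, 152190100))
# ('chr1', 152189999, 152190100)
```

Either way the interval covers the same 101 bases; only the way the start is written changes.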

This may explain the absent coverage in some of your samples, and also why other samples do exhibit coverage. I have been working around these issues of sequence similarity in NGS panels for quite a few years now.

ADD COMMENT • link 6.5 years ago by Kevin Blighe 87k

Hello Kevin!

Thank you for the comment, it is very insightful, and thanks for the direct links that illustrate the problem. I did actually BLAST the sequence, but didn't focus on it because I assumed that the process goes both ways - that such homologous regions exchange reads rather than simply steal them. It is clear now that I was wrong :)

I would also say that there could be more contributing factors - I noticed that whole genome sequencing samples do not exhibit this drop, which was tested on both real and simulated data. Could it be that hybrid capture is somehow problematic at this region?

ADD REPLY • link 6.5 years ago by ognjen011 ▴ 250

Hey, yes, by all means, this is not going to be the sole problem. This 'read robbing' by regions of sequence similarity does indeed work both ways, and I would have expected at least a few reads mapping to HRNR in all of your samples. I am certain that it is part of the problem, though.
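One rough way to gauge how much 'read robbing' is going on is to look at mapping quality around the region: many aligners assign very low MAPQ to reads that fit more than one location equally well, although the exact values are aligner-dependent, so treat the threshold here as an assumption. The sketch below works on plain SAM text lines (e.g. from `samtools view`); the function names and the toy reads are invented for illustration.

```python
import re
from typing import Iterable, Tuple

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_end(pos: int, cigar: str) -> int:
    """Last reference position (1-based) covered by an alignment starting at pos."""
    span = sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in _REF_CONSUMING)
    return pos + span - 1


def mapq_summary(
    sam_lines: Iterable[str], chrom: str, start: int, end: int, low_mapq: int = 1
) -> Tuple[int, int]:
    """Count (overlapping reads, overlapping reads with MAPQ < low_mapq)."""
    total = low = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        mapq, cigar = int(fields[4]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":
            continue  # unmapped, or on another contig
        if pos <= end and reference_end(pos, cigar) >= start:
            total += 1
            if mapq < low_mapq:
                low += 1
    return total, low


# Toy SAM records (invented), using the region from the question on contig "1".
toy = [
    "@HD\tVN:1.6",
    "r1\t0\t1\t152189950\t0\t100M\t*\t0\t0\t*\t*",
    "r2\t0\t1\t152190050\t60\t100M\t*\t0\t0\t*\t*",
    "r3\t0\t1\t152189800\t0\t100M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(mapq_summary(toy, "1", 152190000, 152190100))  # (2, 1)
```

Running the same count over the homologous FLG2 exon would show whether reads pile up there at low MAPQ while HRNR stays empty.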

As you've alluded to, certain regions of the genome are just very difficult to sequence, for whatever reason. I think that we are still trying to understand why. One colleague of mine at the University of Leicester in the UK has been researching thermodynamically ultra-fastened (TUF) regions, which are essentially those that resist denaturation at temperatures >100 Celsius and therefore remain inaccessible to the primers used during the sequencing run (but you may get some denaturation). Here is some of that work, if interested:

Other than that, things that I think about are folding of the DNA molecule, i.e., if a region of DNA is buried in the chromatin, then it may remain somewhat inaccessible (although many genes are obviously near the surface; otherwise, they would not be accessible to TFs and transcribed in the nucleus). However, I am convinced that, for example, certain SNPs can shape the chromatin landscape and thus make certain genomic regions less accessible in certain individuals compared to others. This is all just hypothesis, but it is based on previous literature.

One thing that people are now beginning to do more and more is ATAC-seq, which essentially checks chromatin accessibility at a region of interest. You may wish to consider that, funding permitted.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

Another thing, when I look at that region on the UCSC, is that an antisense transcript called FLG-AS1 runs past both FLG2 and HRNR. I cannot think how, or if, this could affect the hybridisation of your probes, but it may help to better explain what's going on at this particular locus.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

Interesting indeed. I cannot get any info on it. The wiki page lists the RNA as a protein: https://en.wikipedia.org/wiki/FLG-AS1 :)

It is a 300 kb long gene. Could its transcript somehow interfere with capture?

ADD REPLY • link 6.5 years ago by ognjen011 ▴ 250

I think that the Wikipedia article is mistaken in calling it a protein. As far as I am aware, antisense transcripts are not translated into proteins, because they simply don't have the correct sequences to initiate the process!

An antisense transcript is more likely to affect transcription of its corresponding sense-strand transcripts. If they are transcribed at the same time, for example, the antisense RNA can bind to the sense mRNA and block translation. The presence of the transcriptional machinery on both the sense and antisense strands at the same time can also prevent transcription itself because the molecules literally block each other.

The other possibilities that come to mind are that the hybridisation probes in your kit may have degraded (they can degrade in a biased fashion), and/or that your DNA itself was slightly degraded, which would only show up on gel electrophoresis and not necessarily from concentration or A260/A280 readings.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k