Question

Whole Transcriptome Analysis - Two Very Basic Questions For Differential Expression.

5

12.1 years ago

Ngsnewbie ▴ 380

In transcriptome analysis, some genes are expressed more and therefore generate more cDNA, while others are expressed less and generate less cDNA. These cDNAs are fragmented and then sequenced on various NGS platforms. We estimate transcript abundance by assembling these fragments (now reads) and mapping them back onto the genome (or transcripts). I have two very basic questions that often come to mind:

My first question: what are the chances that every sheared fragment gets sequenced? Suppose some fragments corresponding to one particular gene were not sequenced in one experimental condition but were sequenced in the second condition under study; then we would get a false differential expression result.

My second question: a read may align to two different positions on the genome, because of homologous regions or gene duplication. Some people suggest removing these multi-mapped reads, or dividing each multi-mapped read among all the positions it maps to. But that will also distort the measured differential expression of these loci. Could you shed some light on this?

transcriptome differential cufflinks next-gen sequencing • 4.0k views

updated 8.5 years ago by Biostar 20 • written 12.1 years ago by Ngsnewbie ▴ 380

score 6 · Answer 1 · 2012-03-15

First Question: Not every fragment will be sequenced. You are not even sure if every fragment is in the library. Depending on how you fragmented your sample and how you size selected, your library might not even be representative of your transcript composition.

But just as with any large-scale experiment, you have to trust/assume that every step was homogeneous and that you are drawing an unbiased sample from your RNA population. You can try to reduce variables by depleting ribosomal RNA and enriching for poly-A tails.

Second Question: Duplicated regions or common domains can make your read counts falsely higher or lower, depending on whether you choose to discard or divide the ambiguous reads. If the reads that map to these ambiguous regions make up a large proportion of the total read count for your gene, then I would not trust that count.
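As a rough illustration of that rule of thumb, here is a minimal Python sketch that computes the share of a gene's reads that are ambiguous and flags the gene when the share is large. The function names and the 0.5 cutoff are hypothetical choices for illustration only; pick a threshold that suits your own data.

```python
def ambiguous_fraction(unique_reads: int, multi_reads: int) -> float:
    """Fraction of a gene's reads that are multi-mapped (ambiguous)."""
    total = unique_reads + multi_reads
    if total == 0:
        return 0.0  # no reads at all: nothing is ambiguous
    return multi_reads / total


def count_is_trustworthy(unique_reads: int, multi_reads: int,
                         max_fraction: float = 0.5) -> bool:
    """True if ambiguous reads stay at or below max_fraction of the total.

    max_fraction is an arbitrary illustrative cutoff, not a standard value.
    """
    return ambiguous_fraction(unique_reads, multi_reads) <= max_fraction


# Example: 80 uniquely mapped reads and 20 multi-mapped reads
print(ambiguous_fraction(80, 20))
print(count_is_trustworthy(80, 20))
print(count_is_trustworthy(10, 90))
```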

This is a complicated issue in my opinion, and it depends on whether you are willing to make a couple of assumptions. If you assume sequenced reads are evenly distributed along the transcript (which we can't really do in practice), then theoretically you do not need to worry about whether the transcript is full length or not. With that assumption in mind, you could look only at the unique regions of the transcript and use the reads mapped to those regions for your expression level. The expression level for 10k reads evenly distributed across 100 bp is the same as the expression level for 5k reads evenly distributed across 50 bp: both are 100 reads per bp.
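Under the even-coverage assumption, comparing expression over unique regions comes down to comparing reads per base. A minimal Python sketch, where `read_density` is a hypothetical helper name and the numbers are made up for illustration:

```python
def read_density(read_count: int, region_length_bp: int) -> float:
    """Reads per base pair over a region (assumes even coverage)."""
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    return read_count / region_length_bp


# Two unique regions of different length with the same expression level
full_region = read_density(10_000, 100)   # 10k reads over 100 bp
half_region = read_density(5_000, 50)     # 5k reads over 50 bp
print(full_region, half_region)
```

Both calls give 100 reads per bp, so halving the usable region halves the expected read count without changing the estimated expression level.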

Paired-end sequencing can alleviate this issue, because a mate that falls in a unique region can anchor its ambiguous partner.

score 2 · Answer 2 · 2012-03-15

1. If the transcript's expression level is not too low, meaning there is enough depth, it is almost impossible to miss it. But there are always some extremely low-abundance transcripts, so yes, we do get false negatives. In a transcriptome study, though, I believe these are only a small proportion and will not do much harm to the conclusions.
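To put a rough number on "enough depth", here is a minimal Python sketch of the chance that a transcript contributes no reads at all. It assumes each read is drawn independently and that a fixed fraction of the fragments in the library come from the transcript. That is a simplification which ignores library-prep bias, and `prob_no_reads` is a hypothetical helper name.

```python
import math


def prob_no_reads(total_reads: int, transcript_fraction: float) -> float:
    """P(zero reads) = (1 - f)^N under independent sampling.

    total_reads: sequencing depth N.
    transcript_fraction: assumed share f of library fragments
    that come from the transcript of interest.
    """
    if not 0.0 <= transcript_fraction <= 1.0:
        raise ValueError("transcript_fraction must be between 0 and 1")
    if transcript_fraction == 1.0:
        return 0.0
    # log1p keeps precision when f is tiny and N is large
    return math.exp(total_reads * math.log1p(-transcript_fraction))


depth = 20_000_000  # illustrative depth, not taken from any real run
print(prob_no_reads(depth, 1e-6))  # about 20 reads expected
print(prob_no_reads(depth, 1e-8))  # about 0.2 reads expected
```

With about 20 expected reads, missing the transcript entirely is very unlikely; with about 0.2 expected reads, it is missed most of the time, which is the false-negative case described above.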

2. If you map reads directly to the genome, some reads can be ambiguous. You can either discard them, or divide them among the transcripts that share the position and later adjust the ratio based on the different transcript levels. Long reads can improve this situation. Assembling the reads before the mapping step can be helpful too.
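One way to read "divide and adjust the ratio" is to split a group of ambiguous reads in proportion to the uniquely mapped reads at each candidate locus. The Python sketch below is only an illustration of that idea under my own assumptions (one group of reads sharing the same candidate loci, hypothetical locus names and counts); it is not the method of any particular tool.

```python
from typing import Dict


def allocate_multireads(unique_counts: Dict[str, float],
                        multi_reads: float) -> Dict[str, float]:
    """Split multi-mapped reads across loci in proportion to unique counts.

    unique_counts: uniquely mapped read count per candidate locus.
    multi_reads: number of reads that map to all of these loci.
    Falls back to an equal split when no locus has unique reads.
    """
    if not unique_counts:
        return {}
    total_unique = sum(unique_counts.values())
    if total_unique == 0:
        equal_share = multi_reads / len(unique_counts)
        return {locus: equal_share for locus in unique_counts}
    return {locus: multi_reads * count / total_unique
            for locus, count in unique_counts.items()}


# Hypothetical paralogs: geneA has 3x the unique reads of geneB
shares = allocate_multireads({"geneA": 300, "geneB": 100}, 40)
print(shares)
```

Here geneA receives 30 of the 40 ambiguous reads and geneB receives 10, mirroring their 3:1 ratio of unique reads.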