Question

Transcriptome Assembly Questions

6.6 years ago

tawares07 • 0

Hi guys!

I have been using genome-guided transcriptome assembly to identify novel alternatively spliced transcripts in the human transcriptome. After running the mapping and assembly steps, I have some questions whose answers may help me improve my results or reduce some of the "noise" in them:

1) For mapping, do you use scaffolds and chrM or only chr1-chr22,chrX and chrY?

2) In my GENCODE GTF file I have annotations from both mRNAs and non-coding RNAs. Do you remove annotations from non-coding RNAs?

3) For transcriptome assembly (in my case, StringTie), what is the minimum coverage or depth to consider a transcriptome assembled?

I would be glad if you could share your experience and help me improve my research.

Best, Raphael

RNA-Seq Transcriptome assembly StringTie • 1.4k views

updated 6.6 years ago by Kevin Blighe 88k • written 6.6 years ago by tawares07 • 0

score 0 · Answer 1 · 2017-09-28

Hello Raphael, good afternoon (I speak Portuguese fluently)

1) For mapping, do you use scaffolds and chrM or only chr1-chr22,chrX and chrY?

If you have no intention of researching chrM, other scaffolds, or the sex chromosomes, then you can justify removing them - it depends on what your aims are. However, won't StringTie then try to assemble them anyway (if reads from these chromosomes are in your data)? It depends on the behaviour of StringTie when you use a genome-guided assembly.

I note that StringTie, if you supply a reference GTF file, will normalise counts over the GTF transcripts. This normalisation process will be influenced by the presence of a chrM, X, Y, etc., but only slightly. For raw coverage (raw counts), it makes no difference, as it would then be just counting reads over each position (and not normalising them).

Take a close look at the -x parameter of StringTie:

-x <seqid_list>
Ignore all read alignments (and thus do not attempt to perform transcript assembly) on the specified reference sequences. Parameter <seqid_list> can be a single reference sequence name (e.g. -x chrM) or a comma-delimited list of sequence names (e.g. -x 'chrM,chrX,chrY'). This can speed up StringTie especially in the case of excluding the mitochondrial genome, whose genes may have very high coverage in some cases, even though they may be of no interest for a particular RNA-Seq analysis. The reference sequence names are case sensitive, they must match identically the names of chromosomes/contigs of the target genome against which the RNA-Seq reads were aligned in the first place.

source: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
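If you do decide to restrict assembly to chr1-chr22, chrX and chrY, the value for -x can be built from the sequence names in your alignment rather than typed by hand. The following is only a minimal sketch: the function name stringtie_exclude_arg is made up for illustration, it assumes UCSC-style names (chr1, chrM, ...), and the names in the example list are placeholders for whatever appears in the @SQ lines of your own BAM header.

```python
import re

# Primary human chromosomes in UCSC-style naming: chr1..chr22, chrX, chrY.
# Assumption: your reference uses the "chr" prefix; adjust if it does not.
PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)$")


def stringtie_exclude_arg(seq_names, keep_mito=False):
    """Return a comma-delimited list of sequences to pass to StringTie's -x.

    Every name that is not a primary chromosome is excluded. chrM is
    excluded as well unless keep_mito is True.
    """
    excluded = []
    for name in seq_names:
        if PRIMARY.match(name):
            continue
        if keep_mito and name == "chrM":
            continue
        excluded.append(name)
    return ",".join(excluded)


# Placeholder names; in practice read them from the BAM header (@SQ lines).
names = ["chr1", "chr22", "chrX", "chrY", "chrM", "chrUn_GL000195v1"]
print(stringtie_exclude_arg(names))
```

Because the manual says the names are case sensitive and must match the reference exactly, taking them from the BAM header avoids typos.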

2) In my GENCODE GTF file I have annotations from both mRNAs and non-coding RNAs. Do you remove annotations from non-coding RNAs?

It is no problem keeping the ncRNAs. They are genes like every other gene; many small ncRNAs have a single exon, although long non-coding RNAs are often multi-exonic. Any good transcriptome assembler will be able to distinguish the boundary between one gene and another.
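If you want to see what your annotation actually contains before deciding, you can tally exon records per transcript, grouped by the transcript_type attribute that GENCODE GTF files carry. This is a rough sketch, not a full GTF parser: the function name exon_counts_by_type is made up, and the three records in the example are invented for illustration.

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR = re.compile(r'(\S+) "([^"]*)"')


def exon_counts_by_type(gtf_lines):
    """Map transcript_type -> {transcript_id: number of exon records}."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted
        attrs = dict(ATTR.findall(fields[8]))
        ttype = attrs.get("transcript_type", "unknown")
        tid = attrs.get("transcript_id", "unknown")
        counts[ttype][tid] += 1
    return counts


# Invented records, for illustration only.
example = [
    'chr1\tX\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; transcript_type "lncRNA";',
    'chr1\tX\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; transcript_type "lncRNA";',
    'chr2\tX\texon\t500\t570\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; transcript_type "snoRNA";',
]
for ttype, transcripts in exon_counts_by_type(example).items():
    print(ttype, dict(transcripts))
```

On a real file you would pass an open file handle instead of the example list.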

3) For transcriptome assembly (in my case, StringTie), what is the minimum coverage or depth to consider a transcriptome assembled?

Do you mean average coverage across the entire transcriptome or coverage over an individual transcript? Transcriptome assembly with TopHat/Cufflinks or StringTie is different from that of de novo assemblers like Velvet/Oases because you typically use a reference genome FASTA and GTF with TopHat/StringTie. The key StringTie parameters are -j (minimum junction coverage), -c (minimum read coverage for an assembled transcript) and -B (write Ballgown input tables).
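To make the thresholds explicit and reproducible, you could assemble the StringTie command in a small script. The sketch below only builds the argument list and does not run anything; the function name stringtie_command and the file names are hypothetical, and the threshold values are illustrative, not recommendations. Check the defaults for your StringTie version in its manual.

```python
def stringtie_command(bam, annotation_gtf, out_gtf,
                      min_junction_cov, min_cov, ballgown=False):
    """Build a StringTie argument list with explicit coverage thresholds.

    -G  reference annotation used to guide assembly
    -o  output GTF with the assembled transcripts
    -j  minimum junction coverage
    -c  minimum read coverage for an assembled transcript
    -B  write Ballgown input tables
    """
    cmd = [
        "stringtie", bam,
        "-G", annotation_gtf,
        "-o", out_gtf,
        "-j", str(min_junction_cov),
        "-c", str(min_cov),
    ]
    if ballgown:
        cmd.append("-B")
    return cmd


# Hypothetical file names and illustrative thresholds.
cmd = stringtie_command("sample.bam", "gencode.gtf", "sample.stringtie.gtf",
                        min_junction_cov=2, min_cov=5, ballgown=True)
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once you are happy with the settings.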

Good luck, man!

Best wishes, Kevin