Should I use a bad/ mediocre gene model as input for STAR RNA-seq alignment?
6.2 years ago
William ★ 5.2k

In the STAR RNA-seq aligner manual I read that a gene model should be used when indexing a reference genome before alignment. https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

--sjdbGTFfile specifies the path to the file with annotated transcripts in the standard GTF format. STAR will extract splice junctions from this file and use them to greatly improve accuracy of the mapping. While this is optional, and STAR can be run without annotations, using annotations is highly recommended whenever they are available. Starting from 2.4.1a, the annotations can also be included on the fly at the mapping step.

I would like to know whether this is still recommended when working with a non-model organism that has a bad or mediocre gene model.

Question 1:

Is there a risk that RNA-seq reads will be aligned to the wrong location because of mistakes in the gene model? How big is this risk?

One goal of the RNA-seq alignment is to improve / curate the gene model.

Question 2:

Do I need to rerun all my RNA-seq alignments (i.e. re-create the RNA-seq BAM files) every time I have slightly or substantially improved my gene model?

Or does 2-pass mapping mode already reduce the need for re-running after having upgraded the gene model?

For the most sensitive novel junction discovery, I would recommend running STAR in the 2-pass mode. It does not increase the number of detected novel junctions, but allows to detect more spliced reads mapping to novel junctions. The basic idea is to run 1st pass of STAR mapping with the usual parameters, then collect the junctions detected in the first pass, and use them as "annotated" junctions for the 2nd pass mapping.

Question 3a:

Should I always do multi-sample 2-pass mapping, even if the RNA-seq samples come from multiple different projects / experiments?

For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.
1. Run 1st mapping pass for all samples with "usual" parameters. Using annotations is recommended either at the genome generation step, or mapping step.
2. Run 2nd mapping pass for all samples, listing SJ.out.tab files from all samples in --sjdbFileChrStartEnd /path/to/sj1.tab /path/to/sj2.tab ....

Question 3b:

Does this mean that I should always re-align all RNA-seq samples from fastq after having received new RNA-seq samples? (because the new RNA-seq samples might cause a new splice junction to be considered for alignment of the existing RNA-seq samples?)

Thank you.

RNA-Seq star gene_model incremental • 2.2k views

How bad is the annotation you have? If it's just highly incomplete but the exons in it seem reasonable then it'd make sense to use it. If they themselves are questionable then definitely don't use it.


The gene model is mostly incomplete, I think, i.e. exons, transcripts and complete genes are missing. Also, 2 predicted genes might actually be 1 gene. The gene model is based on a 3rd party "push button analysis": ab initio + RNA-seq based gene prediction (i.e. no manual curation or consensus analysis of multiple prediction software). The RNA-seq data is from the same species. I am not sure how many "false positive" predicted exons, genes and transcripts there are.


In that case I'm in complete agreement with Santosh's answer.

6.2 years ago

Q1: The gene model is used only as a guide, not as an obligatory anchor. So, if you have a reasonable gene model, even a partial or incomplete one, you should use it.
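If it helps, here is a minimal Python sketch of how one might set up two indexes, one with and one without the draft annotation, to see what the gene model actually changes. It only builds the command lines and does not run STAR. The file names (`genome.fa`, `draft_annotation.gtf`), the directory names and the helper `star_index_command` are hypothetical; the STAR options themselves are the documented ones, and `--sjdbOverhang` is set to read length minus 1, assuming 100 bp reads.

```python
from typing import List, Optional


def star_index_command(genome_dir: str, fasta: str, gtf: Optional[str] = None,
                       read_length: int = 100, threads: int = 4) -> List[str]:
    """Build (but do not run) a STAR genome-generation command."""
    cmd = ["STAR", "--runMode", "genomeGenerate",
           "--genomeDir", genome_dir,
           "--genomeFastaFiles", fasta,
           "--runThreadN", str(threads)]
    if gtf is not None:
        # Annotated junctions are added to the index as a guide for mapping.
        cmd += ["--sjdbGTFfile", gtf, "--sjdbOverhang", str(read_length - 1)]
    return cmd


# Hypothetical paths, for illustration only.
with_annotation = star_index_command("index_gtf", "genome.fa", gtf="draft_annotation.gtf")
without_annotation = star_index_command("index_plain", "genome.fa")
print(" ".join(with_annotation))
print(" ".join(without_annotation))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.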

Q2: IMO, 2-pass mapping should be sufficient.
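For a single sample, STAR can do both passes in one run via `--twopassMode Basic`. A rough sketch of such a command, again only built and not executed; the index directory, FASTQ names, output prefix and the helper `star_two_pass_command` are assumptions for illustration:

```python
from typing import List, Sequence


def star_two_pass_command(genome_dir: str, fastqs: Sequence[str],
                          out_prefix: str, threads: int = 4) -> List[str]:
    """Build (but do not run) a per-sample 2-pass STAR mapping command."""
    return ["STAR",
            "--genomeDir", genome_dir,
            "--readFilesIn", *fastqs,
            # Junctions found in the 1st pass are inserted before the 2nd pass.
            "--twopassMode", "Basic",
            "--outSAMtype", "BAM", "SortedByCoordinate",
            "--outFileNamePrefix", out_prefix,
            "--runThreadN", str(threads)]


# Hypothetical paired-end sample with uncompressed FASTQ files.
two_pass = star_two_pass_command("index_gtf",
                                 ["sample1_R1.fastq", "sample1_R2.fastq"],
                                 "sample1_")
print(" ".join(two_pass))
```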

Q3a: For the same project, it is recommended, because this way you have a pool of putative junctions and evidence from more samples about their existence. This could help in determining which of them are true and which are false positives. I'd not mix junctions from different projects because they might confound each other, given that different projects can have different splicing events.
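To illustrate the "evidence from more samples" idea, here is a small sketch that counts in how many samples each 1st-pass junction was seen. It assumes the standard tab-separated `SJ.out.tab` layout (column 1 chromosome, 2 intron start, 3 intron end, 4 strand code, 7 uniquely mapping reads). The sample names, the demo rows and the cutoff of 2 samples are made up; this is not a filter recommended by STAR itself.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# chromosome, intron start, intron end, strand code as written by STAR
Junction = Tuple[str, int, int, str]


def junctions_from_sj_lines(lines: Iterable[str], min_unique: int = 1) -> Set[Junction]:
    """Collect junctions with at least `min_unique` uniquely mapping reads."""
    found: Set[Junction] = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed rows
        if int(fields[6]) >= min_unique:
            found.add((fields[0], int(fields[1]), int(fields[2]), fields[3]))
    return found


def sample_support(per_sample: Dict[str, Iterable[str]]) -> Counter:
    """Count the number of samples in which each junction is observed."""
    support: Counter = Counter()
    for lines in per_sample.values():
        support.update(junctions_from_sj_lines(lines))
    return support


# Made-up rows standing in for two samples' SJ.out.tab files.
demo = {
    "sampleA": ["chr1\t100\t200\t1\t1\t0\t5\t0\t30\n",
                "chr1\t300\t400\t1\t1\t0\t1\t0\t12\n"],
    "sampleB": ["chr1\t100\t200\t1\t1\t0\t8\t1\t40\n"],
}
support = sample_support(demo)
recurrent = sorted(j for j, n in support.items() if n >= 2)
print(recurrent)
```

With real data you would pass open file handles instead of the demo lists, e.g. `{"sampleA": open("sampleA_SJ.out.tab")}`.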

Q3b: Theoretically, yes. But it would be costly to re-run all samples again (this is what is called the n+1 problem). I am also guessing that it might not change the alignments much (so the cost vs. gain is unfavorable). However, since you are working on a non-model organism, why don't you just make a quick check for yourself? E.g. how much difference do you find between running n vs. n+1 samples?
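One possible way to do that quick check is to map the same sample twice (junctions pooled from n samples vs. from n+1 samples) and compare the two resulting `SJ.out.tab` files. The sketch below assumes the standard column layout (column 7 = uniquely mapping reads); the demo rows and the helper names are hypothetical, and comparing junction tables is only one of several reasonable ways to measure the difference.

```python
from typing import Dict, Iterable, Tuple

Junction = Tuple[str, int, int]  # chromosome, intron start, intron end


def unique_read_counts(lines: Iterable[str]) -> Dict[Junction, int]:
    """Map each junction to its number of uniquely mapping reads."""
    counts: Dict[Junction, int] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed rows
        counts[(fields[0], int(fields[1]), int(fields[2]))] = int(fields[6])
    return counts


def compare_runs(run_n: Iterable[str], run_n_plus_1: Iterable[str]) -> dict:
    """Report junctions gained, lost, or with changed read support."""
    before = unique_read_counts(run_n)
    after = unique_read_counts(run_n_plus_1)
    shared = before.keys() & after.keys()
    return {
        "gained": sorted(after.keys() - before.keys()),
        "lost": sorted(before.keys() - after.keys()),
        "changed": {j: after[j] - before[j] for j in shared if after[j] != before[j]},
    }


# Made-up rows for the same sample mapped with n and with n+1 pooled samples.
run_n = ["chr1\t100\t200\t1\t1\t1\t10\t0\t30\n",
         "chr2\t500\t900\t2\t2\t0\t3\t0\t20\n"]
run_n_plus_1 = ["chr1\t100\t200\t1\t1\t1\t12\t0\t30\n",
                "chr2\t500\t900\t2\t2\t0\t3\t0\t20\n",
                "chr3\t50\t150\t1\t1\t0\t2\t0\t15\n"]
diff = compare_runs(run_n, run_n_plus_1)
print(diff)
```

If "gained", "lost" and "changed" stay small relative to the total number of junctions, re-aligning everything for each new sample is probably not worth the compute.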
