Question

How to map RNA-seq reads of a species to a reference genome of another close species ?

1

Entering edit mode

7.7 years ago

Farbod ★ 3.4k

Hi Dear Friends, ( I'm not native in English so, be ready for some possible language flaws)

I have 6 paired end 100 bp fastq files (3 for female and 3 for males, paired-end reads = 12 files) which is from Illumina HiSeq RNA_seq process of a non-model fish. I have run some de novo transcriptome assembly using Trinity package.

Now I want to do some genome-guided investigation on the same data to check for the annotation of transcripts that did not show any blast hit in de novo assembly.

So, I have downloaded a close species reference genome from NCBI.

Now I want to map my reads to that genome.

Which splice aware aligner is preferred? STAR or BBmap?
In the old "bowtie --> tophat --> cufflink" pipeline, it must be some "bowtiebuild" preparing the reference genome file, is it needed yet?
Can I use all my 12 files in just one command? or I must use sample1-lef - sample1-right mapping first and then repeat it for other 5 files?

Please share any useful script for aligning that you think it would be useful.

Thank you in advance

alignment RNA-Seq genome • 2.8k views

ADD COMMENT • link updated 12 months ago by Ram 43k • written 7.7 years ago by Farbod ★ 3.4k

score 2 · Answer 1 · 2016-07-21

2

Entering edit mode

7.7 years ago

ablanchetcohen ★ 1.2k

STAR is currently the most widely-used aligner for RNA-Seq data. It is extremely fast, but memory hungry. STAR is the aligner I currently use, after having used TopHat for several years.

TopHat still works fine, but has unfortunately been declared obsolete by its developers. It is also extremely slow compared to STAR. We're talking hours vs minutes.

HISAT2 is the successor of TopHat.It's much faster than TopHat. It's a bit slower than STAR, but requires far less memory. There have been a few posts questioning some of its results, notably regarding pseudogenes, but the overall results seem very similar to TopHat's.

I haven't used BBmap, but it is popular on this forum, given that it's developper is active on the forum, and always eager to answer questions.

Anyone of these aligners will produce good alignments.
All the aligners that I know of first require an index to be built. STAR even requires different indexes depending on the read length.
You'll need 6 commands. This is the reason people have automated pipelines, in Python or Perl, to generate the command scripts. You can launch the 6 commands in parallel. If you have a queue manager, the queue manager will ensure that the server is not overloaded.

ADD COMMENT • link 7.7 years ago by ablanchetcohen ★ 1.2k

1

Entering edit mode

Dear ablanchetcohen, Hi and thank you.

I think the STAR is your first suggestion.

So, do you have any link for those Perl or Python automated mapping pipelines?

ADD REPLY • link 7.7 years ago by Farbod ★ 3.4k

1

Entering edit mode

I have written my own, but they're not quite ready for release. The MUGQIC have released their pipelines on Bitbucket, but they're quite complicated to use.

If you really do a lot of analyses, you'll end up finding that the best option is just to write your own scripts. If you only do a few analyses, copy-pasting the same script also works fine.

Of course, there is always Galaxy.

ADD REPLY • link 7.7 years ago by ablanchetcohen ★ 1.2k

1

Entering edit mode

Thank you, you are right, I will try my own pipeline.

I have seen in the Star manual, that there are several option for 1- indexing the genome and 2- mapping procedure

which of these options do you usually use and suggest for my case ?

(100 bp paired end illumina Hiseq2000, non model fish, and the close (not the same) species reference genome)

I really appreciate your help

ADD REPLY • link 7.7 years ago by Farbod ★ 3.4k

1

Entering edit mode

BBMap works well as well. If you happen to use it make sure you add sam=1.3 option to your command line. Counting programs (featureCounts and HTseq-count) don't understand sam v.1.4 format tags which BBMap outputs by default).

ADD REPLY • link 7.7 years ago by GenoMax 141k

1

Entering edit mode

Hi my friend and thanks,

I have search for a comprehensive step by step manual of BBMap and have found this:

https://www.biostarhandbook.com/tools/bbmap/bbmap-help.html

Do you have any better one (link for manual) ?

ADD REPLY • link 7.7 years ago by Farbod ★ 3.4k

1

Entering edit mode

There are individual threads for BBMap tools at SeqAnswers. @Brian also has documentation/guides available in the program under bbmap-nn.nn/bbmap/docs directory. If something is unclear, we can help. Tag your post with "bbmap" so @Brian sees it.

Simply, a workflow would be FastQC --> BBDuk/Trimmomatic/Cutadapt --> BBMap/STAR/HISAT2 --> Samtools --> featureCounts/HTseq-count --> DESeq2/edgeR

I find it easy to stay with BBTools since the command line format is consistent but you are free to choose/use any of the above tools.

ADD REPLY • link 7.7 years ago by GenoMax 141k

1

Entering edit mode

Hi, is this fact that the number of chromosomes of my species is not equal to the number of chromosome of the reference genome species making any problem ?

ADD REPLY • link 7.7 years ago by Farbod ★ 3.4k

1

Entering edit mode

No as long as you keep in perspective that the results you are getting are in reference to the surrogate reference genome.

You are not going to use positional information (but rather alignments) to get your counts. So if you discover that gene X is DE then you can't use the positional information as is since in your organism the gene could be on a different chromosome or location.

ADD REPLY • link 7.7 years ago by GenoMax 141k