Biostar Beta. Not for public use.
How To Determine If Paired–End Illumina Rnaseq Reads Are Strand–Specific
18
Entering edit mode
6.1 years ago
bedeabc • 180

I've been provided with more than a billion reads of RNAseq data for a poorly annotated nematode species. They appear to be 2x100 paired-end Illumina reads – I currently know frustratingly little about the RNAseq protocol used, but need to perform assemblies using Trinity.

Trinity demands that I specify whether or not the reads are strand–specific, and also which strand is which through the _--SS_lib_type_ parameter, which needs to be either _FR_ or _RF_.

For each tissue sample, I have been given paired _fwd_ and _rev_ FASTQ files. How can I tell i) whether the data is indeed strand-specific, and ii) which strand is which, so that I know whether to use FR or RF with Trinity.

Any thoughts much appreciated. Here are the top four lines from two corresponding FASTQ files I've been given:

head -n 4 Tmuris_adult_R4*
==> Tmuris_adult_R4_fwd.fastq <==
@HS23_6814:1:1101:1592:2250#4/1
GCGGTATCAGTTGGTAAACCCTGCAGGCGCTCGCATAACGGTCGAAGGCTTTTTGCGGATCGTCGTCATTGTCGTTGACCTCAGCATCGCNCACCTCCTC
+
B3:64JGADLBACJHH3EACD@DJAHLJDIENFEKIJJ6LE-HFJH57H7L9=BAFI8@FK>,GBDH764,5,4A='+G+,+,*E++@+2!+:+1>1=+4

==> Tmuris_adult_R4_rev.fastq <==
@HS23_6814:1:1101:1592:2250#4/2
CGAACCCNGTATNTTTGCGCTACTNTGTCTCCTACGCCTTTGTCTGTCTTGCCTGCATGGCTAACACTGCCCTGTTGGTTCAAGTGTCGTCTGCCGGAAG
+
:ABEGGH!G8EJ!8EJE6IEFBIH!HF8EKDD66FFAMDCKE/5>D5LD?E=?AHG>=AE5@E5I@CGB<KK@GG<B2E:H@2I9ICI?C@HC2@2:0@2
ADD COMMENTlink
0
Entering edit mode

There is now a nice compilation of all the different variations of this question: Read pair orientation : Illumina TruSeq Stranded mRNA library

ADD REPLYlink
17
Entering edit mode
3.9 years ago
matted 7.0k
Boston, United States

I'll first say that it doesn't bode well for the rest of the project if your collaboration is so distant that you can't get a basic question like this answered from the providers of the data.

There's no way to figure out the strandedness from looking at raw reads. I would align the reads to a reference genome (or a closely related species, if you don't have a perfect reference). Then load the alignments in a browser (e.g. IGV) and look for expressed transcripts by eye. You could look at known gene locations, depending on the quality of the annotations. If the reads are typically observed from both strands on the exons, you can assume it's not strand-specific. If each exon seems to have a clear strand preference, you would assume it's strand-specific.

ADD COMMENTlink
7
Entering edit mode

In IGV, best is right-click on the reads, then color alignments by first-of-pair strand. Ideally, you should predominantly see one color within a transcript. From that, you can figure out whether read1 has the same direction as the gene, or the opposite.

ADD REPLYlink
0
Entering edit mode

I now know that when I am looking at the gene that is marked as (+) in UCSC browser, then left read is marked as second in pair, whereas right read is marked as first in pair. Does this mean that I am dealing with FR or RF data? Thanks a lot. It is indeed true that when I check color alignments by first-of-pair strand every transcript gets only one color.

ADD REPLYlink
6
Entering edit mode
14 months ago
Pittsburgh, PA

To tell whether RNA-seq reads are strand-specific, you must first perform an alignment. Following alignment you can use an automated tool like infer_experiment.py in the RSeQC package that will tell you whether the data is single-end or paired-end, strand-specific or unstranded, and if strand-specific, what type (first strand or second strand).

infer_experiment.py samples a few hundred thousand reads from your bam/sam and tells you the portion that would be explained by each type of strandedness, e.g.:

$ infer_experiment.py -i accepted_hits.bam -r hg19.refseq.bed -s 400000
Reading reference gene model hg19.refseq.bed ... Done
Loading SAM/BAM file ...  Total 400000 usable reads were sampled

This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0209
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9791
Fraction of reads explained by other combinations: 0.0000

When you get your output you can refer here for help with interpretation.

ADD COMMENTlink
1
Entering edit mode
20 months ago
Ryan Thompson ♦ 3.4k
TSRI, La Jolla, CA

If you don't have a reference that you can use, you could try doing a de novo assembly and mapping the reads to it. The assembly doesn't have to be great to give you enough information to tell if the reads are stranded.

ADD COMMENTlink
0
Entering edit mode
6.1 years ago
bedeabc • 180

Thank you matted. I had thought that might be the case – I suppose I hoped there might something in fastq headers that gave away strand specificity. I'm building a SAM now and will use IGV for viewing

I agree it does not bode well! I am awaiting a reply from our collaborators in Cambridge (UK, not MA!) who I do imagine will be very helpful.

ADD COMMENTlink
1
Entering edit mode

Best of luck. Just a clarification: strand specificity for RNA-seq is achieved through using a specific library preparation protocol, so it's totally decoupled from the sequencing step (see e.g. this paper). That is, the sequencer (which is producing the read names and the fastq file) has no way of knowing if the DNA molecules it's sequencing come from a strand-specific library or not. The only (rare) exception might be if the sequencing facility manually changed the read names to something that reflected the library status.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1