Genome-aligned RNA-Seq: one treatment showing high % of unmapped reads: too short
Biogeek, 5.5 years ago

My issue is with recent RNA-seq data we have. I've aligned my RNA-seq reads to the genome with STAR. The animal in question is a cnidarian. We have a control set of reps and a treated set of reps. Only the treated set is showing poor alignment rates. RNA extraction, enrichment and sequencing were performed on all samples at the same time and on the same run.

The output of STAR for the low-aligning samples is that 60% of reads are not mapping due to being 'too short' - this seems to be characteristic of all the treatment reps. QC of the reads seems fine. I've used minimal trimming with Trimmomatic as I don't want to remove a lot of valuable data. No head cropping or anything else that could affect the alignment % was performed.
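
For context, that figure comes straight from STAR's Log.final.out; a quick way to compare it across all samples (assuming the default per-sample output prefixes) is:

    # Pull the mapping rate and the 'too short' percentage from every sample's STAR log
    grep -E "Uniquely mapped reads %|% of reads unmapped: too short" *Log.final.out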

The same results are also produced with Salmon, so it doesn't seem to be a software issue. I've noticed that in the treatment samples the GC content is up by 2% compared to the control samples.

I'm starting to think contamination? Or is there something else at play?

Thanks :-)

RNA-Seq mapping low alignment GC content

The "too short" refers to the alignment length rather than the read length (see here), meaning that for 60% of the reads STAR could not find an alignment long enough to pass its filters.

You can store the unmapped reads with --outReadsUnmapped Fastx and analyse these further. E.g. you can run fastqc and check the overrepresented sequences.
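
A minimal sketch of how that could look, assuming paired-end gzipped FASTQ input and an existing STAR index (all file names and paths are placeholders):

    # Re-run STAR, writing reads that fail the alignment filters to
    # <prefix>Unmapped.out.mate1 / <prefix>Unmapped.out.mate2
    STAR --runThreadN 8 \
         --genomeDir star_index/ \
         --readFilesIn treated_rep1_R1.fastq.gz treated_rep1_R2.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outReadsUnmapped Fastx \
         --outFileNamePrefix treated_rep1_

    # QC the unmapped reads and look at the overrepresented sequences
    fastqc -f fastq treated_rep1_Unmapped.out.mate1 treated_rep1_Unmapped.out.mate2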

Does the treatment influence the cells too drastically (fragmenting their RNA)? Could your samples have been mixed up in either the lab or the sequencing facility? Did you share the sequencing run with others?


Hey Michael,

I've done so. I took a proportion of those unmapped reads as FASTQ and converted them to FASTA; a lot of them seem to be coming back as fungi and bacteria. As such, I'm going to assemble a de novo transcriptome from all those reads and annotate the entirety of the unmapped reads... OR is there an easier alternative to show what contamination each library has?
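
For what it's worth, that subsample-and-BLAST check can be scripted along these lines (seqtk and BLAST+ with a local nt database are assumptions; file names and paths are placeholders):

    # Subsample ~10,000 unmapped reads and convert FASTQ to FASTA
    seqtk sample -s100 treated_rep1_Unmapped.out.mate1 10000 > unmapped_sub.fastq
    seqtk seq -a unmapped_sub.fastq > unmapped_sub.fasta

    # BLAST against a local nt database; sscinames needs the NCBI taxdb files
    blastn -query unmapped_sub.fasta -db /path/to/nt \
           -outfmt "6 qseqid sseqid pident length evalue staxids sscinames" \
           -max_target_seqs 5 -num_threads 8 -out unmapped_sub_blast.tsv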

As for carrying forward with the reads that did align, and as a second opinion: would it be acceptable to do DE analysis as long as a suitable between-library normalisation is applied (i.e. TMM / upper quartile)?

Thanks.


Best programs that can delineate contamination from .fastq files?


You may try FastQ Screen: "FastQ Screen is a simple application which allows you to search a large sequence dataset against a panel of different genomes to determine from where the sequences in your data originate."
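
A rough example invocation; the config file listing the screening genomes (host, human, E. coli, yeast, PhiX, ...) and the bowtie2 indexes are things you would have to set up first:

    # Screen a subset of reads from each library against the configured genome panel
    fastq_screen --conf fastq_screen.conf --aligner bowtie2 \
                 control_rep1_R1.fastq.gz treated_rep1_R1.fastq.gz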

If you know your main contamination source, you can use bbsplit from the bbmap suite to separate the contamination.
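
As a sketch, assuming you have FASTA files for the suspected contaminants (the reference and read names here are placeholders; BBSplit builds its index on the fly):

    # Bin read pairs by best-matching contaminant reference (% = reference name,
    # # = mate number); reads hitting neither reference go to the outu "clean" files
    bbsplit.sh in1=treated_rep1_R1.fastq.gz in2=treated_rep1_R2.fastq.gz \
        ref=fungal_contam.fa,bacterial_contam.fa \
        basename=treated_rep1_%_#.fq.gz \
        outu1=treated_rep1_clean_R1.fq.gz outu2=treated_rep1_clean_R2.fq.gz \
        refstats=treated_rep1_refstats.txt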

Before starting DE analysis, I'd check the quality of the alignments with RSeQC (geneBody_coverage, read_distribution, ...) to see if the ~40% of reads that do hit the target are OK.
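
For example, with a sorted and indexed BAM and a BED12 gene model (both placeholders here):

    # Coverage along the gene body of the reads that did align
    geneBody_coverage.py -i treated_rep1_Aligned.sortedByCoord.out.bam \
        -r cnidarian_genes.bed -o treated_rep1_gbc

    # Fraction of aligned reads in exons, introns, UTRs and intergenic regions
    read_distribution.py -i treated_rep1_Aligned.sortedByCoord.out.bam \
        -r cnidarian_genes.bed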

Cheers, Michael


I'm going back and seeing if sorting the .fastq files has any effect.


Please give some details. You say STAR complains about 'too short' reads. What are the read lengths? Read length has a notable influence on mapping efficiency. See a recent post of mine that is slightly related to read length and mapping %.


75 bp PE reads, --sjdbOverhang set to read length - 1.
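
For 2 x 75 bp reads that would correspond to an index built roughly like this (genome FASTA and GTF paths are placeholders):

    # --sjdbOverhang = read length - 1 = 74
    STAR --runMode genomeGenerate \
         --genomeDir star_index/ \
         --genomeFastaFiles cnidarian_genome.fa \
         --sjdbGTFfile cnidarian_annotation.gtf \
         --sjdbOverhang 74 \
         --runThreadN 8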
