Question

Low mapped percentage. How to know if data is junk?

2

6.5 years ago

c_u ▴ 520

TL;DR - I ran TopHat2 on human RNA-seq data and the mapping rate was only 9%. How can I determine whether my sequencing data is junk, i.e. whether something went wrong with the sequencing?

Total beginner to RNA-seq analysis here.

I received four RNA-seq datasets: three mouse and one human. I ran FastQC on all of them, and then ran a trimming script that we had written earlier; it removes the first 3 bases and the last 7 bases of each read, discards any read shorter than 15 bases, and so on. Then I ran TopHat on all the samples. The only modification was that the GTF file and the FASTA file contain only exon information (we did this because Bowtie had previously given us problems with the full files). The TopHat command:

tophat2 -o ./tophat/$x --transcriptome-index ~/.../mod.gencode.v19 --no-coverage-search --keep-tmp --num-threads 14 ~/...mod.GRCh37.p13.genome ./fastq/human/$x.fastq

(Of course, I used the mouse GTF and FASTA files for the mouse samples and the human files for the human sample.)
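For readers who want to reproduce the trimming step described above, here is a minimal sketch of those three rules. The in-house trimmer itself is not shown in the post, so the function name `trim_fastq` is hypothetical, and the sketch assumes that the 15-base length filter is applied after trimming and that the FASTQ uses four lines per record.

```shell
# Hypothetical re-implementation of the trimming rules from the post.
# Usage: trim_fastq N_HEAD N_TAIL MIN_LEN < in.fastq > out.fastq
trim_fastq() {
    awk -v nhead="$1" -v ntail="$2" -v minlen="$3" '
        NR % 4 == 1 { name = $0 }      # record line 1: read name
        NR % 4 == 2 { seq = $0 }       # record line 2: bases
        NR % 4 == 3 { plus = $0 }      # record line 3: separator
        NR % 4 == 0 {                  # record line 4: qualities
            keep = length(seq) - nhead - ntail
            # Assumption: the length filter applies to the trimmed read.
            if (keep >= minlen) {
                print name
                print substr(seq, nhead + 1, keep)
                print plus
                print substr($0, nhead + 1, keep)
            }
        }
    '
}
```

With the values from the post this would be called as `trim_fastq 3 7 15 < in.fastq > trimmed.fastq` (file names are placeholders). Comparing read counts before and after trimming shows how much data the filter itself discards before mapping.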

For all the 3 mouse samples, the mapping rate was around 50%, which is what we have seen for our samples before too. For the human sample, it was only 9%.

Now I want to know whether the problem is with the sequencing or whether I can do something to get a better mapping rate. A friend suggested that I check the levels of some highly expressed genes in IGV (using the mapped BAM file) and see whether they are what I would expect. So I wanted to ask the community what I can do to test whether the data is indeed faulty, or whether the mapping can be improved.

Note - it was sequenced on a next-gen sequencer.

Here is the FastQC report for the human sample - https://drive.google.com/file/d/1hzotX-G2Bn0-SKmI8oX_pUNxokh76iT3/view?usp=sharing

RNA-Seq tophat • 1.8k views

ADD COMMENT • link 6.5 years ago by c_u ▴ 520

1

With just 9% alignment, you can take pretty much any sample of the data (do not use the reads at the top of the file) and BLAST them at NCBI, as has been suggested.

Has this data been trimmed? If not, try that first. There may be a big 3' bias as well (poly-A's in your data).

Try a different aligner (I suggest bbmap) to see what you get. If the data is junk then no miracles are likely, but you will at least convince yourself of that fact.
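One way to take a sample that avoids the top of the file, as suggested above, is to skip a number of reads and convert the next few to FASTA for pasting into BLAST. This is only a sketch: the function name `sample_fastq_to_fasta` and the skip and count values are illustrative assumptions, and it expects an uncompressed FASTQ with four lines per record.

```shell
# Usage: sample_fastq_to_fasta FILE SKIP COUNT
# Skips the first SKIP reads, then prints the next COUNT reads as FASTA.
sample_fastq_to_fasta() {
    awk -v skip="$2" -v count="$3" '
        # Name line: compute the 1-based read index and drop the leading "@".
        NR % 4 == 1 { rec = (NR + 3) / 4; name = substr($0, 2) }
        # Sequence line of a read inside the requested window.
        NR % 4 == 2 && rec > skip && rec <= skip + count {
            print ">" name
            print $0
        }
        # Stop reading once the window has been passed.
        rec > skip + count { exit }
    ' "$1"
}
```

For example, `sample_fastq_to_fasta ./fastq/human/$x.fastq 1000000 20 > sample.fa` would write reads 1,000,001 to 1,000,020 to `sample.fa`, assuming the file has at least that many reads.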

ADD REPLY • link 6.5 years ago by GenoMax 141k

0

Is this sequenced on a NextSeq?

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

0

Yes, it is on a next-gen sequencing machine.

ADD REPLY • link 6.5 years ago by c_u ▴ 520

0

The overrepresented sequences, with the polyG reads, suggest this is from an Illumina NextSeq, or alternatively from another sequencer using two-colour labels.
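To put a number on the polyG observation, one could count how many reads consist mostly of G. On two-colour instruments a cycle with no signal is called as G, so a high fraction of such reads points to failed or exhausted clusters. The sketch below is an assumption-laden illustration: the function name `polyg_fraction` and the threshold are hypothetical choices, not something taken from the FastQC report.

```shell
# Usage: polyg_fraction MIN_G_FRACTION < reads.fastq
# Prints: polyG_reads total_reads fraction_of_reads
polyg_fraction() {
    awk -v thr="$1" '
        NR % 4 == 2 {
            total++
            seq = $0
            g = gsub(/G/, "", seq)   # gsub returns how many G bases it removed
            if (length($0) > 0 && g / length($0) >= thr) poly++
        }
        END {
            if (total == 0) { print "0 0 0"; exit }
            printf "%d %d %.4f\n", poly, total, poly / total
        }
    '
}
```

Running `polyg_fraction 0.9 < ./fastq/human/$x.fastq` would report how many reads are at least 90% G; the 0.9 cutoff is arbitrary and can be adjusted.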

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

0

Have you tried taking some of the unaligned reads and putting them into NCBI blastn? That will tell you what they are. I'm guessing either contamination from another species or unremoved rRNA.

ADD REPLY • link 6.5 years ago by Philipp Bayer 8.3k

0

Thanks Philipp for your comment. How can I get these unaligned reads? I know that TopHat creates two BAM files, one for mapped hits and another for the unmapped reads, but I don't know how to extract the unaligned sequences from the BAM file, as it's a binary file. Any ideas?

ADD REPLY • link 6.5 years ago by c_u ▴ 520

1

If you run samtools view on your BAM file of unaligned reads, you'll get the reads in SAM format. From there you'll have to copy and paste a few into the search window of NCBI blastn.
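To avoid copying sequences by hand, the SAM text can be turned into FASTA, which blastn accepts directly. A minimal sketch, assuming samtools is installed and that the unmapped reads are in a file called `unmapped.bam` in the TopHat output directory; the helper name `sam_to_fasta` is hypothetical. In SAM, column 1 is the read name and column 10 is the sequence.

```shell
# Usage: samtools view unmapped.bam | sam_to_fasta N > sample.fa
# Reads SAM text on stdin and prints the first N records as FASTA.
sam_to_fasta() {
    awk -F '\t' -v n="$1" '
        /^@/ { next }            # skip SAM header lines, if any are present
        seen >= n { exit }       # stop after N records
        { print ">" $1; print $10; seen++ }
    '
}
```

For example, `samtools view ./tophat/$x/unmapped.bam | sam_to_fasta 20 > unmapped_sample.fa` would give 20 sequences ready to paste into the NCBI blastn search box. These are the first records in the file, so it may be worth skipping ahead if the top of the file looks unrepresentative.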

ADD REPLY • link 6.5 years ago by Philipp Bayer 8.3k

0

Have you tried aligning the human reads to mouse, just to check if something went wrong there?

The data look quite low quality to me...

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k

0

So, do you mean that I should try to map the human FASTQ files to the mouse genome using the mouse GTF and FASTA files? If so, is your hunch that the data is somehow coming mostly from a mouse sample?

ADD REPLY • link 6.5 years ago by c_u ▴ 520

1

I don't know - I just need to exclude potential contamination. Although I haven't tried it, I would expect more human reads to map to the mouse genome due to homology.

ADD REPLY • link 6.5 years ago by WouterDeCoster 47k