Question

Why Is The Tophat2 Unmapped.Bam File Larger Than The Accepted.Bam File?

1

11.5 years ago

Varun Gupta ★ 1.3k

Hi Everyone

I am using RNA-seq data with TopHat2 and getting an unmapped.bam file in addition to the usual accepted_hits.bam file.

The unmapped.bam file is quite large compared with the alignment file. Is there anything I should worry about?

Regards V

• 5.0k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 11.5 years ago by Varun Gupta ★ 1.3k

0

Hi Gupta

Did you find an answer to this problem? I have now run into the same situation: my accepted_hits.bam and unmapped.bam are about 1M and 1.2G, respectively. If you have any ideas on how to solve it, please help me.

Thank you!

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Ginsea Chen ▴ 130

0

Not really, Chen. I moved to the STAR aligner.

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 9.8 years ago by Varun Gupta ★ 1.3k

Ram · Answer 1 · 2012-10-18

3

11.5 years ago

Istvan Albert 100k

It is larger because you have more unmapped reads than mapped reads. Whether or not that is something to worry about depends on many factors: the origin of the samples, the genome you align against, etc.
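File size is only a rough proxy for read count, so it may help to compare actual counts. The sketch below assumes you have already counted the reads in each file yourself (for example with `samtools view -c`); the function name and the example numbers are made up for illustration. Note that accepted_hits.bam can list a multi-mapped read more than once, so excluding secondary alignments with `-F 256` is probably closer to a per-read count.

```python
def mapped_fraction(mapped_reads: int, unmapped_reads: int) -> float:
    """Return the fraction of reads that aligned, given per-file read counts."""
    total = mapped_reads + unmapped_reads
    if total == 0:
        raise ValueError("no reads counted")
    return mapped_reads / total


# Hypothetical counts, e.g. obtained with:
#   samtools view -c -F 256 accepted_hits.bam
#   samtools view -c unmapped.bam
print(f"{mapped_fraction(250_000, 750_000):.1%} of reads mapped")
```

A low fraction by itself does not say what went wrong; it only confirms that most reads failed to align.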

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 11.5 years ago by Istvan Albert 100k

0

I am using the default parameters only...

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 11.5 years ago by Varun Gupta ★ 1.3k

Ram · Answer 2 · 2012-10-18

0

11.5 years ago

swbarnes2 14k

Did you try looking at some of the unmapped reads to see why they were unmapped?

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 11.5 years ago by swbarnes2 14k

0

I did.

Some of them contain N's, so that is one reason for sure.

But how should I deal with the ones that do not contain N's?

Regards

ADD REPLY • link updated 2.5 years ago by Ram 43k • written 11.5 years ago by Varun Gupta ★ 1.3k

0

Well, did you examine the ones without N's? Are they poor quality? Are they genomic contamination? Contamination from another species?

There might be a very big problem. Maybe the sample prep people are screwing up. Maybe a kit was defective. Maybe your reference file is screwed up. Maybe you were totally misinformed as to what the project was about. You won't know unless you actually look at your raw data.
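One way to start looking at the raw data is to tally how many unmapped reads contain N's or have low average base quality. The following is a minimal sketch, assuming the unmapped reads have been exported to 4-line FASTQ (for instance with `samtools fastq unmapped.bam`) and that qualities use the Phred+33 encoding; the function names and the quality cutoff of 20 are illustrative choices, not anything TopHat prescribes.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.strip().lstrip("@"), seq, qual


def summarize(handle: TextIO, phred_offset: int = 33,
              low_quality: float = 20.0) -> Dict[str, int]:
    """Count reads, reads containing N, and reads with low mean quality."""
    total = with_n = low_q = 0
    for _name, seq, qual in read_fastq(handle):
        total += 1
        if "N" in seq.upper():
            with_n += 1
        if qual:
            mean_q = sum(ord(c) - phred_offset for c in qual) / len(qual)
            if mean_q < low_quality:
                low_q += 1
    return {"reads": total, "with_N": with_n, "low_quality": low_q}


# Tiny in-memory example: r1 contains an N, r2 has very low qualities.
example = io.StringIO(
    "@r1\nACGTN\n+\nIIIII\n"
    "@r2\nACGTA\n+\n#####\n"
)
print(summarize(example))
```

Reads that pass both checks are the interesting ones: those are the candidates to inspect for contamination or a reference mismatch.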

ADD REPLY • link 11.5 years ago by swbarnes2 14k

score 0 · Answer 3 · 2012-10-19

Perhaps you need to modify the arguments passed to TopHat, such as the constraints (the number of mismatches allowed, the maximum intron size, and so on), to increase the number of accepted hits.

The other possibility is to provide a better reference genome, if the current one is not a good one, or to check whether your RNA-seq experiment was done well.
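As a rough illustration of the first suggestion, the sketch below assembles a TopHat2 command line with relaxed constraints. It only builds and prints the command; it does not run it. The helper name, paths, and numeric values are hypothetical placeholders, and the check that mismatches do not exceed the edit distance reflects what TopHat2 appears to enforce, so verify it against your installed version.

```python
import shlex
from typing import List


def tophat2_command(index_base: str, reads: str, out_dir: str,
                    read_mismatches: int, read_edit_dist: int,
                    max_intron_length: int) -> List[str]:
    """Build (but do not run) a tophat2 argument list for single-end reads."""
    # Assumption: TopHat2 wants --read-mismatches <= --read-edit-dist.
    if read_mismatches > read_edit_dist:
        raise ValueError("read_mismatches must not exceed read_edit_dist")
    return [
        "tophat2",
        "-o", out_dir,
        "--read-mismatches", str(read_mismatches),
        "--read-edit-dist", str(read_edit_dist),
        "--max-intron-length", str(max_intron_length),
        index_base,   # Bowtie index prefix (placeholder)
        reads,        # FASTQ file with the reads (placeholder)
    ]


# Placeholder paths and values, chosen only to show the shape of the call.
cmd = tophat2_command("genome_index", "reads.fastq", "tophat_relaxed",
                      read_mismatches=3, read_edit_dist=3,
                      max_intron_length=100_000)
print(shlex.join(cmd))
```

Loosening the constraints can raise the number of accepted hits, but it can also admit spurious alignments, so it is worth comparing results against the default run.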

score 0 · Answer 4 · 2012-10-19

0

11.5 years ago

siddharth.sethi5 ▴ 270

Did you check your "junctions.bed" file (it should be empty in your case)? I think that's because TopHat is not able to find the splice junctions.

I was getting this problem, and I think I changed the -r (--mate-inner-dist) parameter in TopHat to fix it.

ADD COMMENT • link 11.5 years ago by siddharth.sethi5 ▴ 270

0

Hi

This was single-end data, so -r was not required. My junctions.bed file was not empty either.

Regards

varun

ADD REPLY • link 11.5 years ago by Varun Gupta ★ 1.3k