Biostar Beta. Not for public use.
Question: Extracting from tophat outputs reads pairs and splice-junctions with a single best match
7
Entering edit mode

In the tophat folder (generated by running tophat), there is a file called align_summary.txt. In the example below, it reports that out of 92923521 aligned pairs, 9.6% have multiple alignments, which means that there must be a way to generate a file containing only the uniquely aligned pairs (those that have only a single match as a pair). The accepted_hits.bam file, if I understand correctly, contains both, unique and multiple alignments pairs. Which command can I use to extract only the uniquely aligned pairs?

Also, if the junctions.bed file contains splice junctions which uniquely mapped based on alignments as well as multimappers, which command can I use to extract only the uniquely aligned? And since this is .bed rather than .bam format, is the pairing information utilized for determining the uniqueness of the splice junction read?

In the following 3 posts the code 255 (in both accepted_hits.bam and junctions.bed) is explained as either meaning that the mapping quality score is not available or that read has a single match in the reference: https://www.biostars.org/p/69773, https://biostar.usegalaxy.org/p/4077, https://www.biostars.org/p/16653 --- what is the final word on this? And if it means single match, does it refer to reads independently of pairing or also if as a pair the match is unique even if each read on its own could have more matches?

Another solution for accepted_hits.bam file is: samtools view -bq 1 file.bam > unique.bam, where unique reads = single read mapping at one best position, suggested in the following posts: https://www.biostars.org/p/56246/#56251 and http://seqanswers.com/forums/archive/index.php/t-6189.html --- here too, if as a pair the match is unique, even if each read on its own could have more matches, is it accepted as unique?

Alternatively, the following two posts suggest that using -g/--max-multihits 1 when running tophat will generate accepted_hits.bam file with reads that have only a single match in the genome, though I do not know whether this command will also include reads which have only 1 match as a pair but not each on its own? https://www.biostars.org/p/56331 and http://seqanswers.com/forums/showthread.php?t=8548

I assume that for all of the above, for a single match in the reference the reads/pairs tophat alignment run need to be set to disallow any mismatches, because if allowed, then more would get multiple matches and get excluded. On the other hand, if read/pair has 0 matches, it is preferable to allow at least 1 mismatch. How could such combined approach be implemented? The following post discusses the thresholds for alignment quality and I wonder if this could be used along with other approaches, so that reads pairs that have no matches would be allowed to have one mismatch? https://www.biostars.org/p/101533/#101551 samtools view -b -q 10 foo.bam > foo.filtered.bam, where -q value represents the likelihood that an alignment is correct.

I do not have the expertise to evaluate all these approaches, and would appreciate advice. Thank you.

Left reads:

          Input: 128979165

          Mapped:  98314933 (76.2% of input)

         of these:  11898655 (12.1%) have multiple alignments (9004 have >20)

Right reads:

          Input: 128979165

          Mapped:  95536410 (74.1% of input)

         of these:  10769172 (11.3%) have multiple alignments (2289 have >20)

75.1% overall read alignment rate.

Aligned pairs: 92923521

 of these:   8913959 ( 9.6%) have multiple alignments

      and:   1899417 ( 2.0%) are discordant alignments

70.6% concordant pair alignment rate.

ADD COMMENTlink 5.5 years ago trakhtenberg • 150 • updated 5.5 years ago Biostar 20
Entering edit mode
1

Regarding the BAM file, the newer versions of tophat2 will give a score of 50 for alignments that don't multimap (this used to be 255). Yes, 255 is defined in the specification as "unavailable", but what the specification says and people do aren't always the same. So, just samtools view -bq 4 accepted_hits.bam > filtered.bam to filter things. Note that -q 1 won't work, since tophat2 will give MAPQ scores of between 0 and 3 to multimappers (depending on how many times they map).

I don't know about the BED files, so I'm just making this a comment to not dissuade others from answering.

BTW, you might switch to STAR. The author is pretty quick with responses over on seqanswers.

ADD REPLYlink 5.5 years ago
Devon Ryan
90k
Entering edit mode
1

So, "samtools view -bq 4 accepted_hits.bam > filtered.bam" will select only unique mapping reads from the tophat2 accepted_hits.bam.

While ``samtools view -bq 1 works for bwa file to select unique mapping reads.

ADD REPLYlink 4.3 years ago
Chirag Nepal
♦ 2.2k
Entering edit mode
1

Yup!

ADD REPLYlink 4.3 years ago
Devon Ryan
90k
Entering edit mode
0

Thanks !

How do you filter discordant reads from filtered.bam. For eg: in the align_summary.txt provided bytrakhtenberg, there are about 2% of discordant reads. Tophat with options --no-discordant always fails, probably need to do through samtools. ``Any suggestions. Thanks !

ADD REPLYlink 4.3 years ago
Chirag Nepal
♦ 2.2k
Entering edit mode
0

You want bit 2 to be set.

ADD REPLYlink 4.3 years ago
Devon Ryan
90k
Entering edit mode
0

"You want bit 2 to be set." You mean during tophat or samtools ? Could you please show an example command.

thanks !

ADD REPLYlink 4.3 years ago
Chirag Nepal
♦ 2.2k
Entering edit mode
0

With samtools, so samtools view -bq 4 -f 2 ....

ADD REPLYlink 4.3 years ago
Devon Ryan
90k
Entering edit mode
0

thank you. I also found on this post http://seqanswers.com/forums/newreply.php?do=newreply&p=108623 that the reads from junctions.bed are actually included in the accepted_hits.bam (tagged by 'N' characters in the CIGAR string, 6th column), so I only need to filter the BAM file.

Although not critical to this discussion, I have another question which may influence the identification of uniquely aligned reads, and I posted it here https://www.biostars.org/p/111240/

ADD REPLYlink 5.5 years ago
trakhtenberg
• 150
Entering edit mode
1

An N CIGAR operation means spliced, so that's correct. BTW, if you were to use STAR, you would find that it annotates its junctions with the number of uniquely aligned reads supporting them (it also gives you total reads). This is convenient for filtering novel junctions.

ADD REPLYlink 5.5 years ago
Devon Ryan
90k
Entering edit mode
0

thank you for clarifying the meaning of N CIGAR. I will soon start practicing with STAR too.

ADD REPLYlink 5.5 years ago
trakhtenberg
• 150

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0