In the tophat folder (generated by running tophat), there is a file called align_summary.txt. In the example below, it reports that out of 92923521 aligned pairs, 9.6% have multiple alignments, which means that there must be a way to generate a file containing only the uniquely aligned pairs (those that have only a single match as a pair). The accepted_hits.bam file, if I understand correctly, contains both, unique and multiple alignments pairs. Which command can I use to extract only the uniquely aligned pairs?
Also, if the junctions.bed file contains splice junctions which uniquely mapped based on alignments as well as multimappers, which command can I use to extract only the uniquely aligned? And since this is .bed rather than .bam format, is the pairing information utilized for determining the uniqueness of the splice junction read?
In the following 3 posts the code 255 (in both accepted_hits.bam and junctions.bed) is explained as either meaning that the mapping quality score is not available or that read has a single match in the reference: https://www.biostars.org/p/69773, https://biostar.usegalaxy.org/p/4077, https://www.biostars.org/p/16653 --- what is the final word on this? And if it means single match, does it refer to reads independently of pairing or also if as a pair the match is unique even if each read on its own could have more matches?
Another solution for accepted_hits.bam file is: samtools view -bq 1 file.bam > unique.bam, where unique reads = single read mapping at one best position, suggested in the following posts: https://www.biostars.org/p/56246/#56251 and http://seqanswers.com/forums/archive/index.php/t-6189.html --- here too, if as a pair the match is unique, even if each read on its own could have more matches, is it accepted as unique?
Alternatively, the following two posts suggest that using -g/--max-multihits 1 when running tophat will generate accepted_hits.bam file with reads that have only a single match in the genome, though I do not know whether this command will also include reads which have only 1 match as a pair but not each on its own? https://www.biostars.org/p/56331 and http://seqanswers.com/forums/showthread.php?t=8548
I assume that for all of the above, for a single match in the reference the reads/pairs tophat alignment run need to be set to disallow any mismatches, because if allowed, then more would get multiple matches and get excluded. On the other hand, if read/pair has 0 matches, it is preferable to allow at least 1 mismatch. How could such combined approach be implemented? The following post discusses the thresholds for alignment quality and I wonder if this could be used along with other approaches, so that reads pairs that have no matches would be allowed to have one mismatch? https://www.biostars.org/p/101533/#101551 samtools view -b -q 10 foo.bam > foo.filtered.bam, where -q value represents the likelihood that an alignment is correct.
I do not have the expertise to evaluate all these approaches, and would appreciate advice. Thank you.
Mapped: 98314933 (76.2% of input)
of these: 11898655 (12.1%) have multiple alignments (9004 have >20)
Mapped: 95536410 (74.1% of input)
of these: 10769172 (11.3%) have multiple alignments (2289 have >20)
75.1% overall read alignment rate.
Aligned pairs: 92923521
of these: 8913959 ( 9.6%) have multiple alignments
and: 1899417 ( 2.0%) are discordant alignments
70.6% concordant pair alignment rate.