We found that most of the reads in a large number of samples are chimeric: they have been ligated together from various fragments. I am trying to see if I can identify and salvage the reads that slipped through the cracks and are whole/continuous.
I can visualize the alignments and see large gaps in many of the reads within the same organism, as well as reads that are split between two organisms. I think this is sufficient proof of the problem, and also proof that the information needed to identify continuous reads can be found in the SAM file.
Can anyone help me identify reads that mapped continuously (barring reasonable INDELs)? These are 150 bp reads, and my threshold for continuity is flexible: roughly, no gap larger than the read length. Alternatively, identifying all reads that are not continuous would get the job done just as well.
Have you considered trimming the reads to 50 bp, for example? Those would be far less likely to be chimeric. It might be easier and more productive than throwing out all the problematic ones (assuming that most reads are affected).
We are considering that option, but that still leaves a risk of chimeric reads being included, depending on how the trimming worked out.
We're hoping to be able to identify about 30-40% of the reads as non-chimeric. If we can't identify even 15% as non-chimeric, we'll probably resort to trimming.
We also have a need for as complete a read as possible. There is a lot of value for our particular experiment in getting 130-150 bp alignments that are complete and continuous.
Thanks for the suggestion!
See this discussion about split reads: Split Read in Samtools
That might be what you are looking for.
The solution listed there is:
samtools view -f 256 Input.bam | awk '$6 ~/S/ && $7 == "=" {print $0}' > Secondary_clipped.sam
However, this pulls out only secondary alignments (-f 256) and then keeps those whose CIGAR string contains soft clipping and whose mate maps to the same reference. Chimeric reads may produce secondary alignments, but this filter won't necessarily capture all of them. Soft clipping is likely as well, but I don't know whether it will capture everything I'm looking for.
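For what it's worth, here is a rough sketch of the kind of per-read check I have in mind, working directly on SAM text lines. It is not a tested pipeline. It assumes the aligner marks split reads with the supplementary flag (0x800) and/or an SA:Z tag, as the SAM specification describes for chimeric alignments; not every aligner does this. The function name looks_continuous and the max_clip=20 cutoff are my own placeholders, and max_gap=150 just mirrors the "no gap larger than the read" threshold from the question.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def looks_continuous(sam_line, max_gap=150, max_clip=20):
    """Return True if a SAM record looks like a single continuous alignment.

    max_gap:  largest deletion/skip (CIGAR D or N) tolerated, in bp.
    max_clip: largest soft/hard clip (CIGAR S or H) tolerated, in bp.
              This value is an arbitrary assumption; tune it to the data.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]

    # Unmapped, secondary and supplementary records are never kept.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY) or cigar == "*":
        return False

    # An SA tag on the primary record means other parts of the read
    # aligned elsewhere (assumes the aligner emits SA tags).
    if any(tag.startswith("SA:Z:") for tag in fields[11:]):
        return False

    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "SH" and length > max_clip:
            return False  # a large part of the read did not align here
        if op in "DN" and length > max_gap:
            return False  # gap on the reference larger than the threshold
    return True
```

One way to use this would be to pipe the output of samtools view Input.bam into a short loop that calls looks_continuous on each line and prints the ones that pass. Inverting the test gives the non-continuous reads instead.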