Extract read sequence from BAM files depending on the alignment coordinates
1
0
Entering edit mode
5.2 years ago
caggtaagtat ★ 1.9k

Hi,

I have amplicon sequencing data, which was trimmend and then aligned to my reference. Now I would like to access the sequence of every read, which was aligned to a short genomic region on one chromosome of my reference. So the region of interest is 630 to 640 on chromosome reference and I want the sequence of every read within this region.

I wanted to use samtools mileup,but it only gives me nucleotide resolution abundance unfortunatly and samtools view "ref:630-640" only gives me full length reads originating somewhere within the region. Is there maybe a tool for this purpose availible?

RNA-seq • 4.5k views
ADD COMMENT
0
Entering edit mode

Whatever you want to do, i'm sure it can be done with pysam, but I'm not quite sure what you want? What is wrong with what samtools view is doing?

ADD REPLY
0
Entering edit mode

I used samtools view to get the reads which overlap with my region of interest, but I then had to crop the sequences within the bam files, so that I always get the variation of the epitop of interrest within the reads. At the end, I used sam2pairwise, as cmdcolin suggested, and used the annotation within the output with the annotation of the alignment, to acces the respecitve sequences.

ADD REPLY
3
Entering edit mode
5.2 years ago
cmdcolin ★ 3.8k

This is not a complete solution, but you could post process the output of https://github.com/mlafave/sam2pairwise

For example, you could use samtools view file.bam reference:630-640 | sam2pairwise | script_that_processes_each_outputted_record_to_get_subsection where this last step is sort of an exercise, but is based on post processing the output

ADD COMMENT
0
Entering edit mode

Thank you, I can definitly work with that.

Would the samtools view file.bam reference:630-640 lead to reads which only start within the region? Or does it also ouput reads starting, for example at 600 and endong at 650? I used it on my data, but was not sure if the output is correct

ADD REPLY
0
Entering edit mode

It will output all reads overlapping the region, quick sanity check verifies that some reads with start coordinate less than my query do show up

ADD REPLY
0
Entering edit mode

Ah ok thanks, the reads with deletions confused me, since they start somewhere completely different but overlap my region with a deletion.

ADD REPLY
0
Entering edit mode

I think "samtools view" gives all read sequences, as long as the first base of mapping position is in your interval. So probably "samtools view" is not working for your purpose. "Sambamba slice" also seems to slice the bam files in this way, capturing all reads "touching/overlaping" the interval, even only one base.

ADD REPLY
0
Entering edit mode

I think this is not true. It will output reads overlapping the given region. The samtools documentation http://www.htslib.org/doc/samtools.html says "You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). "

ADD REPLY
0
Entering edit mode

Thank you! Slicing the bam files by genomic coordinates is exactly what I want at the end. I will give sambamba slice a try. I just tested it again and apparently, when using samtools view file.bam "chr:630-640" gives me all reads which somehow overlap with the region. Some reads start at position 34 with 200 mapping nucleotides, then a 2000nt long deletion and then again a short stretch of matching nucleotides.

ADD REPLY
0
Entering edit mode

I may have misstated information about "samtools view". I meant it would capture all read alignments touching/overlapping the input region, even the mapping position of first base is outside the region, as long as some part of the read overlap with the reference within this region. Sambamba works the same way, so you can't really "slice" the bam "sharply", cutting both reference and read alignment with "samtools view" and "sambamba slice". You're basically just slicing the reference and the mapped reads would have "overhangs" outside of the input region boundaries.

ADD REPLY
0
Entering edit mode

Ok, I will have to do it manually then by counting all mapped nucleotides from the first mapped position, I guess.

ADD REPLY

Login before adding your answer.

Traffic: 2433 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6