Biostar Beta. Not for public use.
Extract Reads From A Bam File That Fall Within A Given Region
21
Entering edit mode
6.0 years ago
abi • 280

Hi all

This question could be considered as a follow up of this discussion. http://www.biostars.org/post/show/3407/how-to-extract-reads-from-bam-that-overlap-with-specific-regions/#3414 What I need is to extract reads from bam file that fall only within a given region (not overlap the given region), the region being in the form of a gff file or bed file. Overlapping reads could be extracted by several methods (as in the discussion mentioned or BEDTools). The idea is to try to be pretty sure of excluding 5´ UTRs in the process of detecting intergenic transcripts. I saw a tool in BamUtil (http://genome.sph.umich.edu/wiki/BamUtil) called "writeRegion" which would pretty much do what I want. Somehow could not get this running for my dataset. Was wondering if you guys might have an "R" or some other solution for this. Thanks in advance Abi

bam • 55k views
ADD COMMENTlink
2
Entering edit mode

I think the options have changed since this reply was written if you want the default headerless sam file then use the command as per Damian Kao

samtools view input.bam "Chr10:18000-45500" > output.sam

if it's a bam output that you want you will need a '-b'

samtools view -b input.bam "Chr10:18000-45500" > output.bam
ADD REPLYlink
2
Entering edit mode

Do not forget to include the -h flag whilst doing this, i.e., if you want to retain the SAM header, which will be required for many downstream applications.

-h include header in SAM output

ADD REPLYlink
1
Entering edit mode

Headers are automatically included when the output format is BAM since BAM requires the header to be included.

ADD REPLYlink
4
Entering edit mode

Without -h output.BAM has a header but IGV will not show any reads. In my case was necessary to add -h to display output.bam correctly

 samtools view -b -h input.bam "Chr10:18000-45500" > output.bam
ADD REPLYlink
21
Entering edit mode
3.0 years ago
USA

You can extract mappings of a sam/bam file by reference and region with samtools. For example:

samtools view input.bam "Chr10:18000-45500" > output.bam

That would output all reads in Chr10 between 18000-45500 bp.

ADD COMMENTlink
0
Entering edit mode

Thanks Dk for your answer, in your example the output will be all reads starting anywhere between 18000-45500. What I want is to get out only the reads which also end in this region. For example starting at 18000 or higher and ending at 45500 or lower. Abi

ADD REPLYlink
0
Entering edit mode

That is confusing. Would you please explain little bit more, which reads do you want in output?

ADD REPLYlink
0
Entering edit mode

I am sorry if I am confusing you guys. Lets say I have a 4 reads with the following start and end coordinates aligned on to the chromosome.

start coordinates of the reads: 900, 1000, 1100, 1900

end coordinates of the reads: 1200, 1500,1700, 2300

My region of interest is 1000:2000

I want my output to be in this case the second and third reads starting and ending within my region of interest.

The first and last reads should be excluded because even though they overlap my region of interest, they extend away from my region (1000:2000). I hope my question is clearer now. Abi

ADD REPLYlink
0
Entering edit mode

do you have spliced reads as well?

ADD REPLYlink
0
Entering edit mode

he wants the read with overlaps "any" at the left-end and "within" at the right end. If you consider the position 18000-45000, he needs all reads with overlap of at least 1 base to 18000, but all reads should end less than or equal to 45000.

ADD REPLYlink
0
Entering edit mode

Yep Arun thats what I want and BTW, I do not have spliced reads!

ADD REPLYlink
2
Entering edit mode

abi, this should be possible with R and IRanges. For loading a bam file, refer to this post. If you have questions about using IRanges, let me know. I'll post an example.

ADD REPLYlink
0
Entering edit mode

hi kao , can u pls tell me how to index this bam file because wen I tried to make index it showed error like: Bam header is absent. Also can u suggest any tool r the way to merge the overlapping region in bam file and make into single fasta file.

ADD REPLYlink
0
Entering edit mode

Use "index" from samtools, described here.

samtools index test.bam
ADD REPLYlink
0
Entering edit mode

How can you randomly extract reads from two bam files to simulate sample contaminations? I am trying to simulate sample contamination for different level of dilution for NGS samples. Suppose I have two bam files for SampleA and SampleB. I want to generate 5 contaminated samples at dilution of 10%, 20%, 30%,40% and 50% of those two samples. I understand that I should extract reads from one of the two bam files at the given dilution percentage and reassign to the other bam file, but I don't know exactly how to do this. Could you please explain me the procedure? Thanks

ADD REPLYlink
5
Entering edit mode
6.0 years ago
abi • 280

Thanks guys, Actually the answer was there in Michael´s explanation in http://www.biostars.org/post/show/3407/how-to-extract-reads-from-bam-that-overlap-with-specific-regions/#3414 In his example, countOverlaps(genes, reads, type = "within") would do what I want. If type is within, the query interval must be wholly contained within the subject interval.

Abi

ADD COMMENTlink
0
Entering edit mode

yes, that's right. This is what I thought you'd want to go and look up and come back with questions. Seems the usage is already shown here. Nice!

ADD REPLYlink
1
Entering edit mode
12 months ago
Seattle, WA USA

Via BEDOPS bedmap --fraction-map 1 to return only reads which fall entirely within intervals in intervals.bed:

$ bedmap --echo --fraction-map 1 <(bam2bed < reads.bam) intervals.bed > answer.bed

If you want to do some ad-hoc search:

$ bedmap --echo --fraction-map 1 <(bam2bed < reads.bam) <(echo -e "chrZ\t1234\t5678") > answer.bed
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1