Extracting regions from .bam file using a .gff or .gtf file
3
6
Entering edit mode
10.0 years ago
Tom A ▴ 110

Hello,

What is the most efficient way to extract RNA-seq reads from a .bam file that match to a genomic region defined in a .gff or .gtf file?

I'm new to this so my terminology may be incorrect, but I basically want to "merge" a .bam file with a .gff (or .gtf) file to extract a list of the sequences that fall into that defined region.

Thanks in advance.

bam gtf RNA-Seq gff • 15k views
ADD COMMENT
1
Entering edit mode

Regarding your terminology, I would call what you're trying to do "intersecting" the files instead of "merging"

ADD REPLY
3
Entering edit mode
10.0 years ago

You could do this with the BEDOPS toolkit.

First, convert the datasets into BED files, e.g.:

$ bam2bed < reads.bam > reads.bed
$ gff2bed < annotations.gff > annotations.gff.bed
$ gtf2bed < annotations.gtf > annotations.gtf.bed

BEDOPS tools are written to work with standard input and output streams, so you can use piping to filter annotations for classes of elements, for instance:

$ gff2bed < annotations.gff | grep -w "gene" > genes.gff.bed
$ gtf2bed < annotations.gtf | grep -w "exon" > exons.gtf.bed
...

Then do set or map operations on BED files, e.g. to get all reads that overlap annotations, use bedops --element-of:

$ bedops --element-of -1 reads.bed annotations.gff.bed > justReadsThatOverlapAnnotations.bed

And to get both reads and their associated, overlapping annotations, use bedmap --echo --echo-map:

$ bedmap --echo --echo-map reads.bed annotations.gtf.bed > readsWithAssociatedAnnotations.bed

Note here that set operations return members of the reference set (or subsets thereof), while map operations return elements from both reference and map sets. The documentation goes into more detail about both operations and arguments for each application.

ADD COMMENT
0
Entering edit mode

Thank you Alex! I will give it a shot.

ADD REPLY
0
Entering edit mode

Hope this helps! I check in here every few days, so if you have any questions about using these tools, feel free to post them here or on our user forum.

ADD REPLY
0
Entering edit mode

Ok, so Im not very good with Unix. But after installing and running all of the make commands and the cp bin/* /usr/local/bin command, I get the following error:

[bam2bed] - The samtools binary could not be found in your user PATH -- please locate and install this binary

even though the binary is in the same directory I am in. Any thoughts?

ADD REPLY
0
Entering edit mode

Make sure samtools is in /usr/local/bin or somewhere in your PATH where the script can find it (run echo $PATH to see which directories will be searched through, for binaries like samtools and others).

ADD REPLY
1
Entering edit mode
10.0 years ago

The simplest way would be to convert the GTF/GFF file to a BED format (you lose a lot of information, of course), and then use samtools view -hL regions.bed alignments.bam > alignments_in_regions.sam.

Now having said that, it'd be good to know what you plan to do with the reads as there's likely a better way to do what you actually want.

ADD COMMENT
0
Entering edit mode

Thank you Devon. How does one convert to a BED format? Is there a samtools command for that as well?

ADD REPLY
0
Entering edit mode
awk '{printf("%s\t%s\t%s\t.\t%s\t%s\n",$1,$4,$5,$6,$7)}' file.gtf > file.bed

is the simplest method, though you probably want to only look at exons or something like that, so just add a test for that.

ADD REPLY
0
Entering edit mode

As I can see from https://bedtools.readthedocs.io/en/latest/content/general-usage.html

BED6: A BED file where each feature is described by chrom, start, end, name, score, and strand.

For example: chr1 11873 14409 uc001aaa.3 0 +

and format of Ensembl gtf file is

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";

hence, in order to include gene_id in "name", I have tweaked your command a little

awk '{gsub(/"|;/, "", $10); printf("%s\t%s\t%s\t%s\t%s\t%s\n",$1,$4,$5,$10,$6,$7)}' Homo_sapiens.GRCh38.87.gtf > Homo_sapiens.GRCh38.87.gtf.bed

Will this be a valid approach?

ADD REPLY
0
Entering edit mode

Not only is this easily done w/ samtools, but the user proably already samtools installed (or should if he doesn't ;)

ADD REPLY
1
Entering edit mode
10.0 years ago
Chris Fields ★ 2.2k

As Devon mentioned it completely depends on what you are trying to accomplish. Are the regions pretty self-contained or discontinuous? Is the sequence data from an RNA-Seq experiment, where you have to take splice junctions into consideration?

For RNA-Seq I have used both htseq-count</a> and featureCounts to get counts via GTF (I am more consistently using the latter as it takes multiple BAM files as input and the results are the same as htseq). If you have GFF you can convert to GTF using Cufflink's excellent gffread tool.

You can also get counts from a BED file as mentioned above by Devon and Alex, though these (in general) require a little more fiddling (e.g. may require bed12 output). bedtools is what we have traditionally used; not familiar with BEDOPS yet though I may have to give it a try.

ADD COMMENT

Login before adding your answer.

Traffic: 2038 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6