I need to get annotations for co ordinates that are from junctions.bed file (tophat) eg.
file1:
1 4764597 4767606 -
1 4764597 4766491 -
1 4766882 4767606 -
1 4767729 4772649 -
1 4767729 4768829 -
1 4767729 4775654 -
1 4772382 4772649 -
1 4772814 4774159 -
using ensembl mouse.37.67 gtf or gff what is the best way to get the gene id and gene name for these co ordinates(~211000 rows)?
couple of questions regarding this 1) read that bedtools can do this but not sure if it takes gtf (from documentation it only takes BAM/BED/GFF/VCF) 2) can i convert gtf to gff? 3) if i have to code myself how do i factor the (-) or (+) strand
any pointers appreciated.
the question is do we need to account for strand while performing such overlap and annotation.
It depends on your
junctions.bed
and annotations files. If your reference file is stranded and the annotations are stranded, then separate the reference file by strand, separate the annotations by strand, and run the set operations as shown on the separated files.For the forward-stranded sets:
For the reverse-stranded sets:
Then you could do a multiset union operation on the stranded answers to get a single file with all your overlaps:
awesome thanks will try..
Please remember to change the column number when running
awk
onjunctions.bed
, if the strand is not in the sixth column.`/bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed < Mus_musculus.NCBIM37.67.gtf) > answer.bed Usage: gtf2bed FILE.gtf
gtf2bed: error: Expected single argument (GTF file)`
not sure why is it throwing this error.
The
convert2bed
binary (what runs the BEDOPS version ofgtf2bed
) would not print any error messages with that text. Do you possibly have another non-BEDOPS script calledgtf2bed
in your environment's PATH? You might do something likewhich gtf2bed
orlocate gtf2bed
to find out where this non-BEDOPS script is, and rename or move it out of your PATH.removed the "<" sign and it did not throw any error but my answer.bed file just has co ordinates no meta info,
./bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed Mus_musculus.NCBIM37.67.gtf) > answer.bed
input file top few linesoutput file is the same..
You must pass input to
gtf2bed
via standard input. In other words,gtf2bed in.gtf
will fail if you are using the BEDOPS version ofgtf2bed
. You must specifygtf2bed < in.gtf
if you are using the BEDOPS version ofgtf2bed
. If you are using a non-BEDOPS version ofgtf2bed
, I don't know what it produces, so I don't know if or how it will work.If you want to be more explicit and also avoid problems with some other non-BEDOPS tool on your system, you can also use
convert2bed --input=gtf --output=bed < in.gtf
in place ofgtf2bed < in.gtf
. Also, you should have BEDOPS tools findable in your PATH (e.g. copied to/usr/local/bin
or similar), or this command will fail.worked like charm, thank you for prompt response!
I had a question regarding the output, some places show blank against the co ordinates and some of them have multiple gene names(i was wondering that junctions should be unique), but when run it via biomart they have geneid and name for eg bedmap output
ensembl output
Some regions will be blank if there are no annotations which overlap those regions.
Some regions will have more than one annotation mapped to them, if there are multiple overlaps.
In
bedmap
, you could use--skip-unmapped
to remove elements without overlaps. You can also add the--fraction-map 1
option to ensure that 100% of the annotation overlaps — falls entirely within — your region of interest.This can be too stringent, so you could adjust this fractional value between >0.0 (almost no overlap) and 1.0 (full overlap).
Take a look at the
--fraction-*
options inbedmap --help
or the online docs for a fuller explanation of these constraint operations.Also, your regions of interest are not sorted correctly, which can cause issues with the results.
Use
sort-bed junctions.bed > junctions.sorted.bed
and then usejunctions.sorted.bed
for any further downstream set operations, or you could get incorrect output.