how to get gene names and gene id given co ordinates from ensembl
2
0
Entering edit mode
7.0 years ago

I need to get annotations for co ordinates that are from junctions.bed file (tophat) eg.

file1:

  1 4764597 4767606 -
  1 4764597 4766491 -
  1 4766882 4767606 -
  1 4767729 4772649 -
  1 4767729 4768829 -
  1 4767729 4775654 -
  1 4772382 4772649 -
  1 4772814 4774159 -

using ensembl mouse.37.67 gtf or gff what is the best way to get the gene id and gene name for these co ordinates(~211000 rows)?

couple of questions regarding this 1) read that bedtools can do this but not sure if it takes gtf (from documentation it only takes BAM/BED/GFF/VCF) 2) can i convert gtf to gff? 3) if i have to code myself how do i factor the (-) or (+) strand

any pointers appreciated.

RNA-Seq Ensembl • 5.9k views
ADD COMMENT
6
Entering edit mode
7.0 years ago

Via BEDOPS bedmap and gtf2bed:

$ bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed < annotations.gtf) > answer.bed

Replace --echo-map-id-uniq with --echo-map if you want the entire annotation.

Replace gtf2bed with gff2bed if your annotations are in GFF format.

You can use awk to easily split UCSC BED files by stranded subsets, before operation.

$ awk '($6=="+")' in.bed > in.forwardStranded.bed

Replace the 6 in $6 with the column number of your strand, if it is not in the sixth column (per UCSC).

ADD COMMENT
0
Entering edit mode

the question is do we need to account for strand while performing such overlap and annotation.

ADD REPLY
1
Entering edit mode

It depends on your junctions.bed and annotations files. If your reference file is stranded and the annotations are stranded, then separate the reference file by strand, separate the annotations by strand, and run the set operations as shown on the separated files.

For the forward-stranded sets:

$ awk '($6=="+")' junctions.bed > junctions.fw.bed
$ gff2bed < annotations.gff | awk '($6=="+")' > annotations.fw.bed
$ bedmap --echo --echo-map-id-uniq --delim '\t' junctions.fw.bed annotations.fw.bed > answer.fw.bed

For the reverse-stranded sets:

$ awk '($6=="-")' junctions.bed > junctions.rv.bed
$ gff2bed < annotations.gff | awk '($6=="-")' > annotations.rv.bed
$ bedmap --echo --echo-map-id-uniq --delim '\t' junctions.rv.bed annotations.rv.bed > answer.rv.bed

Then you could do a multiset union operation on the stranded answers to get a single file with all your overlaps:

$ bedops --everything answer.fw.bed answer.rv.bed > answer.bed
ADD REPLY
0
Entering edit mode

awesome thanks will try..

ADD REPLY
1
Entering edit mode

Please remember to change the column number when running awk on junctions.bed, if the strand is not in the sixth column.

ADD REPLY
0
Entering edit mode

`/bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed < Mus_musculus.NCBIM37.67.gtf) > answer.bed Usage: gtf2bed FILE.gtf

gtf2bed: error: Expected single argument (GTF file)`

not sure why is it throwing this error.

ADD REPLY
0
Entering edit mode

The convert2bed binary (what runs the BEDOPS version of gtf2bed) would not print any error messages with that text. Do you possibly have another non-BEDOPS script called gtf2bed in your environment's PATH? You might do something like which gtf2bed or locate gtf2bed to find out where this non-BEDOPS script is, and rename or move it out of your PATH.

ADD REPLY
0
Entering edit mode

removed the "<" sign and it did not throw any error but my answer.bed file just has co ordinates no meta info, ./bedmap --echo --echo-map-id-uniq --delim '\t' junctions.bed <(gtf2bed Mus_musculus.NCBIM37.67.gtf) > answer.bed input file top few lines

11      98566330        98566433        -       
11      98566295        98566433        -       
11      98566581        98566836        -       
11      98566916        98567693        -       
11      98567792        98568140        -       
11      98568230        98568432        -       
11      98568230        98568851        -       
11      98568613        98568851        -

output file is the same..

ADD REPLY
0
Entering edit mode

You must pass input to gtf2bed via standard input. In other words, gtf2bed in.gtf will fail if you are using the BEDOPS version of gtf2bed. You must specify gtf2bed < in.gtf if you are using the BEDOPS version of gtf2bed. If you are using a non-BEDOPS version of gtf2bed, I don't know what it produces, so I don't know if or how it will work.

ADD REPLY
0
Entering edit mode

If you want to be more explicit and also avoid problems with some other non-BEDOPS tool on your system, you can also use convert2bed --input=gtf --output=bed < in.gtf in place of gtf2bed < in.gtf. Also, you should have BEDOPS tools findable in your PATH (e.g. copied to /usr/local/bin or similar), or this command will fail.

ADD REPLY
0
Entering edit mode

worked like charm, thank you for prompt response!

  11    98566330    98566433    -   ENSMUSG00000017210
  11    98566295    98566433    -   ENSMUSG00000017210
  11    98566581    98566836    -   ENSMUSG00000017210
  11    98566916    98567693    -   ENSMUSG00000017210
  11    98567792    98568140    -   ENSMUSG00000017210
  11    98568230    98568432    -   ENSMUSG00000017210
  11    98568230    98568851    -   ENSMUSG00000017210
ADD REPLY
0
Entering edit mode

I had a question regarding the output, some places show blank against the co ordinates and some of them have multiple gene names(i was wondering that junctions should be unique), but when run it via biomart they have geneid and name for eg bedmap output

  1 6235060 6235239 +   ENSMUSG00000025907
  1 6235425 6235506 +   ENSMUSG00000025907
  1 6235587 6235679 +   ENSMUSG00000025907
  1 6235740 6237787 +   ENSMUSG00000025907
  1 6235740 6235832 +   
  1 6237890 6238147 +   ENSMUSG00000025907
  1 6238273 6238357 +   ENSMUSG00000025907
  1 6240248 6251011 +   ENSMUSG00000025907
  1 6240248 6241012 +   ENSMUSG00000025907
  1 6240248 6240636 +   
  1 6251176 6252915 +   ENSMUSG00000025907
  1 4857613 4868108 +   ENSMUSG00000025903;ENSMUSG00000033813
  1 4868213 4876825 +   ENSMUSG00000025903;ENSMUSG00000033813;ENSMUSG00000062588

ensembl output

   Ensembl Gene ID  Associated Gene Name
   ENSMUSG00000025907   Rb1cc1
ADD REPLY
1
Entering edit mode

Some regions will be blank if there are no annotations which overlap those regions.

Some regions will have more than one annotation mapped to them, if there are multiple overlaps.

In bedmap, you could use --skip-unmapped to remove elements without overlaps. You can also add the --fraction-map 1 option to ensure that 100% of the annotation overlaps — falls entirely within — your region of interest.

This can be too stringent, so you could adjust this fractional value between >0.0 (almost no overlap) and 1.0 (full overlap).

Take a look at the --fraction-* options in bedmap --help or the online docs for a fuller explanation of these constraint operations.

ADD REPLY
1
Entering edit mode

Also, your regions of interest are not sorted correctly, which can cause issues with the results.

Use sort-bed junctions.bed > junctions.sorted.bed and then use junctions.sorted.bed for any further downstream set operations, or you could get incorrect output.

ADD REPLY
3
Entering edit mode
7.0 years ago
  1. bedtools intersect accepts a GTF file.
  2. GTF is a vague subversion (sometimes called 2.2) of GFF.
  3. bedtools can do strand-specific intersects, though whether that makes sense or not is up to you.
ADD COMMENT
0
Entering edit mode

thanks devon, I am confused with the results,

command used

bedtools intersect -wa -a mm9.gff -b junction.bed > bedtools.out.txt
cat junction.bed | wc -l
211618
cat bedtools.out.txt | wc -l 
746085

am I not suppose to get the same number of lines?

ADD REPLY
0
Entering edit mode

I imagine it intersects everything, so filter out the exons/CDS/transcript entries from the results.

ADD REPLY

Login before adding your answer.

Traffic: 1835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6