How can I get a bed file of refseq exons numbered according to their order on the transcript
1
3
Entering edit mode
9.7 years ago
Duarte Molha ▴ 240

Hi everyone.

I downloaded from UCSC a bed file containing all refseq exons. However, USCS numbers them according to genomic order, so a transcript on the reverse strand would have has his last exon numbered with index 0 instead of the last index.

Can anyone point me to how I can obtain a file where exons are numbered acording to their position on the transcript?

As an example SF3B1 is represented on USCS bed file like:

chr2    198256697    198257185    NM_012433_exon_0_0_chr2_198256698_r    0    -
chr2    198257695    198257912    NM_012433_exon_1_0_chr2_198257696_r    0    -
chr2    198260779    198261052    NM_012433_exon_2_0_chr2_198260780_r    0    -
chr2    198262708    198262840    NM_012433_exon_3_0_chr2_198262709_r    0    -
chr2    198263184    198263305    NM_012433_exon_4_0_chr2_198263185_r    0    -
chr2    198264778    198264890    NM_012433_exon_5_0_chr2_198264779_r    0    -
chr2    198264975    198265158    NM_012433_exon_6_0_chr2_198264976_r    0    -
chr2    198265438    198265660    NM_012433_exon_7_0_chr2_198265439_r    0    -
chr2    198266123    198266249    NM_012433_exon_8_0_chr2_198266124_r    0    -
chr2    198266465    198266612    NM_012433_exon_9_0_chr2_198266466_r    0    -
chr2    198266708    198266854    NM_012433_exon_10_0_chr2_198266709_r    0    -
chr2    198267279    198267550    NM_012433_exon_11_0_chr2_198267280_r    0    -
chr2    198267672    198267759    NM_012433_exon_12_0_chr2_198267673_r    0    -
chr2    198268308    198268488    NM_012433_exon_13_0_chr2_198268309_r    0    -
chr2    198269799    198269901    NM_012433_exon_14_0_chr2_198269800_r    0    -
chr2    198269998    198270196    NM_012433_exon_15_0_chr2_198269999_r    0    -
chr2    198272721    198272843    NM_012433_exon_16_0_chr2_198272722_r    0    -
chr2    198273092    198273305    NM_012433_exon_17_0_chr2_198273093_r    0    -
chr2    198274493    198274731    NM_012433_exon_18_0_chr2_198274494_r    0    -
chr2    198281464    198281635    NM_012433_exon_19_0_chr2_198281465_r    0    -
chr2    198283232    198283312    NM_012433_exon_20_0_chr2_198283233_r    0    -
chr2    198285151    198285266    NM_012433_exon_21_0_chr2_198285152_r    0    -
chr2    198285752    198285857    NM_012433_exon_22_0_chr2_198285753_r    0    -
chr2    198288531    198288698    NM_012433_exon_23_0_chr2_198288532_r    0    -
chr2    198299695    198299771    NM_012433_exon_24_0_chr2_198299696_r    0    -

What I would require would be something like :

chr2    198256697    198257185    NM_012433_exon_25    0    -
chr2    198257695    198257912    NM_012433_exon_24    0    -
chr2    198260779    198261052    NM_012433_exon_23    0    -
chr2    198262708    198262840    NM_012433_exon_22    0    -
chr2    198263184    198263305    NM_012433_exon_21    0    -
chr2    198264778    198264890    NM_012433_exon_20    0    -
chr2    198264975    198265158    NM_012433_exon_19    0    -
chr2    198265438    198265660    NM_012433_exon_18    0    -
chr2    198266123    198266249    NM_012433_exon_17    0    -
chr2    198266465    198266612    NM_012433_exon_16    0    -
chr2    198266708    198266854    NM_012433_exon_15    0    -
chr2    198267279    198267550    NM_012433_exon_14    0    -
chr2    198267672    198267759    NM_012433_exon_13    0    -
chr2    198268308    198268488    NM_012433_exon_12    0    -
chr2    198269799    198269901    NM_012433_exon_11    0    -
chr2    198269998    198270196    NM_012433_exon_10    0    -
chr2    198272721    198272843    NM_012433_exon_9    0    -
chr2    198273092    198273305    NM_012433_exon_8    0    -
chr2    198274493    198274731    NM_012433_exon_7    0    -
chr2    198281464    198281635    NM_012433_exon_6    0    -
chr2    198283232    198283312    NM_012433_exon_5    0    -
chr2    198285151    198285266    NM_012433_exon_4    0    -
chr2    198285752    198285857    NM_012433_exon_3    0    -
chr2    198288531    198288698    NM_012433_exon_2    0    -
chr2    198299695    198299771    NM_012433_exon_1    0    -

Can someone help? I know I could write a script to convert these but I have trouble believing there isn't a source our there already containing the information in this format... However my google skills seem inadequate because I have failed to find them

Many thanks
Duarte

exon-numbering bed refseq • 4.7k views
ADD COMMENT
2
Entering edit mode
9.7 years ago

It's just a couple lines of R code.

library(rtracklayer)
d <- import.bed("foo.bed")
#Add transcript names, since they're missing
d$name <- sapply(strsplit(d$name,"_exon_", fixed=T), function(x) return(x[1]))

dl <- split(d, d$name)
add_exons <- function(x) {
    nums = c(1:length(x))
    if(as.character(strand(x)[1]) == "-") {
        nums = rev(nums)
    }
    x$name = sprintf("%s_exons_%i", x$name, nums)
    return(x)
}
d <- unlist(endoapply(dl, add_exons))
export.bed(d, "foo2.bed")
ADD COMMENT
2
Entering edit mode

Thanks Devon

I knew I could code it as well... I personaly prefer to do it in awk and a couple if bash utils :) but I was wondering if there would be a ready made resource where I could get this that I might not be aware of :)

Seems to be to be the logical numbering way of ordering exons... genomic index order you can always obtain from the coordinates themselves :)

Many thanks
Duarte

PS... in case you want to know how I did it, assuming you have a bed file with the refseq exons like the ones I indicated in the post then you can just do 2 awk commands:

For the reverse strand transcripts (exons on the reverse strand - invert indexing)

awk '
  {
    if ($6 == "-") {
      split ($4,a,"_exon_");
      print $1,$2,$3,a[1],$5,$6
    }
  }' refseq_exons.bed | \
sort -k 4,4 -k 2nr,2 | \
awk '
  BEGIN {
    i=1;
    t="";
    OFS="\t"
  }
  {
    if(t == $4) {
      print $1,$2,$3,$4"_Exon_"i++,$5,$6;
    }
    else {
      i=1;
      t=$4;
      print $1,$2,$3,t"_Exon_"i++,$5,$6;t=$4
    }
  }'
> refseq_exons_transcript_numbered.bed

For the forward strand transcripts (exons on the same as genomic order)

awk '
{
  if ($6 == "+") {
    split ($4,a,"_exon_");
    print $1,$2,$3,a[1],$5,$6
  }
}'  refseq_exons.bed | \
sort -k 4,4 -k 2n,2 | \
awk '
  BEGIN {
    i=1;
    t="";
    OFS="\t"
  }
  {
    if(t == $4) {
      print $1,$2,$3,$4"_Exon_"i++,$5,$6;
    }
    else {
      i=1;
      t=$4;
      print $1,$2,$3,t"_Exon_"i++,$5,$6;t=$4
    }
  }'
>> refseq_exons_transcript_numbered.bed
ADD REPLY
0
Entering edit mode

You might try the Ensembl annotations instead. They tend to be much more coherently formatted.

ADD REPLY
0
Entering edit mode

Ensembl Exons are not linked to any particular transcript so they do not have order numbers

ADD REPLY
0
Entering edit mode

Sure they do, though at least the mouse annotation has the same numbering annoyance.

ADD REPLY
0
Entering edit mode

Where can you get the order number? I have been using the perl API and I cannot find any reference to the order number.

I believe that ensembl does not link each exon to a specific transcript (because many transcripts share 1 or more exons) so it does not make sense to add a order number to each exon.

But If I am wrong would you mind sharing with me how I can access that information ? It would be very useful for me

ADD REPLY
0
Entering edit mode

Just download the annotation GTF file rather than trying to use the perl API. The annotation file itself will have multiple transcripts with the exons for each. If an exon is in multiple transcripts, it'll appear once for each.

ADD REPLY
0
Entering edit mode

BTW, nice awk solution.

ADD REPLY

Login before adding your answer.

Traffic: 1513 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6