Question

Exon coordinates and sequence

2

Entering edit mode

6.4 years ago

1769mkc ★ 1.2k

I have this problem so i want to take out all the exon coordinates and their respective sequences .One way is I take out the coordinates of exon from a gtf file then give that as input to ucsc genome browser , i get the DNA sequences ,but it looks like i get intronic sequences too.

So any suggestion or help how to do it better way .I m not how to proceed .As always sugesiton or help would highly appreciated

RNA-Seq • 5.4k views

ADD COMMENT • link updated 2.6 years ago by asalimih ▴ 60 • written 6.4 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

Hi where you gtf file come from ? There is few possibility for this error like not the same NM gene/transcript version so the coordinates from gtf files and ucsc are not the same.

ADD REPLY • link 6.4 years ago by Titus ▴ 910

0

Entering edit mode

i have the gtf file from gencode ...so what do you suggest?

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

You can extract coordinates of exon from ucsc (https://genome.ucsc.edu/cgi-bin/hgTables ) to be sure you compare the same ENST

ADD REPLY • link 6.4 years ago by Titus ▴ 910

0

Entering edit mode

Have you looked at bedtools getfasta? This does what you are looking for if I understood the question correctly.

ADD REPLY • link 6.4 years ago by WouterDeCoster 47k

0

Entering edit mode

no i haven;t looked into it so i have this problem which is as such , I want to get the exon coordinates and the respective sequence ,then i want to use that exon sequence for primer designing.So is it possible for the bedtools getFasta to help in that?

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

You need a gtf, gff or bed of the exons of interest. That shouldn't be too hard to find. Then bedtools getfasta can be used to slice out a fasta record per exon (or any other interval) which you can then use for primer design.

ADD REPLY • link 6.4 years ago by WouterDeCoster 47k

0

Entering edit mode

okay so you saying i need the exon coordinates and using those i can slice fasta sequence from a fasta file ?

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

Yes, you use the reference genome fasta and slice out the fragments of interest from the bed/gtf.

ADD REPLY • link 6.4 years ago by WouterDeCoster 47k

0

Entering edit mode

thank you will try your suggestion

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

I have extracted the field containing chr, start, end , +/-, and the column containing exon/transcript/cds . I have a made field , now i want to extract only those rows which contain exon only , I did it R its straight forward but i'm not sure if I can retain the bed format in R or not so how do i parse only the exon only ?from the bed file...

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

its simple awk so i did it anyways thank you for your input im using that only

ADD REPLY • link 6.4 years ago by 1769mkc ★ 1.2k

2

Entering edit mode

6.3 years ago

Kristoffer Vitting-Seerup ★ 4.0k

If your data is form GENCODE the easiest is simply to download the corresponding FASTA file from gencodeGenes - here is the link to the most recent human release but it is important you select the right release.

Alternatively I've used Cufflinks gffRead to extract the sequences of a GTF file with sucess multiple times.

ADD COMMENT • link 6.3 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

i read about the gffRead but i haven't tried it yet

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

here is an example of extracting only the exons of every transcript of a gff file:

gffread -w <out.fasta> -g <genome.fasta> <annotation.gff>

source

ADD REPLY • link 2.6 years ago by asalimih ▴ 60

score 4 · Accepted Answer · 2018-01-11

4

Entering edit mode

6.3 years ago

sacha ★ 2.4k

I did it like that:

1- Download refGene.txt.gz and hg19.fasta from the UCSC goldenpath. ( note: convert hg19.2bit to hg19.fa using twoBitToFa )
2- Create a bed file with exon coordiniate using my awk script

 // to_transcript.awk 

BEGIN 
{
 OFS ="\t"
}
{
    name=$2
    name2=$13
    sens = $4 =="+" ? "forward" : "reverse"
    chrom=$3

    split($10,exon_starts,",")
    split($11,exon_ends,",")


    for (i=1; i<length(exon_starts); i++)
    {
        start = exon_starts[i];
        end   = exon_ends[i];
        print(chrom, start, end, name2,sens, name)
    }

}

And run it :

  zcat refGene.txt.gz|awk -f to_transcript.awk > exon.bed

3 - Extract sequence using bedtools

  bedtools getfasta -fi hg19.fa -bed exon.bed -fo exon.fa -name

4 - You may want to reverse sequence for gene placed on the reverse strand. Split exon.bed into and exon_minus.bed and exon_plus.bed using grep on column 5th. And do same as I described above to have exon_minus.fa and exon_plus.fa You can then revert your sequence using :

   seqtk seq -r exon_minus.fasta> exon_minus.rev.fasta

and concat both file :

   cat exon_plus.fasta exon_minus.rev.fasta > all_exons.fa

ADD COMMENT • link 6.3 years ago by sacha ★ 2.4k

0

Entering edit mode

will try it thank you for making the script and let you know

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

@sacha

name=$2
    name2=$13
    sens = $4 =="+" ? "forward" : "reverse"
    chrom=$3
    exon_count=$9
    tx_start=$5
    tx_end=$6

    split($10,exon_starts,",")
    split($11,exon_ends,",")

as i use shell scripting for very basic purspose like write one liner but never use loop as such mostly i use R , so what i understand you are defining each field such as name=$2 is that what you are doing in the script , let me know if I'm understanding your script ...I would be glad if you can explain me..the awk part rest I used like bedtools but not seqtk...

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

I read the column description from ucsc table : https://genome.ucsc.edu/cgi-bin/hgTables. ( look for NCBI refseq and click on "describe table shema" )
As you see, each column correspond to :
name : transcript name
name2 : gene name
sens : gene direction
chrom: the chromosome where the gene is
exon_count : how many exon they are
tx_start : position of transcript's start
tx_end: position of transcript's end
exon_starts and exon_ends contains start and end positions of exon separated by comma.
So to get all exons coordinates, you should loop over those last fields. "split" function cut the field by comma and give you an array where you can loop over.

PS : the following statement are unused. You can remove it :
exon_count=$9
tx_start=$5
tx_end=$6

ADD REPLY • link 6.3 years ago by sacha ★ 2.4k

0

Entering edit mode

looks like the bed file exon coordinates i got from your script it doesn;t have tab space if i understand looking at it...

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

after running the awk script when i'm running bedtools to get fasta i get error like it looks like you have less than 3 columns at line :1 .Are you sure your files are tab-delimited?

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

yes, sorry... I just fixed that above by adding BEGIN { OFS = "\t" }

ADD REPLY • link 6.3 years ago by sacha ★ 2.4k

0

Entering edit mode

yes i added in the print command of the last line now it runs perfectly fine and Im able to get the sequence as well using the getfasta command as well so all these sequences are mature exon sequences for each gene isn;t it i don't see any interval when earlier I parsed out all the exon sequences like for 1 gene there might be 3 exon so i get those interval now it's gone ,am i correct?

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k

1

Entering edit mode

One sequence = exon1+exon2+...+exonN

ADD REPLY • link 6.3 years ago by sacha ★ 2.4k

0

Entering edit mode

okay so it's the total mature sequence , but as i see the output like this .

>DDX11L1
CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATAGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCA
>DDX11L1
GTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAG

one of the output using your script ..so here i see two DDX11L1 exon i suppose is this the way am I suppose to get output ?or am I doing something wrong

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k

0

Entering edit mode

There are many transcripts for one gene. You can replace the gene name by the transcript name to have a unique identifier :

    print(chrom, start, end, name2,sens, name)

by

    print(chrom, start, end, name,sens, name)

ADD REPLY • link 6.3 years ago by sacha ★ 2.4k

0

Entering edit mode

so i wouldn't go for name as coordinates is better for me since my idea of getting the coordinates of all the exon sequence i would go for start and end chromosome only .and then the sequence and join them as one mature sequence i have made a separate post for this like new question here it is have a look

ADD REPLY • link 6.3 years ago by 1769mkc ★ 1.2k