Bulk download introns, exons, and UTR regions from Ensembl for a gene prediction training set
6.3 years ago

Hi, I would like to download labeled FASTA sequences of introns, exons, 5' UTR regions, and 3' UTR regions from a nonredundant set of human genes.

Ensembl allows me to do this for an individual gene by going to a page on an individual transcript variant (https://www.ensembl.org/Homo_sapiens/Transcript/Exons?db=core;g=ENSG00000139618;r=13:32889611-32973805;t=ENST00000544455) and clicking Download Sequence > FASTA.

Is there a way to automatically download a file like that for several thousand genes? I would like them all to be human (or at least mammalian) and protein-coding. BioMart seems to be down right now. I'm willing to try the Perl, REST, or SQL APIs, but I have no experience with any of them, so some direction would be appreciated.

Ultimately I want a database of DNA sequences labeled as intron, exon, 5' UTR, or 3' UTR. If other databases (e.g. RefSeq) can provide it, that would be great too. Thanks!

sequence annotation intron exon ensembl • 3.0k views
6.3 years ago

Check out the biomaRt R package, specifically the getSequence function. It takes a list of gene identifiers (Ensembl or Entrez Gene) and returns the kind of sequence you select with the seqType parameter (cdna, 3utr, 5utr, gene_exon, gene_exon_intron, etc.). As far as I know there is no intron-only seqType; gene_exon_intron returns the unspliced gene sequence, exons and introns together.

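library(biomaRt)

# Connect to the human gene dataset in Ensembl BioMart
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Gene IDs must be quoted character strings
Ensembl_IDs <- c("ENSG00000139618", "ENSG00000128731")

# Returns a data frame with the exon sequences and their gene IDs
seqs <- biomaRt::getSequence(id = Ensembl_IDs,
                             type = "ensembl_gene_id",
                             seqType = "gene_exon",
                             mart = mart)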
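
Since you want every sequence labeled by region, here is a minimal base R sketch of two steps you would probably need afterwards: putting a region label into each FASTA header, and working out intron intervals as the gaps between consecutive exons. The helper names `to_labeled_fasta` and `introns_from_exons` are my own, not part of biomaRt, and the data are toy values. I am assuming the input data frame looks like getSequence output (one sequence column named after the seqType, one identifier column) and that the exon coordinates belong to a single transcript with non-overlapping exons. You could probably get such coordinates with getBM, but I have not checked the exact attribute names.

```r
# Build FASTA lines whose headers carry a region label (exon, 5utr, ...).
# `df` is assumed to resemble getSequence() output: one column with the
# sequence and one column with the identifier.
to_labeled_fasta <- function(df, seq_col, id_col, label) {
  headers <- paste0(">", df[[id_col]], "|", label)
  # rbind + as.vector interleaves: header1, seq1, header2, seq2, ...
  as.vector(rbind(headers, as.character(df[[seq_col]])))
}

# Introns are the gaps between consecutive exons of one transcript.
# `starts` and `ends` are 1-based inclusive genomic exon coordinates.
# Assumes the exons do not overlap (i.e. they come from a single transcript).
introns_from_exons <- function(starts, ends) {
  ord <- order(starts)
  starts <- starts[ord]
  ends <- ends[ord]
  n <- length(starts)
  if (n < 2) {
    # A single-exon transcript has no introns
    return(data.frame(start = numeric(0), end = numeric(0)))
  }
  data.frame(start = ends[-n] + 1, end = starts[-1] - 1)
}

# Toy data for illustration only, not real Ensembl records
exons <- data.frame(
  gene_exon = c("ATGGCC", "GGTTAA"),
  ensembl_gene_id = c("GENE_A", "GENE_A"),
  stringsAsFactors = FALSE
)

fasta_lines <- to_labeled_fasta(exons, "gene_exon", "ensembl_gene_id", "exon")
introns <- introns_from_exons(starts = c(100, 300, 500),
                              ends = c(200, 400, 600))

print(fasta_lines)
print(introns)
```

You can write the result to disk with writeLines(fasta_lines, "exons.fa"). Repeating the getSequence call with seqType = "5utr" and "3utr" and a matching label gives the other classes. The intron intervals are coordinates only, so you would still need to pull the corresponding genomic sequence yourself.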