Refseq fasta files from different species
1
1
Entering edit mode
7.8 years ago

Hello All,

I'm new to the forum and I'm looking for a bit of help. I'd like to pull down the Refseq fasta protein sequences of a gene (all isoforms), however I'd like to do that with several species. For example, I'd like to get the gene ACTB for human (which has only 1 isoform NP_001092.1), but also get sequences for mouse (NP_031419.1), and macaca fascicularis ( NP_001271954.1). This is easy for an example since each only has one isoform, and the monkey assembly is well annotated for this gene so it doesn't have the XP_ type sequence it usually does.

This procedure is a bit tricky because sometimes the gene I'm pulling may not exist in one of the other species (i.e. not present in mouse), and importantly I need them all to be from the same reference source build (i.e. should all be RefSeq or all Ensembl). I have a very strong preference for RefSeq though.

Here's what I've tried using other methods in R, but no success:

library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(rentrez)
symbols <- c("^ACTB$")
grep(symbols, keys(x = org.Hs.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in human
grep(symbols, keys(x = org.Mm.eg.db, keytype = "SYMBOL"), ignore.case = T, perl = T) #test if gene found in mouse
keys.h <- keys(x = org.Hs.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Hs.eg.db, 
                       keytype = "SYMBOL"), ignore.case = T, perl = T)]
keys.m <- keys(x = org.Mm.eg.db, keytype = "SYMBOL")[grep(symbols, keys(x = org.Mm.eg.db, 
                       keytype = "SYMBOL"), ignore.case = T, perl = T)]
#map HGNC to ID#
id.h <- mapIds(x = org.Hs.eg.db, keys = keys.h, keytype = 'SYMBOL', column = 'ENTREZID')
id.m <- mapIds(x = org.Mm.eg.db, keys = keys.m, keytype = 'SYMBOL', column = 'ENTREZID')
select(x = org.Hs.eg.db, keys = id.h, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
select(x = org.Mm.eg.db, keys = id.m, columns = c("ENTREZID", "REFSEQ", "SYMBOL"))
#pull Human protein data
id.link <- entrez_link(dbfrom = "gene", id = id.h, db = "protein")
id.prot <- id.link$links$gene_protein_refseq #this is where no protein is found
id.seqs <- entrez_fetch(db = "protein", id = id.prot, rettype = "fasta")

As you can see I'm able to retrieve human and mouse decently, but unfortunately I can't figure out a way to pull the monkey protein sequences. I can't find an equivalent "org.Mf.eg.db"database for the monkey, that would pull the NP_001271954.1 sequence. Can anyone help me figure this one out?

Thank you in advance!

sequence R NCBI RefSeq ortholog • 2.3k views
ADD COMMENT
0
Entering edit mode

Hey All,

It's been about 3 weeks and 102 views, so it seems like an interesting topic to people. Could anyone please assist?

ADD REPLY
1
Entering edit mode
7.8 years ago

well for any poor sod that comes across this post and wanted to know, I solved my own problem by utilizing the BSGenomes package and GenomicFeatures package:

> library(GenomicFeatures) cyno.5.0 <- makeTxDbFromUCSC(genome =
> "macFas5", tablename = "refGene") saveDb(x = cyno.5.0, file =
> "TxDb_cyno_UCSC_5_refGene")

This will build a TxDb object that can be used to get the sequences for translation. A bit convoluted but works for the less used organisms. This could be amended for another species by simply scanning other available builds (uses rtracklayer package loaded as dependency):

ucscGenomes() #scroll through to find relevant organisms
ADD COMMENT

Login before adding your answer.

Traffic: 2094 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6