Retrieve sequences by gene name?
6.8 years ago
billneu ▴ 10

Please bear with my rookie questions:

I have thousands of gene names from Mus musculus (house mouse). I want to get the sequence of each gene and then use BLAST for pairwise comparisons.

The gene names are like:

MATN1

1300002E11RIK

LYSMD1

IBTK

MS4A10

TRIM11
...

  1. I first tried NCBI blastn against the whole online database with one of these gene names as the query; it returned 0 hits.
  2. I then googled the gene name, which led me to a database other than GenBank where I could download the corresponding sequence. Running blastn with that sequence against the whole online database returned good results.
  3. When I search for any of these gene names in Entrez, I get results from many different species with the same gene name.

Questions:

a. Are the gene names not recognized by the NCBI databases (because of 1 and 2)? Yet they seem to be recognized (because of 3).

b. With thousands of gene names, I can't google each gene to get its sequence, or search Entrez and manually pick the mouse entry among many species, right?
How should I map these gene names to accession numbers, so that I can use (for example) getgenbank(AccessionNumber) in Matlab to batch-retrieve the sequences and then run a local pairwise blastn? (Is this workflow OK, or is there a better way to compare two genes given only their names?)

I am really confused as a rookie in bioinformatics. Please help! Thank you very much!

blast gene sequence • 2.1k views

UniProt maintains a list of synonyms for gene names. Gene names can be converted to UniProt IDs, and these IDs can then be used to get other related IDs with the ID mapping tool (http://www.uniprot.org/uploadlists/). Alternatively, Ensembl BioMart could be used.
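
As a rough illustration of the symbol-to-UniProt step, the sketch below only builds a search URL restricted to mouse. The endpoint `https://rest.uniprot.org/uniprotkb/search` and the query fields `gene_exact` and `organism_id` are assumptions about UniProt's current REST service (it is not the upload page linked above), so please check them against the UniProt documentation before relying on this. The function name is made up for the example.

```python
from urllib.parse import urlencode

# Assumed endpoint of UniProt's REST search service; verify before use.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"
MOUSE_TAXON_ID = 10090  # NCBI taxonomy ID for Mus musculus


def uniprot_search_url(symbol, taxon_id=MOUSE_TAXON_ID):
    """Build a search URL for one gene symbol restricted to one organism.

    Hypothetical helper: the query field names are assumptions.
    """
    query = f"gene_exact:{symbol} AND organism_id:{taxon_id}"
    params = {
        "query": query,
        "fields": "accession,gene_names",  # columns requested in the TSV
        "format": "tsv",
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"
```

Calling `uniprot_search_url("MATN1")` gives a URL that can be opened in a browser or fetched with any HTTP client; restricting by taxonomy ID is what avoids the "many species with the same gene name" problem from the question.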


Your question is missing key details. Which parameters/options did you use for NCBI BLAST? Please also give the URL.

6.8 years ago

I would use the Ensembl API for this. You can query by all kinds of synonyms and get the sequences in one short script. I wouldn't recommend Matlab for this: it's not very commonly used in bioinformatics, so you'll quickly hit its limits and have to learn something else.
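
To give an idea of what such a short script could look like, here is a minimal sketch that uses the Ensembl REST service rather than the Perl API (my substitution, not necessarily what was meant above). It assumes the `xrefs/symbol/:species/:symbol` and `sequence/id/:id` endpoints at `https://rest.ensembl.org`; the helper names are made up for the example, and nothing here contacts the network until `fetch_json` is called.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def xref_url(symbol, species="mus_musculus"):
    """URL that looks up Ensembl objects linked to a gene symbol or synonym."""
    return f"{ENSEMBL_REST}/xrefs/symbol/{species}/{quote(symbol)}"


def sequence_url(ensembl_id, seq_type="genomic"):
    """URL that returns the sequence of a stable ID (genomic, cdna, cds, protein)."""
    return f"{ENSEMBL_REST}/sequence/id/{quote(ensembl_id)}?type={seq_type}"


def gene_ids_from_xrefs(records):
    """Keep only gene-level stable IDs from a decoded xrefs/symbol response."""
    return [r["id"] for r in records if r.get("type") == "gene"]


def fetch_json(url):
    """GET a URL and decode the JSON body (needs network access)."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.load(response)
```

A typical loop would call `fetch_json(xref_url(name))` for each gene name, pass the result through `gene_ids_from_xrefs`, then call `fetch_json(sequence_url(gene_id))` and read the sequence from the returned record. For thousands of genes, add a pause between requests to stay within the service's rate limits.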

6.8 years ago
James Ashmore ★ 3.4k

This answer requires you to have R and the relevant Bioconductor packages installed:

```r
# Load relevant Bioconductor packages
library("AnnotationDbi")
library("BSgenome.Mmusculus.UCSC.mm10")
library("org.Mm.eg.db")
library("TxDb.Mmusculus.UCSC.mm10.knownGene")

# Read the gene names (one per line) from a file called geneNames.txt
geneNames <- read.table("geneNames.txt", header = FALSE, stringsAsFactors = FALSE)[, 1]

# Get the coordinates of all mouse genes
geneRanges <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)

# Annotate the coordinates with the gene symbol
geneRanges$gene_symbol <- mapIds(org.Mm.eg.db, keys = geneRanges$gene_id, column = "SYMBOL", keytype = "ENTREZID", multiVals = "first")

# Select only those genes which appear in your geneNames.txt file
# (compare in upper case on both sides, since mouse symbols are normally written like "Matn1")
selectGenes <- toupper(geneRanges$gene_symbol) %in% toupper(geneNames)
geneRanges <- geneRanges[selectGenes]

# Get the genomic sequence (exons and introns) for each of your selected genes
geneSequences <- getSeq(BSgenome.Mmusculus.UCSC.mm10, geneRanges)
names(geneSequences) <- geneRanges$gene_symbol

# Write the sequences to disk
writeXStringSet(geneSequences, "geneSequences.fasta")
```
6.8 years ago
mbk0asis ▴ 680

You can use one of the files on the NCBI Gene database FTP site to convert the gene names to IDs.
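
For example, here is a minimal sketch of the conversion, assuming the tab-delimited gene_info file from that site, whose first five columns I take to be tax_id, GeneID, Symbol, LocusTag and Synonyms (synonyms separated by `|`, with `-` meaning none). The function name is made up for the example.

```python
MOUSE_TAX_ID = "10090"  # NCBI taxonomy ID for Mus musculus


def symbol_to_gene_ids(lines, wanted_symbols, tax_id=MOUSE_TAX_ID):
    """Map gene names to Entrez GeneIDs using gene_info-style lines.

    Matching is case-insensitive and also checks the Synonyms column.
    Returns {UPPERCASE_NAME: [GeneID, ...]} for names that were found.
    """
    wanted = {s.upper() for s in wanted_symbols}
    mapping = {}
    for line in lines:
        if line.startswith("#"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != tax_id:
            continue  # malformed line or another species
        gene_id, symbol, synonyms = fields[1], fields[2], fields[4]
        names = [symbol]
        if synonyms != "-":
            names.extend(synonyms.split("|"))
        for name in names:
            key = name.upper()
            if key in wanted:
                ids = mapping.setdefault(key, [])
                if gene_id not in ids:
                    ids.append(gene_id)
    return mapping
```

With the mouse file (I believe it is Mus_musculus.gene_info.gz under gene/DATA/GENE_INFO/Mammalia/, but please check the site), you could pass `gzip.open(path, "rt")` as `lines`. A name that maps to more than one GeneID, or to none, is worth checking by hand.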
