How to select only protein-coding mRNAs from a really long list of ENSG IDs
4.9 years ago
curious ▴ 750

I have an R dataframe with a column of ENSG IDs.

I believe it contains non-protein-coding IDs that I do not want.

I only want to keep the rows that correspond to protein-coding mRNAs.

I am looking for a source of ENSG IDs (list or similar) that only contains IDs corresponding to protein coding mRNA.

I don't really need help with the coding; I am just looking for the data source.

The best thing I can think to do is scrape GENCODE's "Protein-coding transcript sequences" FASTA, but there is hopefully a better way.

Thank you.

ensg ensembl ids rna • 2.5k views
4.9 years ago
curious ▴ 750

This is what I ended up with, if anyone is curious:

library(biomaRt)

# Connect to the Ensembl BioMart human gene dataset
ensembl <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

example_ids <- c("ENSG00000172927", "ENSG00000224713", "ENSG00000135269", "ENSG00000272555", "ENSG00000013588")

# Look up only the given IDs and keep those whose biotype is protein_coding
res <- getBM(attributes = c("ensembl_gene_id", "gene_biotype"),
             filters = c("ensembl_gene_id", "biotype"),
             values = list(example_ids, "protein_coding"),
             mart = ensembl)
res
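
For completeness, here is a rough sketch of how the returned IDs could be used to subset the original data frame. The data frame `df`, its column name `ensg_id`, and the `count` values are made up for illustration, and `res` is mocked here so the snippet runs without a network connection. The biotype assignments in the mock are placeholders, not real annotations.

```r
# Hypothetical stand-in for the original data frame; the column name
# "ensg_id" and the "count" values are invented for illustration.
df <- data.frame(
  ensg_id = c("ENSG00000172927", "ENSG00000224713", "ENSG00000135269"),
  count = c(10, 3, 7),
  stringsAsFactors = FALSE
)

# Mocked query result (assumption): in practice this would be the data
# frame returned by getBM(). Here we simply pretend that the first and
# third IDs came back as protein_coding.
res <- data.frame(
  ensembl_gene_id = c("ENSG00000172927", "ENSG00000135269"),
  gene_biotype = "protein_coding",
  stringsAsFactors = FALSE
)

# Keep only the rows whose ID appears in the query result
df_coding <- df[df$ensg_id %in% res$ensembl_gene_id, , drop = FALSE]
df_coding
```

Using `%in%` keeps the original row order and any duplicate IDs in `df`, which a `merge()` would not necessarily do.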
4.9 years ago
h.mon 35k

Use the biomaRt Bioconductor package to query Ensembl directly. See the thread "Ensembl: Protein coding transcript ids" for pointers.
