Question

Mapping refseq transcripts to encoded proteins (NM_ to NP_)

0

Entering edit mode

9.7 years ago

Endre Bakken Stovner ▴ 960

I'd like to find the protein encoded by my refgene transcripts.

I have the refGene.txt file where each line looks like the following:

138    NM_016166    chr15    +    68346571    68480404    68346664    68480173    14    68346571,68378643,68434283,68434627,68438153,68438903,68445927,68457068,68466069,68467974,68468811,68473549,68475967,68479879,    68346688,68379088,68434368,68434675,68438244,68439038,68446033,68457142,68466230,68468105,68468992,68473692,68476005,68480404,    0    PIAS1    cmpl    cmpl    0,0,1,2,2,0,0,1,0,2,1,2,1,0,

I'd like to know how I can get the NP_XXXX name for the transcript NM_016166.

Preferably, I'd like to just get a text file mapping the two in some way, but I'll accept any answer that does this automatically, e.g. with BioPython (having to look it up by hand in a browser or some such doesn't cut it - I need to do this for 50K transcripts).

I'm working with hg38, but I'm guessing the procedure is the same for all major genome versions, so I did not specify to make the Q as general as possible.

refseq • 5.0k views

ADD COMMENT • link updated 18 months ago by Ram 43k • written 9.7 years ago by Endre Bakken Stovner ▴ 960

Ram · Accepted Answer · 2014-08-22

2

Entering edit mode

9.7 years ago

5heikki 11k

Here's one way with Entrez Direct:

IFS=$'\n'; for next in $(cut -f2 refGene.txt); do NP=$(esearch -db nuccore -query $next | elink -target protein | efetch -format docsum | xtract -element Caption); echo "$next $NP"; done

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 9.7 years ago by 5heikki 11k

0

Entering edit mode

Thanks, I need to install some tools so will get back to you. Upvote.

Ps. other answers still welcome. Is there a file that contains these mappings somewhere?

ADD REPLY • link 9.7 years ago by Endre Bakken Stovner ▴ 960

0

Entering edit mode

Your previous, non-while version was better. This one includes an error (and only reads the first line anyways, see this post on StackOverflow.

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 9.7 years ago by Endre Bakken Stovner ▴ 960

0

Entering edit mode

My, bad. I edited the answer again.

ADD REPLY • link 9.7 years ago by 5heikki 11k

Ram · Accepted Answer · 2015-08-05

2

Entering edit mode

8.7 years ago

Endre Bakken Stovner ▴ 960

Hello myself from the past! After having done similar tasks umpteen gazillion times you realized you should make them easier for yourself. Therefore you wrote biomartian (pip install biomartian:

echo "138 NM_016166 chr15 + 68346571 68480404 68346664 68480173 14" | sed 's/ /\t/g' | biomartian --noheader -c 1 -i refseq_mrna -o refseq_peptide -
0    1    2    3    4    5    6    7    8    refseq_peptide
138    NM_016166    chr15    +    68346571    68480404    68346664    68480173    14    NP_057250

(The code before the biomartian call is just used to replace spaces with tabs).

See https://github.com/endrebak/biomartian for more

ADD COMMENT • link updated 18 months ago by Ram 43k • written 8.7 years ago by Endre Bakken Stovner ▴ 960

0

Entering edit mode

this example does not work in my hands. The refseq_peptide returns blank.

ADD REPLY • link 7.8 years ago by Malcolm.Cook ★ 1.5k

Ram · Accepted Answer · 2014-08-22

1

Entering edit mode

9.7 years ago

Endre Bakken Stovner ▴ 960

The UCSC table to map transcripts to proteins is called refLink.

You can get it as a plain-text file by going to the UCSC table browser, selecting your genome of interest and then setting the group to "all tables". Next choose the table refLink, enter a filename and then press the "get output"-button.

Example output:

#name product mrnaAcc protAcc geneName prodName locusLinkId omimId
PAX2 paired box 2 NM_001282819 NP_001269748 133420 110035 102094402 0
MLPH melanophilin NM_001282836 NP_001269765 65090 92499 102093006 0

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.7 years ago by Endre Bakken Stovner ▴ 960

0

Entering edit mode

Great! Your answer help me a lot!

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Li Hsing ▴ 10