RefSeq Version in hg19 Human Reference Genome
2.2 years ago

Hi.

I'm trying to retrieve transcript sequences from transcript IDs. I need the same transcript versions that were used in the experiment I'm working with, but the version numbers are missing from its data.

I know from the experiment description that raw reads were mapped to the hg19 transcriptome, which was aggregated from the UCSC RefSeq and GENCODE v12 databases.

I've tried using the GTF file for hg19, but the versions don't match. I'm working with a large dataset, so I need an easy and direct way to determine the right versions.

Thank you!


Why do you not want to use the latest version of the RefSeq transcripts? There may be very minor changes, but in any case the latest version reflects the best information available at present.


Because for each transcript I have a string of scores, one score per nucleotide, so it's important for me to know the exact sequence. Moreover, different versions differ in length, which would make the positional score information useless.

4 weeks ago
genomax 68k
United States

You can use NCBI's E-utilities (not the Unix command-line utilities) to retrieve this information. For example, you can retrieve a previous version such as NM_011721.3 (the current version is NM_011721.4) by using a URL in this format: https://www.ncbi.nlm.nih.gov/nuccore/NM_011721.3

If you need just the sequence, you could use: https://www.ncbi.nlm.nih.gov/nuccore/NM_011721.3?report=fasta
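For a large dataset, the same lookup can be scripted. The following is a minimal sketch, not a tested pipeline: it assumes the E-utilities EFetch endpoint with `db=nuccore` and `rettype=fasta`, and the function names (`efetch_fasta_url`, `fetch_fasta`, `fasta_to_sequence`) are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of NCBI's EFetch service (part of E-utilities).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str) -> str:
    """Build an EFetch URL for one versioned accession, e.g. 'NM_011721.3'."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the FASTA record as text (needs network access)."""
    with urlopen(efetch_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


def fasta_to_sequence(fasta_text: str) -> str:
    """Drop header lines and join the sequence lines into one uppercase string."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(
        line for line in lines if line and not line.startswith(">")
    ).upper()
```

Usage would look like `fasta_to_sequence(fetch_fasta("NM_011721.3"))`. When looping over many accessions, keep the request rate low (NCBI asks for no more than about three requests per second without an API key).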


Thank you, I did what you suggested, but another problem came up. The lengths of some transcripts differ from those in my experimental data. I checked different versions, but none of them seems to be right. I noticed that some NCBI transcripts have poly-A tails. Is there a way to clean the sequences? I already tried a strip(A) function, but some of my transcripts genuinely end in a few As, so after removing those as well the length still doesn't match. I wrote a script that strips all trailing As and then adds back the number of As (max 10) needed to match the length of my transcript; if that doesn't match, it moves on to another version. This seems to work, but I can't be sure that I'm picking the right version and that the sequence is not altered. It's important for me that the sequence is identical, nucleotide for nucleotide, to the one used in the experiment, to avoid score shifts that would invalidate the whole dataset.
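The trimming step described above could be sketched as follows. This is a variant of that script, not a reconstruction of it: rather than stripping every trailing A and adding some back, it keeps a prefix of the NCBI record and accepts it only if everything cut off is A, so no base is ever invented. The function names and the accessions in the usage note are hypothetical. A length match alone still does not prove that the sequence is the one used in the experiment.

```python
from typing import Dict, Optional


def trim_to_expected_length(sequence: str, expected_len: int) -> Optional[str]:
    """Return the first expected_len bases if the removed suffix is only A's.

    Returns None when the record is shorter than expected_len or when
    trimming would remove anything other than A.
    """
    seq = sequence.upper()
    if expected_len < 0 or expected_len > len(seq):
        return None
    removed = seq[expected_len:]
    # An empty suffix (exact length match) also passes this check.
    if set(removed) <= {"A"}:
        return seq[:expected_len]
    return None


def matching_versions(candidates: Dict[str, str], expected_len: int) -> Dict[str, str]:
    """Map each candidate accession that fits expected_len to its trimmed sequence.

    All fitting versions are returned, so an ambiguous result
    (more than one match) stays visible instead of being hidden.
    """
    matches = {}
    for accession, sequence in candidates.items():
        trimmed = trim_to_expected_length(sequence, expected_len)
        if trimmed is not None:
            matches[accession] = trimmed
    return matches
```

For example, `matching_versions({"X.1": "ACGGTAAA", "X.2": "ACGTAAAA"}, 5)` returns both versions, with different sequences. When more than one version fits, length cannot decide between them, and some independent evidence (such as the original reference FASTA) would be needed.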


The length of some transcripts differ from the ones of my experimental data.

Since the information at NCBI is archival (records are not changed, which is why there are version numbers), all bets are probably off, especially if things don't match and you need to start modifying NCBI data so that they do. It sounds like you did not do the analysis that produced the results you are working with. Do you have a way to find out how the old results were generated? Are you able to repeat the analysis (probably very painful) with current, traceable information? You seem to have a special use case anyway.