Question

SNP position in UTRs

0


6.6 years ago

DanielC ▴ 170

Dear All,

I am trying to get the positions of the SNPs present in the 5' and 3' UTRs of the BRCA1 mRNA.

To do this, I retrieved rsIDs for BRCA1 from the UCSC public MySQL server with this command:

mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19  -e 'select chrom,chromStart,chromEnd,name,func from snp150 where chrom="chr17" and chromStart>26935980 and chromEnd <= 81742541 and (find_in_set("untranslated-5",func)!=0 or find_in_set("untranslated-3",func)!=0) '

The command returned 144107 rsIDs in total. The first rows look like this:

chrom   chromStart      chromEnd        name    func
chr17   26935989        26935990        rs1002304941    intron,untranslated-3
chr17   26936027        26936028        rs1027461134    intron,untranslated-3
chr17   26936037        26936038        rs143374718     intron,untranslated-3
chr17   26936044        26936048        rs952717363     intron,untranslated-3
chr17   26936049        26936050        rs372539326     intron,untranslated-3
chr17   26936050        26936051        rs749382342     intron,untranslated-3
chr17   26936050        26936064        rs1004133390    intron,untranslated-3
chr17   26936095        26936097        rs965637526     intron,untranslated-3
chr17   26936096        26936097        rs550417993     intron,untranslated-3
chr17   26936100        26936101        rs147329625     intron,untranslated-3
chr17   26936106        26936107        rs186832245     intron,untranslated-3
chr17   26936118        26936119        rs977388897     intron,untranslated-3
chr17   26936132        26936133        rs543307824     intron,untranslated-3
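
The `find_in_set` conditions in the query keep any row whose comma-separated `func` value contains a UTR tag, which is why combined values such as `intron,untranslated-3` appear in the result. A minimal Python sketch of the same membership test is shown here, assuming the query output was saved as tab-separated text with the header row shown above. The function names are hypothetical, and the two sample rows are copied from the output above.

```python
import csv
import io

# The two func tags that the SQL query tests with find_in_set().
UTR_TAGS = {"untranslated-5", "untranslated-3"}

def utr_tags(func_field):
    """Return the UTR tags present in a comma-separated func value."""
    return UTR_TAGS & set(func_field.split(","))

def filter_utr_rows(tsv_text):
    """Yield (name, chromStart, chromEnd, UTR tags) for rows carrying a UTR tag."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        tags = utr_tags(row["func"])
        if tags:
            yield (row["name"], int(row["chromStart"]), int(row["chromEnd"]),
                   ",".join(sorted(tags)))

# Sample rows copied from the query output in this post.
sample = (
    "chrom\tchromStart\tchromEnd\tname\tfunc\n"
    "chr17\t26935989\t26935990\trs1002304941\tintron,untranslated-3\n"
    "chr17\t26936027\t26936028\trs1027461134\tintron,untranslated-3\n"
)

for record in filter_utr_rows(sample):
    print(*record, sep="\t")
```

Testing set membership rather than comparing the whole string means a row is kept whatever other annotations share the field, so restricting the match to one exact combination such as "ncRNA,untranslated-3" would drop rows that the query deliberately returns.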

Can someone please help me with the questions below:

a) Among these rsIDs, if I am right, I should keep only the entries annotated "ncRNA,untranslated-3" and "ncRNA,untranslated-5" for the 3' and 5' UTRs? I ask because there are many other combinations, such as "intron,near-gene-5,untranslated-5", "untranslated-3", etc.

Using XSLT, I extracted the 5' and 3' UTRs of the BRCA1 RefSeq transcript "NM_007299.3" (shown below). This transcript has rsIDs associated with it, one of which is "rs863224421". (The rsID information for this transcript comes from the XML output file for BRCA1.)

    >NM_007299.3|-195|5' UTR
    cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
    aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
    ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
    tggaacagaaagaa
    >NM_007299.3|2294-|3' UTR
    ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
    gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
    aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
    agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
    gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc

b) When I searched for "rs863224421" in the list of 144107 rsIDs returned by the UCSC MySQL command, I could not find it. If I am right, this rsID should be present in the list? Please let me know if I am missing something.

c) Given the XSLT result for BRCA1 RefSeq transcripts such as "NM_007299.3" shown above, could you please tell me how to find which rsIDs belong to the 5' and 3' UTRs of "NM_007299.3", and at what position? For instance, information like this:

refseq-gene-id               mutated-allele        position
NM_007299.3|-195|5' UTR      (let's say) c to t    (let's say) 200

Please let me know if something is not clear.

Thanks much! :-) DK

SNP • 1.9k views

ADD COMMENT • link updated 3.6 years ago by Biostar 20 • written 6.6 years ago by DanielC ▴ 170

score 1 · Answer 1 · 2017-09-23

1


6.6 years ago

finswimmer 16k

Hello,

I'm a bit confused about the things you have tried to solve your task. This is what I would do:

First you need the genomic regions of your 5'- and 3'-UTR. These depend on your transcript. One could use Ensembl's REST API.

E.g.: http://rest.ensembl.org/lookup/id/ENST00000357654?content-type=application/json;expand=1;utr=1

You can then extract the genomic regions from that result:

5'-UTR: 17:43124097-43124115

3'-UTR: 17:43044295-43045677

With this you can ask the API which variants overlap this region:

http://rest.ensembl.org/overlap/region/human/17:43124097-43124115?feature=variation;content-type=application/json

http://rest.ensembl.org/overlap/region/human/17:43044295-43045677?feature=variation;content-type=application/json

fin swimmer

ADD COMMENT • link 6.6 years ago by finswimmer 16k

0


Thanks much Fin! :-) I would really appreciate it if you could tell me how to extract the 5' and 3' UTRs using these regions:

5'-UTR: 17:43124097-43124115

3'-UTR: 17:43044295-43045677

And, in the response from the link below, there are two 5' UTR entries. How do I know which one to take?

http://rest.ensembl.org/lookup/id/ENST00000357654?content-type=application/json;expand=1;utr=1

    [{"source":"ensembl_havana","object_type":"five_prime_UTR","species":"homo_sapiens","assembly_name":"GRCh38","Parent":"ENST00000357654","end":43125370,"seq_region_name":"17","db_type":"core","strand":-1,"id":"ENST00000357654","start":43125271},
    {"source":"ensembl_havana","object_type":"five_prime_UTR","species":"homo_sapiens","assembly_name":"GRCh38","Parent":"ENST00000357654","end":43124115,"seq_region_name":"17","db_type":"core","strand":-1,"id":"ENST00000357654","start":43124097},

Thanks much!

ADD REPLY • link 6.6 years ago by DanielC ▴ 170

1


Hello,

What you get back from the REST API is in JSON format. Most commonly used programming languages can parse this format very easily. Which one do you want to use?

"In this link: (below, there are two 5' UTR entries, how to know which one to take?)"

Oh, sorry. I wasn't aware that a UTR can of course span multiple non-coding exons, so there might be multiple regions. You then have to use all of them.
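
To illustrate using all of the segments, here is a minimal Python sketch that groups UTR entries by type and builds one region string per segment. The two entries are copied from the lookup response quoted in this thread; the function name is hypothetical, and the code only assumes the `object_type`, `seq_region_name`, `start` and `end` keys visible in that response.

```python
from collections import defaultdict

# Two five_prime_UTR entries copied from the lookup response quoted in this thread.
utr_entries = [
    {"object_type": "five_prime_UTR", "seq_region_name": "17",
     "start": 43125271, "end": 43125370, "strand": -1},
    {"object_type": "five_prime_UTR", "seq_region_name": "17",
     "start": 43124097, "end": 43124115, "strand": -1},
]

def group_utr_segments(entries):
    """Map each UTR type to a list of (region string, length) segments."""
    grouped = defaultdict(list)
    for entry in entries:
        region = f"{entry['seq_region_name']}:{entry['start']}-{entry['end']}"
        # Ensembl coordinates are 1-based and inclusive, hence the +1.
        length = entry["end"] - entry["start"] + 1
        grouped[entry["object_type"]].append((region, length))
    return dict(grouped)

segments = group_utr_segments(utr_entries)
for utr_type, parts in segments.items():
    total = sum(length for _, length in parts)
    print(utr_type, [region for region, _ in parts], total, sep="\t")
```

Each region string can then be passed to the overlap endpoint separately, so variants in every segment of the UTR are collected.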

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

0


"What you get back from the REST API is in JSON format. Most commonly used programming languages can parse this format very easily. Which one do you want to use?"

Thanks Fin! I would like to use Bash/shell to extract the regions. Could you please tell me how this complete procedure can be automated using bash?

Thanks, DK

ADD REPLY • link 6.6 years ago by DanielC ▴ 170

1


Hello again,

I don't know how to do this in bash. There might be a way, but I think it would be much more complicated. Here is a quick-and-dirty solution in Python. Just change the transcript ID to suit your needs:

import requests

transcript = "ENST00000357654"

server = "http://rest.ensembl.org"
headers = {"Content-Type": "application/json"}

# Look up the transcript; utr=1 adds its UTR segments to the response.
ext = "/lookup/id/" + transcript + "?content-type=application/json;expand=1;utr=1"
r = requests.get(server + ext, headers=headers)
r.raise_for_status()  # raises an exception on any HTTP error status

lookup = r.json()

# A UTR can span several exons, so query every segment in turn.
for data in lookup['UTR']:
    region = str(data['seq_region_name']) + ":" + str(data['start']) + "-" + str(data['end'])
    ext = "/overlap/region/human/" + region + "?feature=variation;content-type=application/json"

    r = requests.get(server + ext, headers=headers)
    r.raise_for_status()

    variations = r.json()

    # One tab-separated row per variant: chromosome, start, end, rsID, UTR type.
    for var in variations:
        print(var['seq_region_name'], var['start'], var['end'], var['id'], data['object_type'], sep="\t")
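
If the alleles of each variant are also wanted, the print step could be extended along these lines. This is only a sketch: it assumes each variation record returned by the overlap endpoint carries an `alleles` list with the reference allele first, which should be checked against a real response. The helper name is hypothetical, and the example record is made up for illustration rather than taken from the API.

```python
def format_variant(var, utr_type):
    """Build one tab-separated row, appending reference and alternate alleles."""
    # Assumption: 'alleles' is a list such as ["C", "T"], reference allele first.
    alleles = var.get("alleles") or []
    ref = alleles[0] if alleles else "."
    alt = ",".join(alleles[1:]) or "."
    fields = [var["seq_region_name"], var["start"], var["end"],
              var["id"], utr_type, ref, alt]
    return "\t".join(str(field) for field in fields)

# Illustrative record only; the alleles are placeholders, not verified data.
example = {"seq_region_name": "17", "start": 43125277, "end": 43125277,
           "id": "rs1057522752", "alleles": ["C", "T"]}

print(format_variant(example, "five_prime_UTR"))
```

Inside the loop above, `print(format_variant(var, data['object_type']))` would replace the existing print call. Records without an `alleles` list get `.` in both allele columns instead of raising an error.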

ADD REPLY • link 6.6 years ago by finswimmer 16k

0


Thanks Fin! :-) The program outputs the rsIDs located in the UTRs. What I was asking about is how to get the 5' and 3' UTR sequences for these regions:

5'-UTR: 17:43124097-43124115

3'-UTR: 17:43044295-43045677

in a format like the one below (could you please share how this could be done?):

    >NM_007299.3 | 43124097-43124115 | 5' UTR
    cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
    aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
    ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
    tggaacagaaagaa

    >NM_007299.3 | 43044295-43045677 | 3' UTR
    ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
    gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
    aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
    agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
    gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc
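
One way such records might be produced is with the Ensembl REST sequence endpoint (`/sequence/region/:species/:region`), which takes a region written as `chrom:start..end:strand`. The sketch below only builds the request URL and formats a FASTA record; the helper names are hypothetical and the demonstration sequence is a placeholder, not BRCA1 sequence. It assumes strand `-1` is wanted because the lookup response reports `"strand":-1` for this transcript, and that a UTR split over several segments would need its pieces fetched separately and joined in transcript order.

```python
SERVER = "https://rest.ensembl.org"

def sequence_url(chrom, start, end, strand=1, species="human"):
    """Build the URL for the sequence of one genomic region as plain text."""
    region = f"{chrom}:{start}..{end}:{strand}"
    return f"{SERVER}/sequence/region/{species}/{region}?content-type=text/plain"

def to_fasta(header, sequence, width=60):
    """Format a sequence as a FASTA record wrapped at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines)

def fetch_region(chrom, start, end, strand=1):
    """Fetch the sequence over the network (not called in this demonstration)."""
    import requests  # third-party package, only needed for the live request
    response = requests.get(sequence_url(chrom, start, end, strand))
    response.raise_for_status()
    return response.text.strip()

# URL for the second 5' UTR segment listed above, on the minus strand.
print(sequence_url("17", 43124097, 43124115, strand=-1))

# Offline demonstration of the FASTA formatting with a placeholder sequence.
print(to_fasta("demo | 1-8 | 5' UTR", "ACGTACGT", width=4))
```

Calling `to_fasta(header, fetch_region("17", 43124097, 43124115, strand=-1))` would then give one record per segment. Note that these coordinates come from the Ensembl transcript ENST00000357654 on GRCh38, so labelling the records with the RefSeq accession NM_007299.3 assumes the two transcripts share the same UTR boundaries.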

And in the script you shared, the output is:

17    43125277    43125277    rs1057522752    five_prime_UTR
17    43125280    43125280    rs774839252     five_prime_UTR
17    43125288    43125288    rs1057524754    five_prime_UTR
17    43125293    43125293    rs544342552     five_prime_UTR
17    43125296    43125296    rs1057521869    five_prime_UTR
17    43125300    43125300    rs1057523081    five_prime_UTR
17    43125309    43125309    rs1031507808    five_prime_UTR
17    43125310    43125310    rs900972535     five_prime_UTR

Please let me know if it is possible to add the reference (wild-type) allele and the mutant allele for every rsID, like this:

17    43125277    43125277    rs1057522752    five_prime_UTR    C    T
17    43125280    43125280    rs774839252     five_prime_UTR    T    G
17    43125288    43125288    rs1057524754    five_prime_UTR    A    T
17    43125293    43125293    rs544342552     five_prime_UTR    C    G
17    43125296    43125296    rs1057521869    five_prime_UTR    A    C
17    43125300    43125300    rs1057523081    five_prime_UTR    T    A
17    43125309    43125309    rs1031507808    five_prime_UTR    G    C
17    43125310    43125310    rs900972535     five_prime_UTR    C    T

Thanks much!

ADD REPLY • link 6.6 years ago by DanielC ▴ 170

score 1 · Answer 2 · 2017-09-26

1


6.6 years ago

finswimmer 16k

Depending on what you want to do, you can also just go to the Variation Table of the transcript, filter it to show only UTR variants, and export it as a CSV file.

Be aware that the reference genome used by Ensembl is hg38 (GRCh38). If you need hg go here.

ADD COMMENT • link 6.6 years ago by finswimmer 16k