Dear All,
I am trying to get the positions of the SNPs present in the 5' and 3' UTRs of the BRCA1 mRNA.
To do this, I retrieved rsIDs for BRCA1 from the UCSC MySQL server with this command:
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select chrom,chromStart,chromEnd,name,func from snp150 where chrom="chr17" and chromStart>26935980 and chromEnd <= 81742541 and (find_in_set("untranslated-5",func)!=0 or find_in_set("untranslated-3",func)!=0) '
When I ran the command, it returned 144107 rsIDs in total. The output starts like this:
chrom chromStart chromEnd name func
chr17 26935989 26935990 rs1002304941 intron,untranslated-3
chr17 26936027 26936028 rs1027461134 intron,untranslated-3
chr17 26936037 26936038 rs143374718 intron,untranslated-3
chr17 26936044 26936048 rs952717363 intron,untranslated-3
chr17 26936049 26936050 rs372539326 intron,untranslated-3
chr17 26936050 26936051 rs749382342 intron,untranslated-3
chr17 26936050 26936064 rs1004133390 intron,untranslated-3
chr17 26936095 26936097 rs965637526 intron,untranslated-3
chr17 26936096 26936097 rs550417993 intron,untranslated-3
chr17 26936100 26936101 rs147329625 intron,untranslated-3
chr17 26936106 26936107 rs186832245 intron,untranslated-3
chr17 26936118 26936119 rs977388897 intron,untranslated-3
chr17 26936132 26936133 rs543307824 intron,untranslated-3
Can someone please help me with the questions below?
a) Among these rsIDs, for the 3' and 5' UTRs, if I am right, should I extract only the "ncRNA,untranslated-3" and "ncRNA,untranslated-5" entries? I ask because there are many other entries, such as "intron,near-gene-5,untranslated-5", "untranslated-3", etc.
Using XSLT, I found the 3' and 5' UTRs for the BRCA1 RefSeq ID "NM_007299.3" (shown below). This RefSeq ID has rsIDs associated with it; one of them is "rs863224421". (The rsID information for this particular RefSeq ID was obtained from the XML output file for BRCA1.)
>NM_007299.3|-195|5' UTR
cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
tggaacagaaagaa
>NM_007299.3|2294-|3' UTR
ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc
b) When I searched for "rs863224421" in the list of 144107 rsIDs returned by the UCSC MySQL command, I did not find it. If I am right, this rsID should be present in the list? Please let me know if I am missing something.
c) From the XSLT result for BRCA1 RefSeq IDs such as "NM_007299.3" shown above, could you please tell me how to find which rsIDs belong to the 5' and 3' UTRs of "NM_007299.3", and at what positions? For instance, information like this:
refseq-gene-id             mutated-allele        position
NM_007299.3|-195|5' UTR    c to t (let's say)    200 (let's say)
Please let me know if something is not clear.
Thanks much! :-) DK
Thanks much, Fin! :-) I would really appreciate it if you could tell me how to extract the 5' and 3' UTRs using this.
And,
In this link there are two 5' UTR entries (shown below). How do I know which one to take?
http://rest.ensembl.org/lookup/id/ENST00000357654?content-type=application/json;expand=1;utr=1
[{"source":"ensembl_havana","object_type":"five_prime_UTR","species":"homo_sapiens","assembly_name":"GRCh38","Parent":"ENST00000357654","end":43125370,"seq_region_name":"17","db_type":"core","strand":-1,"id":"ENST00000357654","start":43125271},
{"source":"ensembl_havana","object_type":"five_prime_UTR","species":"homo_sapiens","assembly_name":"GRCh38","Parent":"ENST00000357654","end":43124115,"seq_region_name":"17","db_type":"core","strand":-1,"id":"ENST00000357654","start":43124097},
Thanks much!
Hello,
What you get back from the REST API is in JSON format. Most commonly used programming languages can parse this format very easily. Which one do you want to use?
Oh, sorry. I wasn't aware that the UTR can of course span multiple non-coding exons, so there might be multiple regions. You then have to use all of them.
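To illustrate "use all of them", here is a minimal Python sketch that groups UTR entries by type and orders the segments along the transcript. The two entries are copied from the JSON fragment quoted above (only the fields needed here); the names `group_utr_segments` and `total_length` are made up for this example, and how you pull the UTR list out of the full response is left to you.

```python
import json
from collections import defaultdict

# The two 5' UTR entries quoted in this thread (reduced to the fields used below).
# The quoted response is truncated, so this is only a fragment of the full list.
SAMPLE_UTRS = json.loads("""[
  {"object_type": "five_prime_UTR", "Parent": "ENST00000357654",
   "seq_region_name": "17", "strand": -1, "start": 43125271, "end": 43125370},
  {"object_type": "five_prime_UTR", "Parent": "ENST00000357654",
   "seq_region_name": "17", "strand": -1, "start": 43124097, "end": 43124115}
]""")


def group_utr_segments(utr_entries):
    """Group UTR segments by object_type, ordered 5'->3' along the transcript."""
    grouped = defaultdict(list)
    for entry in utr_entries:
        grouped[entry["object_type"]].append(entry)
    for segments in grouped.values():
        # On the minus strand the transcript runs from high to low coordinates.
        minus_strand = segments[0]["strand"] == -1
        segments.sort(key=lambda e: e["start"], reverse=minus_strand)
    return dict(grouped)


def total_length(segments):
    """Summed length of all segments (Ensembl coordinates are 1-based, inclusive)."""
    return sum(s["end"] - s["start"] + 1 for s in segments)


utrs = group_utr_segments(SAMPLE_UTRS)
five_prime = utrs["five_prime_UTR"]
for seg in five_prime:
    print(seg["seq_region_name"], seg["start"], seg["end"], seg["strand"])
print("total 5' UTR length in this fragment:", total_length(five_prime))
```

Both segments belong to the same 5' UTR, so neither one is "the" entry to take; a SNP falls in the UTR if it lies inside any of the segments.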
fin swimmer
Thanks Fin! I would like to use Bash/shell to extract the regions. Could you please tell me how this complete procedure can be automated using bash?
Thanks, DK
Hello again,
I don't know how to do this in bash. There might be a way, but I think it would be much more complicated. Here is a quick-and-dirty solution in Python. Just change the transcript to suit your needs:
Thanks Fin! :-) The program outputs the rsIDs that fall in the UTRs. I was asking about getting the 5' and 3' UTR sequences for these regions,
like this below: (Could you please share how this could be done?)
And in the script you shared, the output is:
Please let me know if it is possible to add the "reference/wild" and the "mutant" allele for every rsID, like this:
Thanks much!
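For the UTR sequence part, one possible approach (a rough sketch, not the script from the earlier reply) is to request each UTR segment from the Ensembl REST `sequence/region` endpoint and join the pieces in transcript order. The helper names `region_url` and `fetch_utr_sequence` are made up for this example, the coordinates are the two GRCh38 segments quoted earlier in the thread, and it assumes the server returns the sequence on the requested strand, so minus-strand segments come back already reverse-complemented.

```python
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def region_url(segment, species="human"):
    """Build a sequence/region URL for one UTR segment (1-based, inclusive)."""
    region = "{seq_region_name}:{start}..{end}:{strand}".format(**segment)
    return f"{ENSEMBL_REST}/sequence/region/{species}/{region}?content-type=text/plain"


def fetch_utr_sequence(segments):
    """Fetch every segment and join them 5'->3' (needs network access)."""
    # Minus-strand transcripts run from high to low genomic coordinates.
    minus_strand = segments[0]["strand"] == -1
    ordered = sorted(segments, key=lambda s: s["start"], reverse=minus_strand)
    parts = []
    for seg in ordered:
        with urlopen(region_url(seg)) as response:
            parts.append(response.read().decode().strip())
    return "".join(parts)


# The two 5' UTR segments of ENST00000357654 quoted earlier in this thread.
segments = [
    {"seq_region_name": "17", "start": 43125271, "end": 43125370, "strand": -1},
    {"seq_region_name": "17", "start": 43124097, "end": 43124115, "strand": -1},
]

for seg in segments:
    print(region_url(seg))

# Uncomment to actually query the server:
# print(fetch_utr_sequence(segments))
```

Because the quoted JSON was truncated, these two segments may not be the complete 5' UTR; pass in every segment of one type from the full response.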