Biostar Beta. Not for public use.
Question: How To Get Promoter Sequences For Human Genes?

Could you advise how to get promoter sequences for all or many human genes, in flat file(s) or by SQL query? I understand there can be multiple definitions of a promoter region, but anything universal would work.

Asked by Yuri, 9.9 years ago (updated 23 months ago by jcb2g)

I asked a _similar_ question here: http://biostar.stackexchange.com/questions/544/suggestions-developing-a-pipe-line-for-scanning-genomic-regions-to-identify-kno
My question was generic in nature and did not get much response. Looking forward to others' comments on this specific question.

Reply by Khader Shameer, 9.9 years ago

The problem is that there is no standard definition of a promoter: for some it means a number of bases upstream of the ATG, for others the TATA box, etc.

Reply by Giovanni M Dall'Olio, 9.9 years ago

@giovanni: I completely agree.

Reply by Yuri, 9.9 years ago

Look at the web site below for around 13,000 human promoters.

http://www.people.virginia.edu/~akr4xc/Human%20Gene%20Promoters%20-%20Home.html

Reply by sallie7733brown, 2.9 years ago

My best bet would be to use BioMart's Martview: select a database, filter by the gene IDs you have (there are other ID options there too), and then use the sequence option in the attributes to choose which parts of the gene you want, be it exon, intron, promoter, upstream, downstream, etc.

I used this tool to get many upstream regions for mouse genes using just NCBI's gene IDs as input.

Answer by Paulo Nuin, 9.9 years ago

Which database would you recommend? I have HUGO gene symbols.

Reply by Yuri, 9.9 years ago

This url will give you an idea about what you need to do.

Just click on Filters, enter the HGNC ids in the box and run the search.

Reply by Paulo Nuin, 9.9 years ago (updated 18 months ago by RamRS)

I like both solutions, but this one is more straightforward and allows downloading a large number of sequences at once. Pierre's solution would also be very useful in other cases. Thank you, guys.

Reply by Yuri, 9.9 years ago

Here I query the UCSC anonymous MySQL server for the coordinates of the region between the transcription start and the CDS start (the 5' UTR; you can extend this interval to get a longer 'promoter'). This works only for strand "+"; for the reverse strand use cdsEnd and txEnd. The query builds a cURL command for the UCSC DAS server, and that command is then piped into sh to fetch the genomic sequences.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N \
 -e 'select concat("curl http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=",chrom,":",txStart+1,",",cdsStart+1) from knownGene where strand="+" and  txStart!= cdsStart limit 10'  |\
 sh > result.concatenated.xml

result:


<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr1" start="11874" stop="12190" version="1.00">
<DNA length="317">
cttgccgtcagccttttctttgacctcttctttctgttcatgtgtatttg
ctgtctcttagcccagacttcccgtgtcctttccaccgggcctttgagag
gtcacagggtcttgatgctgtggtcttcatctgcaggtgtctgacttcca
gcaactgctggcctgtgccagggtgcaagctgagcactggagtggagttt
tcctgtggagaggagccatgcctagagtgggatgggccattgttcatctt
ctggcccctgttgtctgcatgtaacttaataccacaaccaggcatagggg
aaagattggaggaaaga
</DNA>
</SEQUENCE>
</DASDNA>

<!DOCTYPE DASDNA SYSTEM "http://www.biodas.org/dtd/dasdna.dtd">
<DASDNA>
<SEQUENCE id="chr1" start="322037" stop="324343" version="1.00">
<DNA length="2307">
gggtctccctctgttgtccaaggctggagtgtagtagtgctatcgcagct
gactgcagcctcaaccttccaggctgaagcgatcctcccacctcaacctc
ccacgtggctgagactacaggtgcttgccactatgcccaactaacatttg
gaattttcgtatacgtggattccagaggggtgacagcgaaacgtgagtaa
(...)
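
The answer above covers only the "+" strand and notes that cdsEnd and txEnd should be used for the reverse strand. As a minimal sketch of that strand rule, the helper below turns knownGene-style rows into DAS segment strings for both strands. The function name utr5_segments, the input column order and the sample coordinates are assumptions made for illustration, not part of the original answer. It mirrors the original query's coordinate convention, which includes the first coding base, and it additionally skips rows with an empty CDS.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): read tab-separated
# knownGene-style rows in the assumed column order
#   chrom  strand  txStart  txEnd  cdsStart  cdsEnd
# and print one DAS-style segment per coding transcript that has a 5' UTR.
utr5_segments() {
  awk -F '\t' '
    $5 >= $6 { next }                      # skip rows with an empty CDS
    $2 == "+" && $3 != $5 {                # plus strand: txStart .. cdsStart
      printf "%s:%d,%d\n", $1, $3 + 1, $5 + 1
    }
    $2 == "-" && $4 != $6 {                # minus strand: cdsEnd .. txEnd
      printf "%s:%d,%d\n", $1, $6, $4
    }
  '
}

# Two made-up rows; the coordinates are illustrative only.
printf 'chr1\t+\t100\t900\t150\t800\nchr2\t-\t1000\t2000\t1100\t1900\n' | utr5_segments
```

Each printed segment could be appended to the dna?segment= URL used above. For minus-strand genes I would expect the server to return the reference-strand sequence, so the result would still need to be reverse-complemented to read in the direction of transcription.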

Answer by Pierre Lindenbaum, 9.9 years ago (updated 18 months ago by RamRS)

Thanks, Pierre, I'll try it. Do you know what the highest allowed LIMIT is?

Reply by Yuri, 9.9 years ago

@yuri don't be evil with UCSC :-) http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29

Reply by Pierre Lindenbaum, 9.9 years ago

This is trivial to do with the UCSC table browser.

http://genome.ucsc.edu/cgi-bin/hgTables?command=start

Select the gene track of interest. Then, select "sequence" for output option. Click "get output". On the next page, select "genomic" and click submit. On the next page, click the appropriate boxes, one of which is upstream by N bases. Your output will be the actual sequence. Alternatively, you can get just the coordinates by changing the parameters on the first table browser page.

Sean

Answer by Sean Davis, 9.9 years ago

But you cannot do that for thousands of genes...

Reply by Pierre Lindenbaum, 9.9 years ago

In fact you can do it for thousands of genes, but it will be slow as molasses: just click on "upload identifiers" and you get an input box. One problem with the UCSC approach is that it sometimes doesn't find the IDs you are looking for and leaves them out of the output. BioMart outputs everything, even the "empty" ones.

Reply by Paulo Nuin, 9.9 years ago

Just to be clear, what I described in my answer is for ALL transcripts in a track of interest, and it returns instantaneously, network bandwidth permitting. nuin points out that it is straightforward to give a list of IDs, but the point about the "empty" ones is valid.

Reply by Sean Davis, 9.9 years ago

The Regulatory Sequence Analysis Tools (RSAT) website is very handy for obtaining promoter sequences and allows nice customization of the upstream and/or downstream size with respect to certain landmarks. Even better, a bunch of species are supported.

http://rsat.ulb.ac.be/rsat/

Answer by Gurado, 9.9 years ago

Here's a command-line method using UCSC and bedtools. It assumes you have a local copy of the genome, bedtools installed, and that the promoter is defined as some number of bases, which you choose, relative to the transcription start site (TSS). At UCSC the left coordinate is always txStart, whereas the TSS is where transcription starts and can be the left or right coordinate depending on strand. Thus to get the TSS I grab txStart for positive-strand genes and txEnd for negative-strand genes. I'm not sure if there's a way to combine this into a single SQL statement.

Step 1 - get TSS from UCSC using MySQL, use tail to remove header line:

# TSS for plus strand genes
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e \
'SELECT chrom,txStart,txStart,"TSS",".",strand FROM knownGene WHERE strand = "+";' \
| tail -n +2 > tss.bed

# concatenate TSS for negative strand genes
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e \
'SELECT chrom,txEnd,txEnd,"TSS",".",strand FROM knownGene WHERE strand = "-";' \
| tail -n +2 >> tss.bed
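
On the question of a single SQL statement: MySQL's IF(condition, a, b) function can choose the coordinate per row, so both strands could in principle be fetched in one query. The sketch below is an assumption-laden illustration rather than a tested replacement for Step 1: the function names fetch_tss_bed and tss_bed_from_rows are hypothetical, the query function is only defined (it needs network access), and the awk function applies the same strand rule locally so the logic can be checked on made-up rows.

```shell
#!/bin/sh
# Sketch only: one query for both strands. IF() picks txStart for "+" genes
# and txEnd for "-" genes; -N suppresses the header line, so no tail step is
# needed. Defined here but not run, because it needs network access.
fetch_tss_bed() {
  mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N -e \
    'SELECT chrom, IF(strand="+",txStart,txEnd), IF(strand="+",txStart,txEnd),
            "TSS", ".", strand FROM knownGene;'
}

# The same strand rule applied locally with awk, for checking the logic.
# Assumed input columns (tab-separated): chrom, strand, txStart, txEnd.
tss_bed_from_rows() {
  awk -F '\t' -v OFS='\t' '{
    tss = ($2 == "+") ? $3 : $4
    print $1, tss, tss, "TSS", ".", $2
  }'
}

# Two made-up rows; the coordinates are illustrative only.
printf 'chr1\t+\t100\t900\nchr2\t-\t1000\t2000\n' | tss_bed_from_rows
```

With the query version, something like fetch_tss_bed > tss.bed would stand in for the two separate commands above.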

Step 2 - use bedtools flank or slop to adjust the coordinates. The -s flag takes strand into account when determining left and right. Flank returns the bases next to your TSS; slop includes the TSS itself. A nice advantage of bedtools here is that you hand it a file of chromosome sizes, so it doesn't create coordinates beyond chromosome ends.

# create promoter coordinates, 1000 bases upstream of TSS for example
bedtools flank -i tss.bed -g hg19.chrom.sizes -s -l 1000 -r 0 > promoter_coords.bed
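
To make the strand handling and the chromosome-end clipping concrete, here is a small awk sketch of what I would expect bedtools flank -s -l N -r 0 to compute for a 6-column BED file. It is an illustration written for this thread, not bedtools' actual code; the function name upstream_flank and the sample sizes and coordinates are made up.

```shell
#!/bin/sh
# Hypothetical awk illustration of strand-aware upstream flanking.
# $1 = flank size, $2 = chromosome sizes file (chrom<TAB>size); BED on stdin.
upstream_flank() {
  awk -F '\t' -v OFS='\t' -v n="$1" '
    NR == FNR { size[$1] = $2; next }        # first file: chromosome sizes
    $6 == "-" {                              # minus strand: upstream lies to the right
      s = $3; e = $3 + n
      if (e > size[$1]) e = size[$1]         # clip at the chromosome end
      print $1, s, e, $4, $5, $6; next
    }
    {                                        # plus strand: upstream lies to the left
      s = $2 - n; if (s < 0) s = 0           # clip at the chromosome start
      print $1, s, $2, $4, $5, $6
    }
  ' "$2" -
}

# Made-up chromosome sizes and TSS records, chosen so both clips trigger.
sizes=$(mktemp)
printf 'chr1\t5000\nchr2\t2500\n' > "$sizes"
printf 'chr1\t100\t100\tTSS\t.\t+\nchr2\t2000\t2000\tTSS\t.\t-\n' | upstream_flank 1000 "$sizes"
rm -f "$sizes"
```

In the example the plus-strand record is clipped to start at 0 and the minus-strand record is clipped to end at 2500, the assumed chromosome length.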

Step 3 - use bedtools to extract DNA sequences:

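# extract DNA sequence from fasta file; -s reverse-complements minus-strand
# features so each promoter reads 5'->3' relative to its gene
bedtools getfasta -s -fi genomes/hg19/all_chr.fa -bed promoter_coords.bed -fo promoter_seq.fa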

This should work for any model organism at UCSC, just select the right db and table names.
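
One thing to keep in mind with Step 3 is strand: a promoter of a minus-strand gene is normally read as the reverse complement of the reference sequence, which is what the -s option of bedtools getfasta is for. As a rough sketch of what reverse-complementing means, the helper below does it for plain sequence lines. The name revcomp is hypothetical, and the function assumes one sequence per line with no FASTA header lines.

```shell
#!/bin/sh
# Hypothetical helper: reverse-complement each input line.
# Assumes plain sequence lines (no FASTA headers); characters other than
# A, C, G, T (either case) are reversed but left unchanged.
revcomp() {
  awk '{
    out = ""
    for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)  # reverse
    print out
  }' | tr 'ACGTacgt' 'TGCAtgca'                                   # complement
}

printf 'AACG\n' | revcomp
```

For the input AACG this prints CGTT.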

Answer by seidel, 5.7 years ago (updated 18 months ago by RamRS)

Look at the web site below for about 13,000 human promoter sequences.

http://www.people.virginia.edu/~akr4xc/Human%20Gene%20Promoters%20-%20Home.html

Answer by sallie7733brown, 2.9 years ago

If you wish to promote your curated dataset, then create a new post with the "tools" tag instead of posting the same message in multiple older threads.

Reply by genomax, 2.9 years ago

At the University of Virginia we have collected more than 13,000 promoters of human genes. These are available online for download at the URL given below.

http://www.people.virginia.edu/~akr4xc/Human%20Gene%20Promoters%20-%20Home.html

Answer by jcb2g, 23 months ago
