How to identify TATA boxes in a gene list
7.1 years ago
Lila M ★ 1.2k

Hi everybody,

This is the first time I've tried to look for TATA boxes in a gene list. I used HOMER (findMotifs.pl) to do that, but I don't know how to read the results. Is there a way to identify only the TATA boxes for a set of genes? Ideally, I would like to know what fraction of the genes in my list have a TATA box.

Thank you very much in advance!

TATA homer gene • 4.2k views

Could you clarify what you mean by "gene list"? If it is a FASTA file with the gene sequences of your organism, such as a transcriptome, you won't find TATA boxes (or you will find only false positives), because the TATA box usually lies upstream of the transcription start site.


HOMER accepts a list of Ensembl gene IDs, so that is what I am using at the moment.


Did you use the -find <motif file> option?


No. How can I create this "motif file"?


Read the documentation in the link you sent me carefully: http://homer.ucsd.edu/homer/microarray/index.html

The section "Finding Instances of Specific Motifs" explains what you need.

Also: http://homer.ucsd.edu/homer/motif/creatingCustomMotifs.html


Yes, I did, but I don't find it very intuitive. What I am trying to do is download the TATA motif and pass it as the <motif file>. Could that work?

findMotifs.pl gene human  -find motif_TATA.motif > TATA_results

Thank you!


I can't tell unless you paste the contents of motif_TATA.motif...

I assume you have to specify one motif name per line. Didn't the second link help with that?


Yes, and I assume that is the same, right?

>CCTTTTATAGNC   TATA-Box(TBP)/Promoter/Homer,BestGuess:TATA-Box(TBP)/Promoter/Homer(1.000)  5.682591    -3.995489e+02   0   124715.9,9565.0,11407.0,1769.0,0.00e+00
0.025   0.606   0.283   0.086
0.006   0.85    0.004   0.14
0.091   0.125   0.001   0.783
0.001   0.048   0.001   0.95
0.297   0.001   0.001   0.701
0.234   0.001   0.001   0.764
0.964   0.034   0.001   0.001
0.351   0.001   0.172   0.476
0.704704704704705   0.00900900900900901   0.285285285285285   0.001001001001001
0.042   0.2   0.716   0.042
0.178178178178178   0.355355355355355   0.328328328328328   0.138138138138138
0.117   0.51    0.304   0.069

Was this created with seq2profile.pl ?

seq2profile.pl <consensus> [# mismatches] [name] > output.motif

e.g. seq2profile.pl TATA 0 tata > output.motif


I don't think that is necessary. Since I downloaded the matrix from HOMER (pasted above), it should be recognized.


Then please show us the output that you can't understand, and we'll see whether one of us can! ;)


Here is the output. The sequences it reports are not exactly the TATA sequence (I don't know whether HOMER only reports the most similar match), so how can I be sure that the genes in the output really have TATA boxes?

GeneID  PromoterID  Offset  Sequence    Motif Name  Strand  MotifScore  Unigene Refseq  Ensembl Name    Alias   Orf Chr Description Type
644353  NM_001143978    -276    CGGTCTAAAAGC    TATA-Box(TBP)/Promoter/Homer,BestGuess:TATA-Box(TBP)/Promoter/Homer(1.000)  -   8.308651    Hs.648338   NM_001143978    ENSG00000166707 ZCCHC18 PNMA7B|SIZN2    -   Xq22.2  zinc finger CCHC-type containing 18 protein-coding
25771   NM_014346   -66 CCTTTAATAACG    TATA-Box(TBP)/Promoter/Homer,BestGuess:TATA-Box(TBP)/Promoter/Homer(1.000)  +   7.344113    Hs.435044   NM_014346   ENSG00000054611 TBC1D22A    C22orf4|HSC79E021   -   22q13.31    TBC1 domain family member 22A   protein-coding
23464   NM_014291   -63 CTTTTTAAGCGA    TATA-Box(TBP)/Promoter/Homer,BestGuess:TATA-Box(TBP)/Promoter/Homer(1.000)  +   6.041534    Hs.54609    NM_014291   ENSG00000100116 GCAT    KBL -   22q13.1 glycine C-acetyltransferase protein-coding
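
To see how the MotifScore relates to the matrix, here is a minimal sketch. It assumes the score is the sum of ln(p / 0.25) over the 12 motif positions (natural log, uniform background), which is my reading of the HOMER documentation and not something confirmed in this thread. The function names are made up for illustration. The matrix rows and the three hits are copied from the posts above, and the "-" strand hit is reverse-complemented before scoring.

```python
import math

# Matrix rows pasted earlier in the thread (columns: A, C, G, T).
MOTIF_ROWS = """\
0.025 0.606 0.283 0.086
0.006 0.85 0.004 0.14
0.091 0.125 0.001 0.783
0.001 0.048 0.001 0.95
0.297 0.001 0.001 0.701
0.234 0.001 0.001 0.764
0.964 0.034 0.001 0.001
0.351 0.001 0.172 0.476
0.704704704704705 0.00900900900900901 0.285285285285285 0.001001001001001
0.042 0.2 0.716 0.042
0.178178178178178 0.355355355355355 0.328328328328328 0.138138138138138
0.117 0.51 0.304 0.069
"""

PWM = [tuple(float(x) for x in line.split()) for line in MOTIF_ROWS.splitlines()]
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def log_odds_score(seq, strand="+", background=0.25):
    """Sum ln(p / background) over positions (background 0.25 is an assumption)."""
    seq = seq.upper()
    if strand == "-":
        seq = reverse_complement(seq)
    if len(seq) != len(PWM):
        raise ValueError("sequence length must equal motif length")
    return sum(math.log(row[BASE_INDEX[base]] / background)
               for row, base in zip(PWM, seq))


# Hits copied from the output above: (sequence, strand, reported MotifScore).
HITS = [
    ("CGGTCTAAAAGC", "-", 8.308651),
    ("CCTTTAATAACG", "+", 7.344113),
    ("CTTTTTAAGCGA", "+", 6.041534),
]

for seq, strand, reported in HITS:
    print(seq, strand, round(log_odds_score(seq, strand), 3), reported)
```

Under these assumptions the recomputed values agree with the three reported MotifScore values to within rounding. So a hit that does not spell TATA exactly can still score well when its other bases fit the matrix. If I read the motif file format correctly, the third field of the header line (5.682591) is the log-odds detection threshold, which would mean every reported hit is already at or above that value.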

Predictions of this kind usually come with an E-value or a probability score. In this case, every line has a lod-score (logarithm of the odds), given in field number 7 as "MotifScore". The documentation you linked says, at some point:

"Motif Score (log odds score of the motif matrix, higher scores are better matches)"

A good approach would be to plot all the lod-scores and look at their distribution to infer which hits are good and which are not.
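
A rough sketch of that idea in plain Python, binning the scores into a text histogram. The function name `score_histogram` and the bin width are my own choices, not anything from HOMER. Only the three scores quoted above are used here, so with a real result file you would pass in the whole MotifScore column instead.

```python
import math
from collections import Counter


def score_histogram(scores, bin_width=1.0):
    """Count scores per bin; each key is the lower edge of a bin."""
    counts = Counter(math.floor(s / bin_width) * bin_width for s in scores)
    return dict(sorted(counts.items()))


# The three MotifScore values quoted earlier in this thread.
scores = [8.308651, 7.344113, 6.041534]

hist = score_histogram(scores, bin_width=1.0)
for lower_edge, count in hist.items():
    print(f"{lower_edge:5.1f} {'#' * count} {count}")
```

With thousands of hits, a gap or a shoulder in the histogram is one possible place to put a cut-off, although that remains a judgment call.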


You are right, but for me "higher scores are better matches" is not very informative. What is considered high or low? How can I set a proper cut-off? In my opinion there is not much information, and it is a bit difficult to be sure about the result...


Plot them > see the distribution > see what others do > decide your thresholds.

There is no good threshold, a lab guru once said. :)

What others do:

What Is The Lod Score Replication Threshold For Linkage Analysis?

http://www.bio.brandeis.edu/InterpGenes/Project/align16.htm

https://www.mun.ca/biology/scarr/LOD_analysis.html
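
Once a threshold is chosen, the number asked for in the original question (the fraction of genes in the list with a TATA hit) could be computed along these lines. This is only a sketch: it assumes the -find output is tab-delimited with the GeneID and MotifScore columns shown above, that genes without any hit are simply absent from it, and that the denominator comes from your own input gene list. The function names, the cut-off of 7.0 and the list size of 10 are hypothetical placeholders.

```python
import csv
import io


def genes_with_hit(find_output, min_score):
    """Return the set of GeneIDs with at least one hit scoring >= min_score."""
    reader = csv.DictReader(find_output, delimiter="\t")
    return {row["GeneID"] for row in reader
            if float(row["MotifScore"]) >= min_score}


def tata_fraction(find_output, total_genes, min_score):
    """Fraction of the input gene list with at least one qualifying hit."""
    return len(genes_with_hit(find_output, min_score)) / total_genes


# First seven columns of the three rows quoted in this thread
# (motif name shortened for readability).
HEADER = ["GeneID", "PromoterID", "Offset", "Sequence",
          "Motif Name", "Strand", "MotifScore"]
ROWS = [
    ["644353", "NM_001143978", "-276", "CGGTCTAAAAGC", "TATA-Box", "-", "8.308651"],
    ["25771", "NM_014346", "-66", "CCTTTAATAACG", "TATA-Box", "+", "7.344113"],
    ["23464", "NM_014291", "-63", "CTTTTTAAGCGA", "TATA-Box", "+", "6.041534"],
]
sample = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)

# Hypothetical: pretend the input list had 10 genes and the cut-off is 7.0.
fraction = tata_fraction(io.StringIO(sample), total_genes=10, min_score=7.0)
print(fraction)
```

Collecting GeneIDs in a set means a gene with several hits in its promoter is counted only once. For a real run you would open the result file (TATA_results in the command above) instead of the in-memory sample.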


Thank you very much for the information! :)

7.1 years ago
theobroma22 ★ 1.2k

https://www.dnalc.org/resources/geneboy.html

Paste your sequence into GeneBoy to find the TATA box.


As I said previously, I don't have any sequences; I only have gene IDs.
