How to get promoter regions from a genome
9.2 years ago
Affan ▴ 300

I have a PWM whose accuracy I'd like to assess. The problem with scanning the ENTIRE genome (other than the fact that it's an overnight code run) is that it yields a lot of false positives compared to true positives.

From some posts on Biostars, it seems to make more sense to scan only the promoter regions with my PWM. Is there an easy way to extract promoter regions for hg18 and mm9?

promoter pwm • 3.9k views

Would I have to deal with gene information? The transcription factor my PWM is based on is for muscle genes. Would I have to look only for promoters around these genes?


I am working with bacteria, and several papers have reported binding sites inside ORFs. So on the one hand it would make sense to scan the whole genome. On the other hand, as you said, this will create a lot of false hits. A first and relatively conservative approach would be to scan only the upstream regions. This is commonly done, and you will most likely find the more conserved boxes.
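
To make "upstream region" concrete, here is a minimal Python sketch, not a definitive implementation. It assumes you already have a TSS position and strand per gene (for example from a gene annotation table) and that coordinates are 0-based, half-open, BED-style. The function name `upstream_region` and its parameters are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str


def upstream_region(chrom: str, tss: int, strand: str,
                    upstream: int = 1000, downstream: int = 0,
                    chrom_len: Optional[int] = None) -> Interval:
    """Return the window around a TSS, taking strand into account.

    `tss` is the 0-based position of the first transcribed base.
    On the minus strand, "upstream" means higher genomic coordinates.
    """
    if strand == "+":
        start = tss - upstream
        end = tss + downstream
    elif strand == "-":
        start = tss + 1 - downstream
        end = tss + 1 + upstream
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")

    # Clamp to the chromosome so the window never runs off either end.
    start = max(start, 0)
    if chrom_len is not None:
        end = min(end, chrom_len)
    return Interval(chrom, start, end, strand)
```

The resulting intervals can be written out as BED lines and used to pull sequence from the genome FASTA with whatever tool you prefer. The 1000 bp default is only a placeholder; pick the length that suits your organism.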


So I've built my PWM based on TF data. I already have the start and end coordinates for the TF binding site. Would it be sufficient to just grab a 1000 bp neighbourhood around this binding site? Is 1000 bp enough to encompass the entire promoter region for this particular binding site?


For prokaryotes, 300 to 500 bp is commonly used.


I'm working with mm9, so would a 1000 bp (or 2000 bp) neighbourhood be sufficient to calculate the accuracy?


I don't really know the genes. I just know the coordinates of where the transcription factor binds. Is it sufficient for me to take a 2000 bp neighbourhood around this coordinate? I would imagine that encompasses the binding site.

In that case you cannot define the range for binding sites (which could have more variation). But you can start with ±1 kb and extend the search up to ±5 kb.
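
As a rough sketch of that "start small, then extend" idea, the Python snippet below builds windows of increasing flank size around a known binding-site interval. It assumes 0-based, half-open coordinates; the function name `flanked_windows` and the default flank sizes are hypothetical and only there for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def flanked_windows(start: int, end: int,
                    flanks: Sequence[int] = (1000, 2000, 5000),
                    chrom_len: Optional[int] = None) -> List[Tuple[int, int, int]]:
    """Return (flank, window_start, window_end) for each flank size.

    `start` and `end` delimit the known binding site (0-based, half-open).
    Windows are clamped to [0, chrom_len) when chrom_len is given.
    """
    windows = []
    for flank in flanks:
        w_start = max(start - flank, 0)
        w_end = end + flank
        if chrom_len is not None:
            w_end = min(w_end, chrom_len)
        windows.append((flank, w_start, w_end))
    return windows
```

Each window can then be scanned separately, so you can see how the hit list changes as the flank grows.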

Okay, thanks. I am not sure that taking a range will help me, though. Whether it's a 1 kb or 5 kb neighbourhood, there is only one true site in there, given by my coordinate. However, the larger my neighbourhood, the more false positives I get. I think I'll take a 2 kb neighbourhood, as promoter regions are up to 1000 bp anyway.
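
To illustrate the trade-off described here, this is a minimal Python sketch of scanning a window with a PWM and splitting the hits into those that overlap the known site and those that do not. It is an assumption-laden toy: the PWM is represented as a list of per-position dicts of log-odds scores, only the forward strand is scanned, and the names `scan` and `count_hits` are hypothetical. Since every extra position scanned is another chance for a random match, the second count tends to grow with the window length.

```python
from typing import Dict, List, Tuple

PWM = List[Dict[str, float]]  # one dict of base -> log-odds score per motif position


def scan(seq: str, pwm: PWM, threshold: float) -> List[int]:
    """Return 0-based start offsets in seq whose PWM score is >= threshold.

    Forward strand only; bases missing from a column (e.g. 'N') score -inf.
    """
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(col.get(base, float("-inf"))
                    for col, base in zip(pwm, seq[i:i + width]))
        if score >= threshold:
            hits.append(i)
    return hits


def count_hits(hits: List[int], width: int,
               true_start: int, true_end: int) -> Tuple[int, int]:
    """Return (true_positives, false_positives).

    A hit counts as a true positive if it overlaps the known site
    [true_start, true_end), given in the same coordinates as the hits.
    """
    tp = sum(1 for h in hits if h < true_end and h + width > true_start)
    return tp, len(hits) - tp


# Toy example: a 3-column PWM that prefers "ACG".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
toy_seq = "TTACGTTTACGTT"
toy_hits = scan(toy_seq, toy_pwm, threshold=3.0)
toy_counts = count_hits(toy_hits, len(toy_pwm), true_start=2, true_end=5)
```

Running the same count over windows of different sizes around the known coordinate shows directly how many extra hits each increase in window size brings in.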


Great, you finally came to a conclusion. I still do not understand what your actual question is about! If you are looking for the exact transcription factor binding site, the length is usually between 5 and 31 nucleotides. The question you started with was how to extract promoter regions, which already include TF binding sites. Anyway, you found a solution. Best of luck.

