TFBS enrichiment analysis
1
3
Entering edit mode
5.2 years ago
mannoulag1 ▴ 120

Hi, I have a list of Arabidopsis gene symbols, how to do the enrichment analysis to identify their Transcription factors? It is possible to do this by TFBStools package? thanks

tfbstools R arabidopsis jaspar gene • 3.1k views
ADD COMMENT
0
Entering edit mode

What do you mean their transcription factors? TFs that bind to their promoters? Regardless, HOMER or AME from the MEME suite are probably your best bets. From a quick glance, it doesn't seem like the TFBStools package does motif enrichment analyses.

ADD REPLY
0
Entering edit mode

thank you very much, but I am looking for a R package.

ADD REPLY
0
Entering edit mode

PWMenrich may work for you then, though I've never used it and can't vouch for its results/ease of use.

ADD REPLY
0
Entering edit mode

thank you, I will try it.

ADD REPLY
7
Entering edit mode
5.2 years ago

One possible approach is described below, which involves some manual work, but which facilitates a bit more control over inputs, outputs, and parameters:

  1. Get the annotations of your genes. The positions of these annotations should, ideally, match the assembly version you are using for FIMO calls, described below.

    These annotations should be formatted in a sorted BED6+ file, or converted to one via gtf2bed, gff2bed or other conversion tools that output sorted BED, with the ID in the fourth column and the strand information in the sixth column.

  2. Pad out -1000/+200 of the annotation TSSs, by strand, via, e.g. bedops:

    $ awk '($6 == "+"){ print $1, ($2-1), $2, $4 }' annotations.bed | bedops --range -1000:200 --everything - > tss.pad.for.bed
    $ awk '($6 == "-"){ print $1, $3, ($3+1), $4 }' annotations.bed | bedops --range -200:1000 --everything - > tss.pad.rev.bed
    $ bedops --everything tss.pad.for.bed tss.pad.rev.bed > promoters.bed
    

    You might change these bounds depending on what you define as a promoter, or other regulatory region where TFs would bind to and regulate gene activity.

  3. Do a FIMO scan at 1e-4 or other p-value threshold against your plant TF database(s) of choice (TRANSFAC, JASPAR, Athamap and CIS-BP are possibilities, for instance).

    I have an answer on the Bioinformatics SE site that explains how to do a FIMO scan for hg19 (human), against the non-redundant JASPAR vertebrate TF model database: https://bioinformatics.stackexchange.com/a/2491/776

    If you use FIMO, you would repeat this or something similar for your assembly of Arabidopsis and for published TF model databases for Arabidopsis.

    The output of FIMO will be a collection of TF binding sites (TFBS) over your chosen assembly of Arabidopsis, in BED format. (Make sure that this result is sorted per sort-bed, as described in the SE answer.)

  4. Look for overlaps of, say, three or more bases between the file of padded TSSs (promoters.bed) and the TFBS that came out of running FIMO (fimo.bed):

    $ bedmap --echo --echo-map-id-uniq --delim '\t' --bp-ovr 3 promoters.bed fimo.bed > answer.bed
    

    Or if you want the full TFBS annotation, and not just the TF model names:

    $ bedmap --echo --echo-map --delim '\t' --bp-ovr 3 promoters.bed fimo.bed > answer.bed
    
  5. Repeat steps 1-4 of this analysis for background ("random") selections of genes over the whole genome. You could use shuf -n or sample or similar to get a random sample of genes from a text-formatted annotations file, then convert them to background promoters.

    Once you have a collection of TF model names for your genes-of-interest and for a random selection of genes-over-background, you could use a hypergeometric test to determine if any particular TFs are enriched, given the genes-of-interest.

    The following answer may help describe the use of this test in a more concrete way, for a similar scenario: A: Calculate if the co-occurring of two TFBSs is higher than one would expect by ch

ADD COMMENT
0
Entering edit mode

Thank you Alex for your help, but I am using R3.5.1 for Windows. Can you help me please?

ADD REPLY
0
Entering edit mode

I apologize, but I don't have any R-based way to do this, except for the hypergeometric test portion of the answer.

ADD REPLY
0
Entering edit mode

ok , thank you very much Alex.

ADD REPLY
0
Entering edit mode

Hi Alex, thanks a lot for pointing me to the answer! Couple questions: 1. step4 is filtering to keep TFBS overlapping with annotated promoter regions? I am not very familiar with FIMO yet, why would it output TFBS that are outside of the provided promoter region? 2. step5 is essentially what Homer is doing as the last step to identify enriched motif? If I have one gene, and trying to find the TF binding motif for that gene, can I still use this approach?

ADD REPLY
0
Entering edit mode

Some possible answers:

  1. In the approach I describe, FIMO scans the entire genome for binding sites (in practice, minus masked regions) — more than just promoters. So using bedmap restricts the set of binding sites to those which overlap by at least (in this case) three bases with a promoter region. You could specify other overlap criteria — something more or less relaxed — depending on how stringent you want to be. The idea is to connect promoters for your gene of interest, by filtering the TF binding sites that FIMO calls for those which overlap your promoters.

  2. Someone more familiar with Homer may have a different answer, but I believe Homer is analogous to the MEME tool, which discovers motifs, not binding sites. MEME scans the regions of interest, looking for patterns of sequences, calling them motifs. The output of MEME is one or more PWMs — positional weight matrices — which give a numerical definition to these motifs. FIMO takes these PWMs as input and looks over the genome for sites where these patterns are found. If the PWMs are made from data for transcription factors, then these sites would be called TFBS. I may be wrong about what Homer does, but I think that's what it does, mainly.

  3. You could certainly use these steps for looking for TFBS in only one gene's promoter. You might still keep the FIMO results around, though, in case you run into other genes-of-interest where you might be interested in regulatory TFs. You might also explore regulatory networks connected to your tissue of interest and potential TFs of interest (http://regulatorynetworks.org, for instance).

This is a computational approach. While FIMO can run some numbers and tell you that a binding site exists with so-and-so statistical significance, ChIP-seq data is also useful for giving experimental, biological confirmation that a TF actually binds to a gene's regulatory regions. Queries to ENCODE might be useful here, also (https://encodeproject.org).

ADD REPLY

Login before adding your answer.

Traffic: 1479 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6