Question

TFBS enrichment analysis

3

5.2 years ago

mannoulag1 ▴ 120

Hi, I have a list of Arabidopsis gene symbols. How can I do an enrichment analysis to identify their transcription factors? Is it possible to do this with the TFBSTools package? Thanks.

tfbstools R arabidopsis jaspar gene • 3.1k views

ADD COMMENT • link updated 4.1 years ago by Alex Reynolds 35k • written 5.2 years ago by mannoulag1 ▴ 120

0

What do you mean by "their transcription factors"? TFs that bind to their promoters? Regardless, HOMER or AME from the MEME suite are probably your best bets. From a quick glance, it doesn't seem like the TFBSTools package does motif enrichment analyses.

ADD REPLY • link 5.2 years ago by jared.andrews07 ★ 16k

0

Thank you very much, but I am looking for an R package.

ADD REPLY • link 5.2 years ago by mannoulag1 ▴ 120

0

PWMEnrich may work for you then, though I've never used it and can't vouch for its results/ease of use.

ADD REPLY • link 5.2 years ago by jared.andrews07 ★ 16k

0

thank you, I will try it.

ADD REPLY • link 5.2 years ago by mannoulag1 ▴ 120

score 7 · Answer 1 · 2019-01-21

One possible approach is described below, which involves some manual work, but which facilitates a bit more control over inputs, outputs, and parameters:

Get the annotations of your genes. The positions of these annotations should, ideally, match the assembly version you are using for FIMO calls, described below.

These annotations should be formatted in a sorted BED6+ file, or converted to one via gtf2bed, gff2bed or other conversion tools that output sorted BED, with the ID in the fourth column and the strand information in the sixth column.
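If the annotation file covers the whole genome, it first needs to be narrowed to the genes of interest. A minimal sketch, assuming a one-ID-per-line text file and that those IDs match the fourth BED column exactly (gene symbols would have to be mapped to the annotation's IDs beforehand); the function name `subset_bed_by_ids` and the file names are hypothetical:

```shell
# Keep only annotation rows whose ID (column 4) appears in a one-ID-per-line list.
# Assumes a tab-delimited BED6+ file; the input order (and so the sort order) is preserved.
# Usage: subset_bed_by_ids ids.txt annotations.bed > genes_of_interest.bed
subset_bed_by_ids() {
  awk -F'\t' 'NR == FNR { keep[$1] = 1; next } ($4 in keep)' "$1" "$2"
}
```

Because rows are emitted in their original order, a sorted input stays sorted.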

Pad the annotation TSSs by 1000 bases upstream and 200 bases downstream, respecting strand, with a tool such as bedops. Setting the awk field separators to tabs keeps the intermediate output valid BED:

```
$ awk -v FS='\t' -v OFS='\t' '($6 == "+"){ print $1, ($2-1), $2, $4 }' annotations.bed | bedops --range -1000:200 --everything - > tss.pad.for.bed
$ awk -v FS='\t' -v OFS='\t' '($6 == "-"){ print $1, $3, ($3+1), $4 }' annotations.bed | bedops --range -200:1000 --everything - > tss.pad.rev.bed
$ bedops --everything tss.pad.for.bed tss.pad.rev.bed > promoters.bed
```

You might change these bounds depending on what you define as a promoter, or as another regulatory region where TFs would bind and regulate gene activity.

Do a FIMO scan at 1e-4 or other p-value threshold against your plant TF database(s) of choice (TRANSFAC, JASPAR, AthaMap and CIS-BP are possibilities, for instance).

I have an answer on the Bioinformatics SE site that explains how to do a FIMO scan for hg19 (human), against the non-redundant JASPAR vertebrate TF model database: https://bioinformatics.stackexchange.com/a/2491/776

If you use FIMO, you would repeat this or something similar for your assembly of Arabidopsis and for published TF model databases for Arabidopsis.

The output of FIMO will be a collection of TF binding sites (TFBS) over your chosen assembly of Arabidopsis, in BED format. (Make sure that this result is sorted per sort-bed, as described in the SE answer.)
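FIMO itself writes a tab-separated table rather than BED, so a conversion step is needed. The sketch below assumes the header layout used by recent MEME Suite releases (columns named `motif_id`, `sequence_name`, `start`, `stop`, `strand`, `score`, with 1-based inclusive coordinates); older releases name the columns differently, so check the header of your own file first. The function name `fimo_tsv_to_bed` is hypothetical.

```shell
# Convert a FIMO TSV table to BED6: chrom, 0-based start, end, motif ID, score, strand.
# Assumes the first line is a header containing the column names listed above.
# Usage: fimo_tsv_to_bed fimo.tsv | sort-bed - > fimo.bed
fimo_tsv_to_bed() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column indices
    /^#/ || NF == 0 { next }                                   # skip trailing comments and blank lines
    {
      print $(col["sequence_name"]), $(col["start"]) - 1, $(col["stop"]),
            $(col["motif_id"]), $(col["score"]), $(col["strand"])
    }
  ' "$1"
}
```

Looking columns up by name rather than by position makes the conversion a little less sensitive to column-order differences, but it still depends on the assumed header names being present.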

Look for overlaps of, say, three or more bases between the file of padded TSSs (promoters.bed) and the TFBS that came out of running FIMO (fimo.bed):
```
$ bedmap --echo --echo-map-id-uniq --delim '\t' --bp-ovr 3 promoters.bed fimo.bed > answer.bed
```
Or if you want the full TFBS annotation, and not just the TF model names:
```
$ bedmap --echo --echo-map --delim '\t' --bp-ovr 3 promoters.bed fimo.bed > answer.bed
```
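To turn the first form of `answer.bed` into per-TF counts for a later enrichment test, something like the following could work. It is a sketch that assumes `promoters.bed` has four columns (as produced by the padding step), so the mapped IDs land in column 5, and that bedmap's default `;` separator between multiple IDs was left unchanged. The function name `count_tf_hits` is hypothetical.

```shell
# Count, for each TF model ID, how many promoters have at least one overlapping site.
# Input: output of bedmap --echo --echo-map-id-uniq --delim '\t' on a 4-column promoters file.
# Usage: count_tf_hits answer.bed > tf_counts.txt
count_tf_hits() {
  awk -F'\t' '
    {
      n = split($5, ids, ";")                      # column 5 holds the unique mapped IDs
      for (i = 1; i <= n; i++) if (ids[i] != "") count[ids[i]]++
    }
    END { for (tf in count) printf "%s\t%d\n", tf, count[tf] }
  ' "$1" | sort -k1,1
}
```

Since the IDs are already unique per promoter, each count is a number of promoters rather than a number of binding sites.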
Repeat steps 1-4 of this analysis (annotations, TSS padding, FIMO scan, and overlap mapping) for background ("random") selections of genes over the whole genome. You could use shuf -n, sample, or a similar tool to draw a random sample of genes from a text-formatted annotations file, then convert them to background promoters.
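A background draw might look like the sketch below, which assumes GNU `shuf` is available and uses a plain `sort` as a stand-in for `sort-bed` (prefer `sort-bed` when BEDOPS is installed). The function name `sample_background` is hypothetical, and the sample size is whatever you choose.

```shell
# Draw N random annotation rows and re-sort them by chromosome, start, and end.
# Usage: sample_background annotations.bed 500 > background.bed
sample_background() {
  shuf -n "$2" "$1" | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}
```

The resulting file can then go through the same strand-aware padding as the genes of interest.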

Once you have a collection of TF model names for your genes-of-interest and for a random selection of genes-over-background, you could use a hypergeometric test to determine if any particular TFs are enriched, given the genes-of-interest.
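As a rough illustration of the hypergeometric test for one TF, the sketch below computes the upper-tail probability P(X >= k) in awk. The notation is an assumption made for this example, not part of the answer: N is the total number of genes considered (genes of interest plus background), K is how many of those have the TF in their promoter, n is the number of genes of interest, and k is how many genes of interest have the TF. The function name `hypergeom_upper_tail` is hypothetical.

```shell
# Upper-tail hypergeometric p-value: probability of seeing k or more hits among n draws
# from a population of N genes, K of which carry the TF.
# Usage: hypergeom_upper_tail N K n k
hypergeom_upper_tail() {
  awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
    # log of the binomial coefficient C(a, b), summed term by term to avoid overflow
    function lchoose(a, b,   j, s) {
      s = 0
      for (j = 1; j <= b; j++) s += log(a - b + j) - log(j)
      return s
    }
    BEGIN {
      lo = k
      if (lo < n - (N - K)) lo = n - (N - K)   # fewest hits that are possible at all
      if (lo < 0) lo = 0
      hi = (K < n) ? K : n                     # most hits that are possible
      denom = lchoose(N, n)
      p = 0
      for (i = lo; i <= hi; i++)
        p += exp(lchoose(K, i) + lchoose(N - K, n - i) - denom)
      printf "%.6g\n", p
    }'
}
```

Running this once per TF model gives one p-value per TF; with many TFs tested, a multiple-testing correction would normally be applied afterwards.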

The following answer may help describe the use of this test in a more concrete way, for a similar scenario: A: Calculate if the co-occurring of two TFBSs is higher than one would expect by ch