It's a bit of work, maybe, but perhaps the following set operations could guide some investigation.
Your example TF is MA0528.1, which is a Jaspar identifier.
For your genome of interest, you could run that genome's sequence through FIMO to call binding sites of Jaspar TF models at some p-value threshold, say 1e-4 or 1e-5. Say the resulting hits, converted to sorted BED, are in a file called tbfs.jaspar.1e-5.bed.
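As a rough sketch of that FIMO-to-BED step, the helper below turns FIMO's tab-separated text output into sorted BED6 using only awk and sort. The function name and the file names in the usage comment are made up for illustration, and the column order is an assumption based on recent MEME Suite releases (older releases lack the motif_alt_id column), so check the header of your own output first.

```shell
# Hypothetical helper: convert FIMO's tab-separated output to sorted BED6.
# Assumed column order (recent MEME Suite releases):
#   motif_id, motif_alt_id, sequence_name, start, stop, strand, score, p-value, ...
# FIMO coordinates are 1-based and closed, so the start is shifted by one for BED.
fimo_tsv_to_bed() {
    awk -F'\t' -v OFS='\t' '
        /^#/ || $1 == "motif_id" || NF < 8 { next }   # skip comments, header, blank lines
        { print $3, $4 - 1, $5, $1, $7, $6 }          # chrom, start, end, motif, score, strand
    ' "$@" | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}

# Example usage (file names are placeholders):
#   fimo --thresh 1e-5 --text jaspar.meme genome.fa | fimo_tsv_to_bed > tbfs.jaspar.1e-5.bed
```

BEDOPS tools expect sorted input, so in practice you may prefer to pipe the result through sort-bed instead of relying on the plain sort used here.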
Given a set of whole-genome binding sites, you can then filter that set using the proximal promoters of all genes of interest (genes.bed). These could be Gencode genes in GFF format, converted to BED via gff2bed, or obtained by a similar approach.
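Proximal promoters could be defined as a 1 kb region upstream of each gene's TSS. The following pads each gene by 1 kb on its upstream side, taking strand into account (note that the padded intervals still include the gene body):
$ bedops -u <( awk '($6=="+")' genes.bed | bedops --range -1000:0 --everything - ) <( awk '($6=="-")' genes.bed | bedops --range 0:1000 --everything - ) > promoters.bed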
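If you want only the window upstream of the TSS, without the gene body, one option is to compute it directly from the strand. The sketch below is an illustration rather than part of BEDOPS: the function name and the UPSTREAM variable are made up, and it assumes BED6 input with the strand in column 6.

```shell
# Hypothetical helper: emit the window immediately upstream of each gene's TSS.
# Assumes BED6 input with strand in column 6; window size defaults to 1000 bp
# and can be overridden through the (made-up) UPSTREAM environment variable.
upstream_windows() {
    awk -F'\t' -v OFS='\t' -v w="${UPSTREAM:-1000}" '
        $6 == "+" { s = $2 - w; if (s < 0) s = 0; e = $2 }   # plus strand: TSS is the start
        $6 == "-" { s = $3; e = $3 + w }                     # minus strand: TSS is the end
        ($6 == "+" || $6 == "-") && e > s { $2 = s; $3 = e; print }
    ' "$@" | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}

# Example usage:
#   upstream_windows genes.bed > promoters.bed
```

Windows are clipped at coordinate 0 but not at chromosome ends, so minus-strand genes close to the end of a chromosome may need an extra check against chromosome sizes.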
Then filter the whole-genome TFBS set:
$ bedops --element-of 1 tbfs.jaspar.1e-5.bed promoters.bed > tbfs.jaspar.1e-5.subset.bed
Then grep this subset for MA0528.1:
$ grep MA0528.1 tbfs.jaspar.1e-5.subset.bed > MA0528.1.hits.bed
and map these hits back to the genes:
$ bedmap --range 1000 --echo --skip-unmapped genes.bed MA0528.1.hits.bed > answer.bed
You might add overlaps with TF-specific ChIP-seq data as experimental evidence that the TFs of interest actually bind the promoters of the genes in answer.bed in real life.
Some transcription factors do, literally, have many thousands of targets. Look up oestrogen ('estrogen', in US English) receptor α (alpha) and Myc, for example. Keep in mind that a transcription factor doesn't know what its targets are; it just binds wherever the interaction is energetically favourable, which is determined by the DNA sequence motif at the target and the DNA-binding surface of the transcription factor. Where binding is sufficiently strong, it may exert its effects; where binding is weak, the effect may be weaker or non-existent.
Also, the target regions have to be accessible for binding to occur: different regions of chromatin will be 'open' (accessible) in different tissues. Accessibility can be gauged by ATAC-seq.
Using the programs that you have already tried, you should be able to order the targets by some sort of score and/or decide whether tissue-specific differences may be at play.
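As a small illustration of ordering by score, the sketch below ranks motif hits from strongest to weakest. The function name is made up, and it assumes the motif score sits in column 5 of the hits file, which depends on how the FIMO output was converted to BED.

```shell
# Hypothetical helper: order hits from highest to lowest motif score.
# Assumes tab-separated input with the score in column 5.
rank_hits() {
    LC_ALL=C sort -t "$(printf '\t')" -k5,5nr "$@"
}

# Example usage:
#   rank_hits MA0528.1.hits.bed | head
```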