Question

Find Transcription factor binding site for a gene list

0

Entering edit mode

5.6 years ago

salvatore.raieli2 ▴ 90

Hi everyone,

I have a list of genes that are different expressed between tumor and normal tissue. I have two drugs that block two different transcriptional factor (let's say TF1 and TF2). What I want to do is to find which of these genes have in their promoter a TF binding site for TF1, for TF2 and for both.

For a similar task with a singular gene I did this:

library("TFBSTools")
library("JASPAR2018")
library("Biostrings")
#to retrieve the matrix for a specific tf in thre jaspar database
opts <- list()
opts[["species"]] <- 9606
opts[["name"]] <- "MYC"
PFMatrixList <- getMatrixSet(JASPAR2018, opts)
pwm <- toPWM(PFMatrixList, pseudocounts=0.8) #generation of PWM matrix for MYC
seq1 <- read.delim(...)  #the sequence of a gene with the promoter I downloaded from NCBI
subject <- DNAString(seq1) #making as DNA string
#finding the site in the sequence
siteset <- searchSeq(pwm, subject, seqname="seq1", min.score="60%", strand="*")

I have around 100 genes I think that this approach is impraticable.

Thank you in advance

Salvo

R TFBS promoter • 4.4k views

ADD COMMENT • link updated 5.5 years ago by Praneet Chaturvedi ▴ 120 • written 5.6 years ago by salvatore.raieli2 ▴ 90

2

Entering edit mode

First of all, what you aim to find is a transcription factor motif. If a motif is also a binding site needs to be determined by experiment, because not every motif automatically means that a TF will bind there. For example, a motif in heterochromatin (and there a thousands of "unused" motifs for every factor across the genome) is unlikely to ba bound by a TF. You should have a look at FIMO from the MEME suite. It takes as input a fasta file with sequences, e.g. your promoter sequences, and a motif position frequency matrix, e.g. from JASPAR, and then scans the sequences for motif occurrence, outputting a GFF file.

ADD REPLY • link 5.6 years ago by ATpoint 81k

0

Entering edit mode

thank you, I will give a look to FIMO. I am not sure it is what I need. If I understand well I have to provide the FASTA for each gene I want to scan, right? Meaning that I have to manually download 100 genes and their promoters in the FASTA format. With FIMO can I automize this process?

ADD REPLY • link 5.6 years ago by salvatore.raieli2 ▴ 90

1

Entering edit mode

You can pass a multifasta to FIMO, in the form:

>chr1:1-100
ATAGCTACG(...)
>chr2:1:1000
ATGGACTA(...)

Export the list of genomic coordinates to disk, and then use bedtools getfasta to make the multifasta. This can be fed into FIMO, which is fully automated from this point on.

ADD REPLY • link 5.6 years ago by ATpoint 81k

0

Entering edit mode

If I understood well, I get the genomic coordinates for my list of genes and with bedtools getfasta I obtain the fasta, I take all the fasta in one multifasta that I feed to FIMO, right?

ADD REPLY • link 5.6 years ago by salvatore.raieli2 ▴ 90

1

Entering edit mode

If you feed a BED file with all the coordinates you have (your promoter regions), then bedtools will produce a multifasta file, each line representing one line of coordinates:

$ cat test.bed 
chr1    100000  100010
chr2    200000  200020

$ bedtools getfasta -fi hg38_noALT_withDecoy.fa -bed test.bed 
>chr1:100000-100010
ACTAAGCACA
>chr2:200000-200020
GTCTTAATATATACATAGGT

This multifa you feed into FIMO.

ADD REPLY • link 5.6 years ago by ATpoint 81k

0

Entering edit mode

how I can generate a BED file just for the promoter regions of a list of genes?

ADD REPLY • link 5.6 years ago by salvatore.raieli2 ▴ 90

1

Entering edit mode

By querying a database of promoter regions if available, and if not available, creating one.

ADD REPLY • link 5.6 years ago by Ram 43k

1

Entering edit mode

For simplicity, you could take a window upstream of your genes, say 250bp.

ADD REPLY • link updated 5.6 years ago by WouterDeCoster 47k • written 5.6 years ago by ATpoint 81k

0

Entering edit mode

I was thinking something similar if I am not finding a good database

ADD REPLY • link 5.6 years ago by salvatore.raieli2 ▴ 90

0

Entering edit mode

I think you do not need a database. A promoter is always upstream of the first exon. When you do ATAC-seq (scan for open chromatin), a typical peak from a nucleosome-free region is hardly larger than 200bp. For motif scanning, in order to avoid matches by change, you anyway should limit the size of the regions you scan. I would go for e.g. 50bp downstream and 200bp upstream of the first exon, resulting in the 250bp, and then run the analysis. See if it works out, which I think it should.

ADD REPLY • link 5.6 years ago by ATpoint 81k

Ram · Answer 1 · 2018-10-31

1

Entering edit mode

5.5 years ago

Praneet Chaturvedi ▴ 120

You can use Homer's capability to get TF binding sites in the promoter regions of genes of interest

There are some databases as well:

TF2DNA [Allows batch search]
CISBP [requires sequences of promoters]

Cheers !!

ADD COMMENT • link updated 5.5 years ago by Ram 43k • written 5.5 years ago by Praneet Chaturvedi ▴ 120