How To Find Upstream/Downstream Bound Features With A Chip-Seq Analysis In Galaxy
10.1 years ago

Hi!

I have a set of ChIP-seq data, and I would like to find out whether the transcription factor KLF1 binds both upstream (i.e. in the promoter) and downstream of the transcription start site (TSS). For this analysis I have used the RefSeq annotation and defined the upstream and downstream regions of interest as the 1000 nucleotides upstream and downstream of the TSS, respectively. I first used the Get Flanks tool and then the INNERJOIN tool, and found 103 regions that overlap with my peaks. But how do I find out which of these 103 are bound upstream and which downstream?

ChIP-seq • 5.8k views
10.0 years ago

If you can wrap these command-line operations into Galaxy, then BEDOPS can help.

First, sort your input BED files. The sorted TSSs.bed file contains your TSSs and the sorted TFs.bed contains all the sites for transcription factors:

$ sort-bed TSSs.unsorted.bed > TSSs.bed
$ sort-bed TFs.unsorted.bed > TFs.bed

To find 1 kb upstream hits:

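# Pad each TSS 1 kb upstream, map TF sites onto the padded window,
# keep windows with a KLF1 hit, then restore the original TSS coordinates.
$ bedops --range -1000:0 --everything TSSs.bed \
    | bedmap --echo --echo-map --delim '\t' - TFs.bed \
    | grep -w "KLF1" - \
    | bedops --range 1000:0 --everything - \
    > TSSsContainingUpstreamKLF1hits.bed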

To find 1 kb downstream hits:

# Same pipeline, padding 1 kb downstream instead.
$ bedops --range 0:1000 --everything TSSs.bed \
    | bedmap --echo --echo-map --delim '\t' - TFs.bed \
    | grep -w "KLF1" - \
    | bedops --range 0:-1000 --everything - \
    > TSSsContainingDownstreamKLF1hits.bed

The last column of each output BED file contains a semicolon-delimited list of the hits upstream or downstream, respectively, of each qualifying TSS.

If you want to have all the hits in one output file:

# Same pipeline, padding 1 kb on both sides of the TSS.
$ bedops --range -1000:1000 --everything TSSs.bed \
    | bedmap --echo --echo-map --delim '\t' - TFs.bed \
    | grep -w "KLF1" - \
    | bedops --range 1000:-1000 --everything - \
    > TSSsContainingAllKLF1hits.bed
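
To see which TSSs are bound upstream only, downstream only, or on both sides, the upstream and downstream output files can be compared. The following is a minimal sketch using awk rather than BEDOPS. It assumes both files are tab-delimited and that the first three columns (chromosome, start, end) identify a TSS; the function name classify_tss is hypothetical.

```shell
# Label each TSS as "upstream", "downstream", or "both".
# Assumption: columns 1-3 (chrom, start, end) uniquely identify a TSS.
# Usage: classify_tss UPSTREAM.bed DOWNSTREAM.bed
classify_tss() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             key = $1 OFS $2 OFS $3
             # Record which input file each TSS was seen in.
             if (FILENAME == ARGV[1]) { up[key] = 1 } else { down[key] = 1 }
             seen[key] = 1
         }
         END {
             for (k in seen) {
                 if (up[k] && down[k]) label = "both"
                 else if (up[k])       label = "upstream"
                 else                  label = "downstream"
                 print k, label
             }
         }' "$1" "$2" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}
```

For example, `classify_tss TSSsContainingUpstreamKLF1hits.bed TSSsContainingDownstreamKLF1hits.bed` prints one line per TSS with its label in the fourth column. Like the commands above, this takes no account of strand, so "upstream" means lower genomic coordinates.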
10.1 years ago
bede.portz ▴ 540

I don't use Galaxy for this type of analysis, but one workaround could be to map your KLF1 ChIP-seq reads separately to a window 1 kb upstream of each RefSeq TSS and to a window 1 kb downstream of it. This will result in two files, each with a gene ID/TSS list. Joining them in Galaxy on the column containing the gene ID gives you the TSSs bound both upstream and downstream.

Bear in mind that mammalian genes can have multiple TSSs per gene, and not all of these TSSs may actually be used in a given cell type or under given conditions. Thus, you may identify a KLF1 peak as lying downstream of an annotated TSS when it is actually upstream of the TSS used in your cells, or vice versa. If you haven't already considered this, it may be worthwhile to refine the RefSeq TSS list to include only those TSSs not within X base pairs of another RefSeq TSS, and/or to remove genes with multiple TSSs. I can say from experience that this filtering can reduce the RefSeq TSS list by many thousands of TSSs, and in doing so may alter your results with respect to which genes are bound by KLF1 both upstream and downstream of the TSS.
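
The distance-based filtering described above could be sketched as follows. This is only an illustration, not the exact procedure used in the answer: it assumes a tab-delimited TSS BED file sorted by chromosome and start, measures distance between start coordinates, and ignores strand. The function name isolated_tss is hypothetical, and the threshold X is left to the user.

```shell
# Keep only TSSs farther than MIN_DIST bp from every other TSS on the
# same chromosome. Input must be sorted by chromosome, then start.
# Usage: isolated_tss MIN_DIST TSSs.bed
isolated_tss() {
    awk -v min_dist="$1" 'BEGIN { FS = "\t" }
        { chrom[NR] = $1; start[NR] = $2; line[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                # In sorted input, the nearest TSS is always an adjacent line.
                near_prev = (i > 1  && chrom[i-1] == chrom[i] && start[i] - start[i-1] <= min_dist)
                near_next = (i < NR && chrom[i+1] == chrom[i] && start[i+1] - start[i] <= min_dist)
                if (!near_prev && !near_next) print line[i]
            }
        }' "$2"
}
```

For example, `isolated_tss 1000 TSSs.bed > TSSs.isolated.bed` would drop every TSS that has a neighbour within 1000 bp; the value 1000 is only a placeholder for X. Removing genes with multiple TSSs is a separate filter not covered by this sketch.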

10.0 years ago
Ming Tommy Tang ★ 3.9k

Basically, you want to annotate the peaks. You can use Cistrome, which is built into Galaxy, for this: http://cistrome.org/ap/
