Question

Identify Target genes of an enhancer

1

Entering edit mode

7.8 years ago

Bioinformatist Newbie ▴ 270

I have the peak file of Active Enhancer mark (H3K27ac). I have decided the 100kb region up and down stream of TSS to link genes to the Enhancer. My peak file is as below:

chr8    145701383   145702686   Rank_1  88  .   5.97704 12.09649    8.84716
chr20   33850820    33852880    Rank_2  78  .   4.68799 10.87448    7.89185

I want to get a list of genes which are within 100kb region of these peak coordinates. Can anybody guide me how I can do that. I was looking into ChIPseeker package (annotatPeak function) but I was not getting a clue how to do it!

For example if I say first peak the "Enhancer_1" then I want to link all the TSS that lies with in 100 kb region domain with this enhancer, like a column containing all target genes as: Enhancer_1: Gene1, Gene2, Gene3 ... Gene20

Enhancer Peaks • 3.0k views

ADD COMMENT • link updated 7.6 years ago by Giovanni M Dall'Olio 28k • written 7.8 years ago by Bioinformatist Newbie ▴ 270

score 5 · Answer 1 · 2016-09-06

In R:

> library(Homo.sapiens) # from bioconductor
> library(dplyr) # optional, but for the %>% thing

# The import function from GRanges would be the way to import BED files,
# but in this case doesn't work correctly because there are extra columns
> # import("example.bed", "BED") 
> tf = read.delim("example.bed", header=F, stringsAsFactors=F)  %>% 
     dplyr::rename(chr=V1, start=V2, end=V3) %>% 
     makeGRangesFromDataFrame(., keep.extra.columns=T)

# Get the coordinates of all TSS in human genome. The TxDB object is part of Homo.sapiens.
# To get only one TSS per gene, use the genes() function instead of transcripts()
> human.tss = resize(transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene, columns=c('gene_id', 'tx_id')), width=1, fix='start')

# Get TSS within 100000 bp of your tf list:
> mergeByOverlaps(human.tss, tf + 100000)  #see also subsetByOverlaps
DataFrame with 54 rows and 9 columns
                     human.tss         gene_id     tx_id
                     <GRanges> <CharacterList> <integer>
1   chr8:145617095-145617095:+          203054     32948
2   chr8:145660551-145660551:+                     32949
3   chr8:145691738-145691738:+           90990     32950
4   chr8:145697960-145697960:+                     32951
5   chr8:145715417-145715417:+           84988     32952
...                        ...             ...       ...
50   chr20:33872619-33872619:-            3692     71966
51   chr20:33880225-33880225:-          128876     71967
52   chr20:33880225-33880225:-          128876     71968
53   chr20:33894862-33894862:-           55245     71969
54   chr20:33893057-33893057:-                     71979
                  tf...1e.05          V4        V5          V6               V7
                   <GRanges> <character> <integer> <character>      <character>
1   chr8:145601383-145802686      Rank_1        88           . 5.97704 12.09649
2   chr8:145601383-145802686      Rank_1        88           . 5.97704 12.09649
3   chr8:145601383-145802686      Rank_1        88           . 5.97704 12.09649
4   chr8:145601383-145802686      Rank_1        88           . 5.97704 12.09649
5   chr8:145601383-145802686      Rank_1        88           . 5.97704 12.09649
...                      ...         ...       ...         ...              ...
50   chr20:33750820-33952880      Rank_2        78           . 4.68799 10.87448
51   chr20:33750820-33952880      Rank_2        78           . 4.68799 10.87448
52   chr20:33750820-33952880      Rank_2        78           . 4.68799 10.87448
53   chr20:33750820-33952880      Rank_2        78           . 4.68799 10.87448
54   chr20:33750820-33952880      Rank_2        78           . 4.68799 10.87448

EDIT: for some reasons I misread your question at first, so I wrote how to get the closest genes instead. I'll leave this year for legacy.

# Get the closest genes
> neighbor.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)[nearest(tf, genes(TxDb.Hsapiens.UCSC.hg19.knownGene))]

# Get the identifier of the original TF, so you can know which TF is closest to each gene:
> mcols(neighbor.genes)$tf = tf$V4  

> neighbor.genes
GRanges object with 2 ranges and 2 metadata columns:
        seqnames                 ranges strand |     gene_id          tf
           <Rle>              <IRanges>  <Rle> | <character> <character>
   8928     chr8 [145699115, 145701718]      - |        8928      Rank_1
  55741    chr20 [ 33703160,  33865960]      - |       55741      Rank_2
  -------
GRanges object with 2 ranges and 1 metadata column:
        seqnames                 ranges strand |     gene_id
           <Rle>              <IRanges>  <Rle> | <character>
   8928     chr8 [145699115, 145701718]      - |        8928
  55741    chr20 [ 33703160,  33865960]      - |       55741
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome