Question

tools for gene annotation

9.2 years ago

Affan ▴ 300

I have data from the tool ChromHMM. It splits the genome (or a chromosome) into bins (say 200 bp) and assigns each bin a state. For those interested, the assignment is based on an observation sequence, which is the combination of histone modifications; see the Ernst/Kellis 2009 paper.

The data looks like the following (in a .bed file):

dense file:

chr10   0       3020800 13      0       .       0       3020800 255,255,204
chr10   3020800 3021600 16      0       .       3020800 3021600 102,153,51
chr10   3021600 3022200 13      0       .       3021600 3022200 255,255,204
chr10   3022200 3022600 6       0       .       3022200 3022600 0,102,0
chr10   3022600 3033600 13      0       .       3022600 3033600 255,255,204
chr10   3033600 3034200 2       0       .       3033600 3034200 0,153,204
chr10   3034200 3034400 6       0       .       3034200 3034400 0,102,0
chr10   3034400 3036800 13      0       .       3034400 3036800 255,255,204
chr10   3036800 3037200 1       0       .       3036800 3037200 0,0,255
chr10   3037200 3040800 13      0       .       3037200 3040800 255,255,204

or alternative file:

chr10   0       3020800 E13
chr10   3020800 3021600 E16
chr10   3021600 3022200 E13
chr10   3022200 3022600 E6
chr10   3022600 3033600 E13
chr10   3033600 3034200 E2
chr10   3034200 3034400 E6
chr10   3034400 3036800 E13
chr10   3036800 3037200 E1
chr10   3037200 3040800 E13

Basically, what it says is that all the bins of size 200 from position 0 to 3020800 were assigned state 13 (~15,000 bins). I also have another file that gives the state PER bin, but it is an incredibly large file. That file simply looks like this:

cell_MB chr10
MaxState E
13
13
13
13
13
13
13
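
For readers unsure how the segment file and the per-bin file relate, here is a minimal Python sketch. It assumes the four-column layout shown above (chromosome, start, end, state), whitespace-separated, and a fixed 200 bp bin size. The function names are made up for illustration and are not part of ChromHMM.

```python
from collections import Counter

BIN_SIZE = 200  # assumed bin width in bp, as stated in the question


def parse_segments(lines):
    """Yield (chrom, start, end, state) from four-column segment lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, state = fields[:4]
        yield chrom, int(start), int(end), state


def bins_per_state(segments, bin_size=BIN_SIZE):
    """Count how many fixed-size bins each state covers."""
    counts = Counter()
    for _chrom, start, end, state in segments:
        counts[state] += (end - start) // bin_size
    return counts


# First four rows of the "alternative file" excerpt from the question.
example = """\
chr10   0       3020800 E13
chr10   3020800 3021600 E16
chr10   3021600 3022200 E13
chr10   3022200 3022600 E6
""".splitlines()

segments = list(parse_segments(example))
counts = bins_per_state(segments)
print(counts["E13"])  # number of 200 bp bins assigned to E13 in this excerpt
```

Working from the segment file this way avoids loading the very large per-bin file, since each segment already implies its bin count.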
13

What I want to do

Calculate the distance from each bin to the nearest gene and get the gene information. I will use DAVID to perform GO analysis.
Calculate the percentage of bins (for a fixed state, say E13) that are within 2 kb of a TSS.

It is the second bullet point that is more important to me.

Does anyone know of a tool or an R package that does this? As I mentioned, I have the data in three different formats, so any tool/package that accepts these files would be awesome.

I ask here because I have previously coded things from scratch in R, taking months, only to realize that an R package already existed that did exactly what I had coded.

GO-analysis • 1.8k views

updated 24 months ago by Ram 43k • written 9.2 years ago by Affan ▴ 300

Ram · Answer 1 · 2015-03-11

The R packages IRanges (basic interval arithmetic), GenomicRanges (genomic intervals and intersections), rtracklayer (parsing GFF, BED, ...), and biomaRt (retrieving annotated genomic regions) will be very useful. Links follow. You don't need to program anything from scratch; you just call existing functions, though you may struggle a bit with parsing your input.

For the first task, the function nearest finds the nearest feature in an IRanges/GRanges object for given coordinates.
For the second task, load the TSS annotations and the selected bins into GRanges objects and check for overlaps, e.g. with subsetByOverlaps or findOverlaps.
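
To make the logic behind both steps concrete, here is a plain-Python sketch. It does not use GenomicRanges; it only mirrors what a nearest-feature lookup and a windowed overlap check compute. The TSS coordinates and gene names are invented for illustration, all positions are assumed to lie on one chromosome, and distance is measured from the bin to a single TSS coordinate (0 if the TSS falls inside the bin).

```python
from bisect import bisect_left

BIN_SIZE = 200   # assumed bin width in bp
WINDOW = 2000    # "within 2 kb of a TSS"

# Hypothetical TSS positions on chr10 (invented, not real annotation),
# sorted by coordinate.
TSS = [(3021000, "geneA"), (3035000, "geneB"), (3050000, "geneC")]
TSS_POS = [pos for pos, _name in TSS]


def nearest_tss(start, end):
    """Return (distance, gene) for the TSS closest to the half-open bin [start, end)."""
    i = bisect_left(TSS_POS, start)
    best = None
    # Only the TSS just left of the bin and the first one at or right of
    # its start can be the nearest.
    for j in (i - 1, i):
        if 0 <= j < len(TSS):
            pos, name = TSS[j]
            if pos < start:
                dist = start - pos
            elif pos >= end:
                dist = pos - (end - 1)
            else:
                dist = 0  # TSS lies inside the bin
            if best is None or dist < best[0]:
                best = (dist, name)
    return best


def fraction_near_tss(segments, state, window=WINDOW, bin_size=BIN_SIZE):
    """Fraction of bins in `state` lying within `window` bp of any TSS."""
    total = near = 0
    for seg_start, seg_end, seg_state in segments:
        if seg_state != state:
            continue
        # Expand the segment into its fixed-size bins.
        for start in range(seg_start, seg_end, bin_size):
            total += 1
            dist, _gene = nearest_tss(start, start + bin_size)
            if dist <= window:
                near += 1
    return near / total if total else 0.0


# (start, end, state) rows taken from the chr10 excerpt in the question.
segments = [
    (3021600, 3022200, "E13"),
    (3022600, 3033600, "E13"),
    (3033600, 3034200, "E2"),
]
print(nearest_tss(3021600, 3021800))
print(fraction_near_tss(segments, "E13"))
```

With real data, the TSS list would come from an annotation source such as biomaRt, and in R the nearest and findOverlaps calls mentioned above would replace these hand-written loops.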