Is it possible to acquire genomic region data in rows-and-columns format for machine learning purposes?
7.4 years ago

Hello. This might be a very naive approach, but I'm just curious:

Is it possible to make a file in a BED-like format in which each row represents a region of the genome and the columns specify the length, the GC content, the DNA methylation, the nucleosome occupancy, the nucleosome repeat length, and as many other features as one could gather for each region? (If anyone could suggest other features, that would be great.)

I ask this with the intention of trying to apply machine learning algorithms to try to predict genomic features.

If it is possible, how would I go about collecting the data from different GEO datasets so that each data point I collect is aligned with the region of DNA that it is supposed to characterise?

Furthermore, would it matter if separate rows documented overlapping DNA regions?

genome • 1.2k views

I recently did this for an enhancer prediction tool I developed. The hardest part of machine learning is the data cleaning; the actual implementation of the algorithm is the easiest part.

If you're going to be mixing and matching datasets from GEO, then you're in for a hard time: nothing will give you the information you want neatly aligned the way you want it. You'll be doing a lot of joins, and cutting and pasting columns. I suggest you keep VERY detailed documentation of this. If you mess up the data cleaning step, your entire model is worse than worthless, because you will come to incorrect conclusions. It IS possible, though; it'll just be a lot of work on your end. And without knowing the specific genomic regions you want to target, I can't suggest any more genomic features.
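
To make the "joins" concrete, here is a minimal sketch of building one row per region with a few feature columns. Everything in it is an assumption for illustration: the toy sequence, the region list, the methylation intervals and the helper names (`gc_content`, `mean_signal`, `build_feature_rows`, `to_tsv`) are all made up, and the overlap-weighted mean is just one possible way to summarise a signal over a region. Real data would come from a FASTA file and from processed GEO tracks, and would need consistent genome builds and coordinate conventions first.

```python
import csv
import io

# Hypothetical toy inputs (not real data).
GENOME = {"chr1": "ACGTGGCCAATTGGCCGCGCATATATGCGC"}

# BED-style regions: chrom, start (0-based), end (exclusive).
REGIONS = [
    ("chr1", 0, 10),
    ("chr1", 5, 20),   # deliberately overlaps the first region
    ("chr1", 20, 30),
]

# Hypothetical signal track: chrom, start, end, value.
METHYLATION = [
    ("chr1", 2, 4, 0.8),
    ("chr1", 8, 12, 0.4),
    ("chr1", 22, 26, 0.1),
]


def gc_content(seq):
    """Fraction of G/C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_signal(region, intervals):
    """Overlap-weighted mean of interval values inside region.

    Returns None when no interval overlaps the region, so missing
    data stays visible instead of silently becoming 0.
    """
    chrom, start, end = region
    total, covered = 0.0, 0
    for i_chrom, i_start, i_end, value in intervals:
        if i_chrom != chrom:
            continue
        overlap = min(end, i_end) - max(start, i_start)
        if overlap > 0:
            total += value * overlap
            covered += overlap
    return round(total / covered, 3) if covered else None


def build_feature_rows(regions, genome, methylation):
    """One dict per region: coordinates plus computed feature columns."""
    rows = []
    for chrom, start, end in regions:
        seq = genome[chrom][start:end]
        rows.append({
            "chrom": chrom,
            "start": start,
            "end": end,
            "length": end - start,
            "gc_content": round(gc_content(seq), 3),
            "mean_methylation": mean_signal((chrom, start, end), methylation),
        })
    return rows


def to_tsv(rows):
    """Serialise the rows as a tab-separated table with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = build_feature_rows(REGIONS, GENOME, METHYLATION)
print(to_tsv(rows))
```

Each new dataset becomes another column computed against the same region coordinates, which is why documenting the source and processing of every column matters so much.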

7.4 years ago

Data such as those you describe can be obtained using the UCSC Table Browser.
