Question

Data integration of different results


9.1 years ago

int11ap1 ▴ 470

We have processed different Hi-C, RNA-Seq and ChIP-Seq datasets. Now, with the results, we would like to:

integrate all these data into a database for further queries.
perform a statistical analysis (or something similar) that takes a sequence and returns a result like "the sequence might be transcriptionally active or not". Maybe a machine-learning method using our data? In Python?

Do you have any thoughts, ideas or papers that I could read?

integration • 2.4k views

updated 22 months ago by Ram 43k • written 9.1 years ago by int11ap1 ▴ 470

Ram · Answer 1 · 2015-03-27

The field of data integration is still extremely young - even the definition of what counts as 'integration' is not all that standardized.

Some people will show multiple results from the same assay type (calculated independently) on the same chart, and call that integration.

Some people will show multiple results from different assay types (but still calculated independently) on the same chart, and call that integration.

Some people will show a result, calculated from multiple different assay types dependently, on a chart or charts and call that integration.

Personally I would call the first an extensive analysis, the second a comparative analysis, and only the third an integrative analysis - but too often people say they are doing integrated analysis when they really mean 'looking at RNA expression vs ChIP-Seq signal on the same scatter plot'. In that case the data were not integrated, they were compared.

So, to address the question of integration, there are only two methods that I know of. The first is ChromHMM and similar tools, which calculate 'chromatin states' via machine learning, as you mentioned. The Roadmap project recently published several papers in which ChromHMM featured prominently. Personally, I have my doubts about how much useful information is learned from such methods - but I think I am very much in the minority here on Biostars regarding that. Many find it a useful abstraction for further, more comprehensive analysis.

The second is looking for correlation between phenotype-inducing regions of the genome (SNPs, GWAS) and regions of the genome highlighted as special by ChIP-Seq/RNA-Seq/etc. Although this is strictly a multi-dimensional comparison, no more 'integrated' than a scatterplot, if the result is an array of correlations which could be plotted against another set of data, it could be considered integrated.
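As a rough illustration of that second approach, here is a minimal Python sketch that asks what fraction of a set of SNPs falls inside called peaks. The coordinates, the function name `fraction_in_peaks` and the half-open interval convention are all assumptions made up for the example, not output from any real dataset or tool.

```python
def fraction_in_peaks(snps, peaks):
    """Return the fraction of SNPs that fall inside at least one peak.

    snps  : list of (chrom, position) tuples
    peaks : list of (chrom, start, end) tuples, half-open [start, end)
    """
    if not snps:
        return 0.0
    hits = sum(
        any(chrom == p_chrom and p_start <= pos < p_end
            for p_chrom, p_start, p_end in peaks)
        for chrom, pos in snps
    )
    return hits / len(snps)


# Toy, made-up coordinates purely for illustration.
toy_snps = [("chr1", 150), ("chr1", 900), ("chr2", 50), ("chr2", 400)]
toy_peaks = [("chr1", 100, 200), ("chr2", 0, 100)]

overlap_fraction = fraction_in_peaks(toy_snps, toy_peaks)
print(overlap_fraction)
```

A single number like this is still a comparison. It only starts to look like integration once you compute it per region or per assay and feed the resulting array into a further analysis. For genome-scale data you would want an interval tree or a sorted sweep rather than this quadratic scan.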

I think we're some way away from a truly integrated analysis - where the data from multiple experiments come together to give a new dataset that is more than the sum of its parts - and the reason for that is that many labs still struggle with data comparison. There are so many data formats, and so many nuances between the different programs, that huge mistakes are easy to fall into. For example, the conversion of sequencing data (SAM) to signal (wig/bed) is very domain specific. Each biological assay has several tools which the field knows about to convert its reads into signal - but these tools, eager to separate themselves from other tools in their field, end up being wildly different and ultimately incompatible between assay types, even if the output format is similar.

For example, in ChIP-Seq we have an excellent tool in the deepTools package called bamCoverage, which defines signal as the whole fragment between pairs (and has some other nice tricks for singletons). It also gives the ability to normalise based on read count.

RNA-Seq read-to-signal tools like TopHat define signal very differently. Signal is assigned only to the area under the read/mate, or to a transcript/exon, with normalization to transcript length and total reads.
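To make the difference in normalisation concrete, here is a small Python sketch of the two generic formulas: depth-only scaling (counts per million) and length-plus-depth scaling (RPKM). These are the textbook definitions, not the exact code paths of bamCoverage or any RNA-Seq tool, and the read counts are made up.

```python
def counts_per_million(count, total_reads):
    """Depth-only normalisation: reads in a region per million mapped reads."""
    return count * 1e6 / total_reads


def rpkm(count, feature_length_bp, total_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return count * 1e9 / (feature_length_bp * total_reads)


# Made-up numbers: 500 reads on a 2 kb feature in a 20 million read library.
total_reads = 20_000_000
cpm_value = counts_per_million(500, total_reads)
rpkm_value = rpkm(500, 2000, total_reads)

print(cpm_value, rpkm_value)
```

The same 500 reads give two different numbers depending on which convention is used, and that is before you account for fragment extension or how each tool decides which bases a read covers.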

So why isn't it easy to compare the two?

Because they have different total amounts of signal and different dynamic ranges. As a result, comparisons between the two will almost certainly violate the assumptions of any statistical test (which is why ChromHMM binarizes all inputs to a 0 or a 1).
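The idea behind binarizing can be sketched in a few lines of Python. This is only a toy: the fixed thresholds and signal values are invented for illustration, and this is not how ChromHMM itself chooses its cut-offs.

```python
def binarize(signal, threshold):
    """Turn a per-bin signal track into presence/absence calls (1 or 0)."""
    return [1 if value >= threshold else 0 for value in signal]


# Two made-up tracks over the same five genomic bins, on very different scales.
chip_signal = [0.0, 3.5, 12.0, 40.0, 1.0]
rna_signal = [0.0, 250.0, 9000.0, 15.0, 0.0]

# Hypothetical per-assay thresholds, chosen by hand for this example.
chip_bits = binarize(chip_signal, threshold=5.0)
rna_bits = binarize(rna_signal, threshold=100.0)

# After binarizing, each bin is described by a comparable tuple of marks.
combined = list(zip(chip_bits, rna_bits))
print(combined)
```

Once every assay is reduced to 0/1 per bin, the differences in total signal and dynamic range disappear, at the cost of throwing away the quantitative information.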

So I hope that helps in some small way to show the difficulty in doing a comparative analysis, let alone an integrated one. Databases will certainly help, since a database schema requires a well-defined and structured data format. Personally I recommend PostgreSQL for genome-wide data, and Neo4j for feature data - but then again I'm a little biased since my PhD is all about doing exactly that with all of the data you mentioned above (+ DNase and WGBS ... somehow...)
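To give a feel for what "a schema forces structure" means, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL, so it runs without a server. The table name `signal`, its columns and the rows are all hypothetical. A real genome-wide schema would need more thought about assemblies, samples and indexing.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE signal (
        assay     TEXT    NOT NULL,  -- e.g. 'ChIP-Seq', 'RNA-Seq'
        chrom     TEXT    NOT NULL,
        start_pos INTEGER NOT NULL,  -- half-open interval [start_pos, end_pos)
        end_pos   INTEGER NOT NULL,
        value     REAL    NOT NULL
    )
    """
)
conn.execute("CREATE INDEX idx_signal_region ON signal (chrom, start_pos, end_pos)")

# Made-up rows purely for illustration.
conn.executemany(
    "INSERT INTO signal VALUES (?, ?, ?, ?, ?)",
    [
        ("ChIP-Seq", "chr1", 100, 200, 12.0),
        ("RNA-Seq", "chr1", 150, 400, 250.0),
        ("ChIP-Seq", "chr2", 0, 100, 3.5),
    ],
)


def signal_in_region(connection, chrom, start, end):
    """Return every stored interval, from any assay, overlapping [start, end)."""
    cursor = connection.execute(
        "SELECT assay, start_pos, end_pos, value FROM signal "
        "WHERE chrom = ? AND start_pos < ? AND end_pos > ? "
        "ORDER BY assay, start_pos",
        (chrom, end, start),
    )
    return cursor.fetchall()


region_rows = signal_in_region(conn, "chr1", 120, 180)
print(region_rows)
```

Forcing every assay into the same (chrom, start, end, value) shape is exactly the step where the format and normalisation differences described above have to be confronted.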

Good luck! And if you have any success I'd love to hear about it :)