Question

Clustering of CNV genomic coordinates, take 2

6.2 years ago

Sakti ▴ 510

Dear Biostars,

After searching the internet for quite a while, I have yet to find an easy solution for clustering human genomic coordinates. An earlier post asked the same question a couple of years ago, but it got no answer on how one could simply cluster a bed file, graph it in IGV (or any of your favorite genome browsers), and make it look like this figure.

Here's the breakdown of the problem at hand:

Data type: Human CNV data detected by both array and sequencing. The output of these analyses is a .bed file with the CNV positions, with columns like these:

chr    start    end    cnv_id    sample_name    sample_category

Clustering type: anything goes, from k-means to any other unsupervised method.

Question: Are there samples that preferentially cluster together because they share very similar CNV positions? Is this clustering of CNVs meaningful given the sample category (i.e. sick vs normal)?

I have read about CNVTools, which to my understanding needs probe intensities; I could never get iCluster to work; IGVTools doesn't have a clustering function; I'm unsure whether seqMINER or any other TSS/ChIP clustering algorithm will work with longer stretches of DNA sequence; and everything I have read about clustering methods in R revolves around single genes/values rather than genomic coordinates.

That is why I appeal to the Biostars wisdom once more. I'd be grateful if someone could recommend a solution to this problem.

Thanks!

Sakti

cluster analysis genomic coordinates cnv bed • 2.1k views

updated 6.2 years ago by Sean Davis 26k • written 6.2 years ago by Sakti ▴ 510

What data are you trying to cluster? What is the assay and what is the question you want to answer? Are you dealing with copy number data, or something else? Sequence-based, or array?

6.2 years ago by Sean Davis 26k

Hi Sean, thanks for commenting. I have updated the post with the answers to your questions.

6.2 years ago by Sakti ▴ 510

score 1 · Answer 1 · 2018-01-30

6.2 years ago

Sean Davis 26k

There is no general approach to dealing with these types of data that I know of, and you seem to be asking multiple questions of your data. That said, one approach you might find useful is to define a set of genomic "bins" across the genome and then build a matrix of SAMPLE x BIN. Each cell of the matrix holds TRUE (or 1) if the sample has a CNV that overlaps that genomic region, and FALSE (or 0) otherwise. Tools like bedtools or GenomicRanges might help with that task.

From there, more standard matrix-based approaches are available for clustering and statistical testing.
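As a rough illustration of the binning idea, here is a minimal Python sketch using only the standard library. It is not the answerer's code: the 1 Mb bin width, the helper names (`read_bed`, `sample_bins`, `to_matrix`, `jaccard_distance`), the example rows, and the choice of Jaccard distance are all assumptions made for illustration. It assumes whitespace-delimited input with the six columns from the question and BED-style 0-based, half-open intervals.

```python
from collections import defaultdict
from itertools import combinations

BIN_SIZE = 1_000_000  # assumed bin width in bp; the thread does not specify one


def read_bed(lines):
    """Yield (chrom, start, end, sample) from whitespace-delimited BED-like lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines, short lines, and a header row (non-numeric start).
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        chrom, start, end, _cnv_id, sample = fields[:5]
        yield chrom, int(start), int(end), sample


def sample_bins(records, bin_size=BIN_SIZE):
    """Map each sample to the set of (chrom, bin_index) bins its CNVs overlap."""
    bins = defaultdict(set)
    for chrom, start, end, sample in records:
        # Half-open interval: the last covered base is end - 1.
        first = start // bin_size
        last = (max(end, start + 1) - 1) // bin_size
        for index in range(first, last + 1):
            bins[sample].add((chrom, index))
    return bins


def to_matrix(bins):
    """Return (samples, columns, matrix) where matrix is SAMPLE x BIN with 0/1 cells."""
    samples = sorted(bins)
    columns = sorted(set().union(*bins.values()))
    matrix = [[1 if col in bins[s] else 0 for col in columns] for s in samples]
    return samples, columns, matrix


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Made-up example rows in the column layout from the question.
example = """chr start end cnv_id sample_name sample_category
chr1 100000 2500000 cnv1 S1 sick
chr1 1200000 1800000 cnv2 S2 sick
chr2 500000 900000 cnv3 S3 normal
"""

bins = sample_bins(read_bed(example.splitlines()))
samples, columns, matrix = to_matrix(bins)
distances = {
    (a, b): jaccard_distance(bins[a], bins[b])
    for a, b in combinations(samples, 2)
}
```

The resulting 0/1 matrix or the pairwise distances could then be handed to a standard clustering routine (for example, hierarchical clustering), and the clusters compared against `sample_category`. Only bins hit by at least one CNV become columns here, which keeps the matrix small; the bin width controls how close two CNVs must be to count as shared.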

ADD COMMENT • link 6.2 years ago by Sean Davis 26k

Thanks a lot Sean! I was pondering the genomic bins solution, which seems to be what will work in the end for my data. Thanks!!

6.2 years ago by Sakti ▴ 510