Question: Question About Hellinger Distances And Principal Coordinates Analysis
1
8 months ago
bioinformaticshelp • 10

I am a medical student with minimal bioinformatics/mathematics background trying to understand a paper that requires a decent amount of both. The paper is looking at fecal microbiomes of children with or without kwashiorkor, and I am having trouble understanding how they processed their data. Here is the part describing what they did –

"DNA prepared from fecal samples was subjected to multiplex shotgun pyrosequencing, and the resulting reads were annotated by comparison to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and to a database of 462 sequenced human gut microbes (table S3). To visualize the variation in this data set, we used principal coordinates analysis of Hellinger distances computed from the KEGG enzyme commission number (EC) content of fecal microbiomes (Fig. 1A and fig. S1). Principal coordinate 1 (PC1), which explained the largest amount of variation, was strongly associated with age and family membership (Fig. 1A)."

(from: Gut Microbiomes of Malawian Twin Pairs Discordant for Kwashiorkor; Smith et al., 2013. PMID:23363771)

My understanding is that Hellinger distances are a way of telling you how different datasets are from one another, and that PCA is a means of reducing complex datasets down to single values for easier comparison.

Basically, what I'd like to know is: in conceptual/layman's terms (as much as possible), how are these calculated? I've looked at the mathematical explanations and am just not making the leap to a conceptual understanding.

Thank you so much to anyone who can explain (or let me know if there's a better place to ask this question)!

5.8 years ago bioinformaticshelp • 10 • updated 8 months ago Biostar 20
5
5.8 years ago
xb • 400
Chapel Hill, NC,USA

"Principal Coordinates Analysis" is the key here, usually abbreviated as PCoA, whereas PCA normally stands for "Principal Component Analysis". The two approaches share the same spirit: both reduce the number of dimensions of a complex dataset for exploration and visualization. Here is a nice summary to read.

PCoA takes a pairwise distance/dissimilarity matrix as input, and any distance function can be chosen. It is very useful when one can measure the distance between every possible pair of objects but has little information about the raw values of the objects, or when such values are not easy to represent in simple numeric form, as in the paper you cited.

According to the paper's supplementary material, these are the "conceptual" steps:

  1. Compute the normalized EC counts for all samples;
  2. Calculate the Hellinger distance between all pairs of samples;
  3. Conduct the PCoA given the Hellinger distance matrix.
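
To make the three steps above a bit more concrete, here is a minimal Python sketch using NumPy. It is not the paper's code: the count table is made-up toy data, the function names `hellinger_distance_matrix` and `pcoa` are my own, and I am assuming that "normalized" means converting each sample to proportions and that the Hellinger distance carries the 1/sqrt(2) scaling (some ecology packages omit that factor). Conceptually, the Hellinger distance is just the ordinary straight-line distance between samples after taking the square root of each proportion, and PCoA then finds coordinates for the samples whose straight-line distances reproduce the distance matrix as well as possible.

```python
import numpy as np

def hellinger_distance_matrix(counts):
    """Pairwise Hellinger distances between the rows (samples) of a count table."""
    counts = np.asarray(counts, dtype=float)
    # Step 1: normalize each sample so that its counts sum to 1 (assumed normalization).
    props = counts / counts.sum(axis=1, keepdims=True)
    # Step 2: Euclidean distance between square-rooted proportions,
    # scaled by 1/sqrt(2) so that the distance lies between 0 and 1.
    roots = np.sqrt(props)
    diff = roots[:, None, :] - roots[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2) / 2.0)

def pcoa(dist, n_axes=2):
    """Classical PCoA: coordinates whose Euclidean distances approximate `dist`."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to get an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Eigen-decompose and sort axes from largest to smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Tiny negative eigenvalues from rounding are treated as zero.
    positive = np.clip(eigvals, 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(positive[:n_axes])
    explained = positive[:n_axes] / positive.sum()
    return coords, explained

# Toy data (hypothetical): 4 samples (rows) x 3 EC categories (columns).
counts = np.array([
    [10, 0, 0],
    [0, 10, 0],
    [5, 5, 0],
    [10, 0, 0],
])

dist = hellinger_distance_matrix(counts)    # Step 2
coords, explained = pcoa(dist, n_axes=2)    # Step 3
print(dist.round(3))
print(coords.round(3))
print(explained.round(3))
```

Each row of `coords` is one sample, the first column is PC1, and `explained` is the fraction of the total variation captured by each axis, which is what a statement like "PC1 explained the largest amount of variation" refers to.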

Also helpful to read,

Ramette's review of common multivariate methods (PCoA, PCA, etc.) http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2121141

And if you use R,

Hellinger distance - an implementation in R http://www.inside-r.org/packages/cran/distrEx/docs/HellingerDist

PCoA in R http://www.inside-r.org/packages/cran/ape/docs/pcoa

5.8 years ago xb • 400 • updated 5.8 years ago Neilfws 48k
0

Very good answer; I didn't even notice that the user was asking about PCoA instead of PCA.

5.8 years ago
Giovanni M Dall'Olio
26k
0

Thank you for your help! Hopefully this won't be a nonsensical follow-up question – are principal coordinates themselves meaningful values? Does a principal coordinate of, say, 0.2 mean anything conceptually, or is it essentially an arbitrary number that is only meaningful in relation to the principal coordinates assigned to other objects?

5.8 years ago
bioinformaticshelp
• 10
0

Hi! Those are relative values, comparable only within the same set/matrix of principal coordinates. Usually, a _scatter plot_, a _histogram_, a _boxplot_ or simply a summary function (in _R_) can help to display the range of the coordinates and thus help to identify whether a particular coordinate is low or high within the population in question.
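
A small Python illustration of why the numbers are only relative, using made-up coordinates (the values and the helper `pairwise_distances` are hypothetical, not taken from the paper): flipping the sign of an axis or shifting every sample by the same amount changes each individual coordinate but leaves all between-sample distances untouched, and those distances are the only thing PCoA tries to preserve.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance between every pair of rows."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical PC1/PC2 coordinates for three samples.
coords = np.array([
    [0.2, -0.1],
    [-0.3, 0.4],
    [0.1, -0.3],
])

flipped = coords * np.array([-1.0, 1.0])  # mirror PC1: 0.2 becomes -0.2
shifted = coords + 5.0                    # move every sample by the same offset

same_after_flip = np.allclose(pairwise_distances(flipped), pairwise_distances(coords))
same_after_shift = np.allclose(pairwise_distances(shifted), pairwise_distances(coords))
print(same_after_flip, same_after_shift)
```

So a value of 0.2 on PC1 says nothing by itself; what matters is where it sits relative to the other samples on the same axis.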

5.8 years ago
xb
• 400
0

Really appreciate your help – thank you so much!

5.8 years ago
bioinformaticshelp
• 10
