I am a medical student with minimal bioinformatics/mathematics background trying to understand a paper that requires a decent amount of both. The paper looks at the fecal microbiomes of children with and without kwashiorkor, and I am having trouble understanding how the authors processed their data. Here is the passage describing what they did:
"DNA prepared from fecal samples was subjected to multiplex shotgun pyrosequencing, and the resulting reads were annotated by comparison to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and to a database of 462 sequenced human gut microbes (table S3). To visualize the variation in this data set, we used principal coordinates analysis of Hellinger distances computed from the KEGG enzyme commission number (EC) content of fecal microbiomes (Fig. 1A and fig. S1). Principal coordinate 1 (PC1), which explained the largest amount of variation, was strongly associated with age and family membership (Fig. 1A)."
(from: Gut Microbiomes of Malawian Twin Pairs Discordant for Kwashiorkor; Smith et al., 2013. PMID:23363771)
My understanding is that Hellinger distances are a way of telling you how different datasets are from one another, and that principal coordinates analysis (PCoA) is a means of reducing complex datasets down to single values for easier comparison.
Basically, what I'd like to know is: in conceptual/layman's terms (as much as possible), how are these calculated? I've looked at the mathematical explanations and am just not making the leap to a conceptual understanding.
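For concreteness, here is a minimal sketch of the two steps as I currently understand them from the formulas, so that anyone answering can point out where it goes wrong. Several things in it are assumptions rather than anything taken from the paper: the toy count table is invented, the Hellinger distance is written without the 1/sqrt(2) factor that some definitions include, and the first principal coordinate is found by simple power iteration as a stand-in for the full eigendecomposition a real statistics package would use. The function names are hypothetical.

```python
import math

def hellinger(counts_a, counts_b):
    """Hellinger distance between two samples given as raw counts per category."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    # Convert counts to proportions, take square roots, then measure the
    # ordinary straight-line (Euclidean) distance between the two profiles.
    return math.sqrt(sum(
        (math.sqrt(a / total_a) - math.sqrt(b / total_b)) ** 2
        for a, b in zip(counts_a, counts_b)
    ))

def double_center(dist):
    """Turn a distance matrix into the centered matrix that PCoA decomposes."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # The matrix is symmetric, so column means equal row means.
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def first_principal_coordinate(dist, iterations=500):
    """Return each sample's position on PC1 and the fraction of variation PC1 explains."""
    gram = double_center(dist)
    n = len(gram)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting guess
    for _ in range(iterations):
        # Power iteration: repeatedly multiply and rescale to find the top eigenvector.
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    coords = [x * math.sqrt(eigenvalue) for x in vec]
    # The trace equals the sum of all eigenvalues (total variation), which is
    # valid here because Hellinger distances never yield negative eigenvalues.
    explained = eigenvalue / sum(gram[i][i] for i in range(n))
    return coords, explained

# Invented toy table: 4 samples (rows) x 3 enzyme categories (columns).
samples = [
    [10, 0, 0],
    [0, 10, 0],
    [5, 5, 0],
    [9, 1, 0],
]
dist = [[hellinger(a, b) for b in samples] for a in samples]
pc1, explained = first_principal_coordinate(dist)
```

If this is right, `pc1` holds one number per sample (its position along the axis of greatest variation) and `explained` is the share of total variation that axis captures, which I take to be what "PC1 explained the largest amount of variation" refers to.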
Thank you so much to anyone who can explain (or let me know if there's a better place to ask this question)!