Question

Reads per COG normalization

0

5.0 years ago

Diego Morais • 0

Hello everyone,

I aligned a metagenome sample using the eggnog-mapper software and processed the output to get the number of reads mapped to each COG (Cluster of Orthologous Groups).

However, some COGs have a very high number of reads mapped. This happens for COGs that include many genes and species, e.g. COG0001 contains 4869 genes in 3358 species (https://version-11-0.string-db.org/cgi/network.pl?networkId=t67BvacQZjAN).

Is there some way to normalize the number of reads mapped to each COG, taking this information into account?

alignment COG metagenomics • 1.4k views

ADD COMMENT • link updated 4.8 years ago by Mensur Dlakic ★ 27k • written 5.0 years ago by Diego Morais • 0

0

There are a lot of ways to normalize the data; the real question is why you want to normalize it.

ADD REPLY • link 4.8 years ago by Asaf 10k

0

Hi Asaf,

The question is: "Given a metagenomic sample, which COGs are enriched in this sample?"

I intend to run an enrichment analysis of the COGs found in a sample, using the number of hits to each COG (or some normalization of it) together with the STRING COG-COG interaction data.

ADD REPLY • link 4.8 years ago by Diego Morais • 0

1

You can only compare COGs between different sets of samples; in that case, just run DESeq2 or edgeR on your count table and they will do the normalization. I would recommend using a set of single-copy proteins for normalization, as in MUSiCC.
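To illustrate the single-copy idea, here is a minimal Python sketch. It is not MUSiCC itself and does not use its API; it only shows the general principle of dividing every COG count by the median count of a few presumed single-copy marker COGs, so that values read roughly as "copies per genome". The function name, the read counts, and the choice of marker IDs are all assumptions made for the example; substitute a vetted marker list and your own count table.

```python
from statistics import median


def normalize_by_single_copy(counts, marker_cogs):
    """Divide each COG's read count by the median count of the marker COGs.

    counts      -- dict mapping COG ID -> number of mapped reads
    marker_cogs -- COG IDs assumed to be single-copy in (almost) every genome
    """
    marker_counts = [counts[c] for c in marker_cogs if c in counts]
    if not marker_counts:
        raise ValueError("none of the marker COGs are present in the count table")
    scale = median(marker_counts)
    if scale == 0:
        raise ValueError("median marker count is zero; cannot normalize")
    return {cog: n / scale for cog, n in counts.items()}


# Hypothetical read counts for one sample (not real data).
counts = {
    "COG0001": 150,
    "COG0500": 1200,
    "COG0012": 100,
    "COG0016": 120,
    "COG0048": 80,
}
# Placeholder marker list; replace with a curated set of single-copy COGs.
markers = ["COG0012", "COG0016", "COG0048"]

normalized = normalize_by_single_copy(counts, markers)
for cog, value in sorted(normalized.items()):
    print(f"{cog}\t{value:.2f}")
```

The median is used rather than the mean so that a single oddly covered marker does not dominate the scaling factor. Gene length is ignored here; if read counts are not already length-corrected, that would need to be handled separately.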

ADD REPLY • link 4.8 years ago by Asaf 10k

score 1 · Answer 1 · 2019-06-26

Is there some way to normalize the number of reads mapped to each COG, taking this information into account?

For COGs representing common protein functions, there will be multiple hits within a genome. If you look at COG0500, right next to the one you pointed out, it is annotated as SAM-dependent methyltransferase (60407 genes in 4956 species). Taken at face value, this means that on average there are more than 10 hits per genome to this COG (60407 / 4956 ≈ 12). It doesn't necessarily mean that there are 10+ methyltransferases (MTRs) that are actual members of COG0500, but it does mean that there are 10+ MTRs of some kind that match COG0500 to some degree. Without knowing the exact number of potential matches to each COG in each individual genome, I don't think one can use this information for accurate normalization. It may work as a rough estimate, though.
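As a sketch of that rough estimate, the Python snippet below divides each COG's read count by its average number of genes per species, using the gene and species totals quoted in this thread for COG0001 and COG0500. The read counts and function names are made up for illustration, and the caveat above still applies: the average copy number across the database is not the copy number in the genomes actually present in the sample.

```python
# (genes, species) totals as quoted in this thread.
cog_sizes = {
    "COG0001": (4869, 3358),
    "COG0500": (60407, 4956),
}

# Hypothetical read counts for one sample (not real data).
read_counts = {"COG0001": 300, "COG0500": 2400}


def copies_per_genome(genes, species):
    """Average number of member genes per species for a COG."""
    return genes / species


def rough_normalize(read_counts, cog_sizes):
    """Divide each read count by the COG's average copies per genome."""
    adjusted = {}
    for cog, reads in read_counts.items():
        genes, species = cog_sizes[cog]
        adjusted[cog] = reads / copies_per_genome(genes, species)
    return adjusted


adjusted = rough_normalize(read_counts, cog_sizes)
for cog in sorted(adjusted):
    avg = copies_per_genome(*cog_sizes[cog])
    print(f"{cog}\tavg copies/genome={avg:.2f}\tadjusted reads={adjusted[cog]:.1f}")
```

With these inputs, the large raw difference between the two COGs shrinks considerably after adjustment, which is the effect the question was after, but the result should be read as a coarse correction only.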