Subsampling a taxa abundance matrix
5.0 years ago
I have a taxonomic abundance matrix from a large number of shotgun metagenomes whose sequencing depth ranges from 5 million to 99 million reads. Here are the raw abundance data for 4 test samples.

Sample_ID total_sequences Escherichia Pseudomonas Bacillus Salmonella   Yersinia  Klebsiella
sample1   13,000,000 8    13   6    13   32    0     28
sample2   60,000,000 31  25   0      0   25   19      0
sample3    5,000,000 0    0   9     51    0     0    40
sample4   99,000,000 27   19  0     0    22   32      0

I want to subsample this raw abundance matrix to 5 million reads and get a new, subsampled abundance matrix. I thought about taking the first 5 million reads, or 5 million randomly selected reads, from each sample using Heng Li's seqtk and then running taxonomic profiling on those reads. But rerunning so many metagenomes is time-consuming, so I would rather not do that. Can I instead calculate a revised taxonomic abundance at 5 million reads for each sample from the matrix I already have, using this simple calculation?

revised count = raw count/total sequences * 5,000,000
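
To make the calculation concrete, here is a minimal Python sketch of that formula. The function name `rescale_counts` and the dictionary layout are hypothetical, and the taxon-to-count pairing uses only the first two count columns of the table for illustration. Note that this is a deterministic proportional scaling, not a random draw of reads, so the revised counts are generally fractional (for example, 8 out of 13,000,000 becomes about 3.08).

```python
TARGET_DEPTH = 5_000_000  # depth to normalise every sample to


def rescale_counts(counts, total_sequences, target_depth=TARGET_DEPTH):
    """Apply revised = raw / total_sequences * target_depth to each taxon.

    `counts` maps taxon name -> raw count (hypothetical layout).
    The multiplication is done first; algebraically it is the same formula.
    """
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    return {
        taxon: raw * target_depth / total_sequences
        for taxon, raw in counts.items()
    }


# Illustrative values: first two count columns of sample1 and sample3.
sample1 = rescale_counts({"Escherichia": 8, "Pseudomonas": 13}, 13_000_000)
sample3 = rescale_counts({"Escherichia": 0, "Pseudomonas": 0}, 5_000_000)

for name, revised in (("sample1", sample1), ("sample3", sample3)):
    print(name, {taxon: round(value, 2) for taxon, value in revised.items()})
```

A sample that already has exactly 5 million reads (sample3 here) keeps its raw counts unchanged under this formula.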


