CD-HIT Clustering output
6.6 years ago
BSP • 0
@BSP12932

Hello,

I am currently analysing several metagenome and transcriptome datasets. For downstream analysis I am clustering these datasets with CD-HIT. Unfortunately, CD-HIT does not by default produce an overview of sequence abundance within each cluster (a simple tab-delimited file). I found an old script that extracts this information, but it relies on the .bak.clstr output, which was only available until version 4.3. Is there a way to produce this output in the newest version, or an alternative way to produce a simple overview of sequence abundance per cluster name?

Sincerely,

Felix

CD-HIT Clustering • 4.5k views

By sequence abundance do you mean how many times the duplicate sequence is found in the cluster?


Yes, exactly!

Quote from Prakki Rama on Biostar notifications@biostars.org:


Check if this is useful (assuming there are no empty lines between Cluster 0 and Cluster 1).

My input:

>Cluster 0
0       15679nt, >SB1234_Contig35475... at +/99.99%
1       15436nt, >SB1234_Contig35476... at +/99.62%
2       15764nt, >Contig18540... *
3       15438nt, >Contig39392... at +/99.69%
4       15679nt, >comp263440_c8_seq4... at -/99.99%
5       15667nt, >comp263440_c8_seq6... at -/99.99%
>Cluster 1
0       15684nt, >SB1234_Contig35474... at +/99.98%
1       15685nt, >Contig11682... *
2       15684nt, >comp263440_c8_seq3... at -/99.98%
3       15672nt, >comp263440_c8_seq5... at -/99.97%

My script

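#!/usr/bin/env perl
$TotalClusters=`grep -c '^>Cluster' cd-hit.test.txt`;
for($i=0;$i<$TotalClusters;$i++)
{
# Count only the member lines of cluster $i. Anchoring the header match with \s*$
# keeps "Cluster 1" from also matching "Cluster 10", and the last cluster needs no special case.
open(IN,'<','cd-hit.test.txt') or die "Cannot open cd-hit.test.txt: $!";
$n=0; $inCluster=0;
while(<IN>){ if(/^>/){$inCluster=/^>Cluster $i\s*$/?1:0;} elsif($inCluster && /\S/){$n++;} }
close(IN);
print "Cluster $i has $n sequences\n";
}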

OUTPUT

Cluster 0 has 6 sequences
Cluster 1 has 4 sequences

Is this what you wanted?
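
For the tab-delimited overview asked for in the question, a single pass over the .clstr file may be enough. What follows is only a minimal sketch in awk wrapped in a shell function, not an official CD-HIT tool. The function name cluster_sizes is made up for illustration, and the sketch assumes that every cluster starts with a line beginning with >Cluster and that every non-empty line after it is one member sequence.

```shell
#!/bin/sh
# cluster_sizes (hypothetical helper name): print one line per cluster as
#   <cluster name> TAB <number of member sequences>
# Assumes the .clstr layout shown above: a ">Cluster N" header, then member lines.
cluster_sizes() {
    awk '
        /^>Cluster/ {                       # header: flush the previous cluster
            if (name != "") print name "\t" n
            name = substr($0, 2)            # drop the leading ">"
            n = 0
            next
        }
        NF { n++ }                          # any non-empty line is a member
        END { if (name != "") print name "\t" n }
    ' "$@"
}
```

After sourcing the function, calling cluster_sizes cd-hit.test.txt > cluster_sizes.tsv on the input above should give 6 for Cluster 0 and 4 for Cluster 1, with the file read only once regardless of how many clusters it contains.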


Dear Prakki, I wonder whether you have a solution for the case where each cluster contains sequences from different individuals, like this:

>Cluster 0
0       15679nt, >SpecA_Contig35475... at +/99.99%
1       15436nt, >SpecA_Contig35476... at +/99.62%
2       15764nt, >SpecB_Contig18540... *
3       15438nt, >SpecA_Contig39392... at +/99.69%
4       15679nt, >SpecC_comp263440_c8_seq4... at -/99.99%
>Cluster 1
0       15684nt, >SpecC_SB1234_Contig35474... at +/99.98%
1       15685nt, >SpecC_Contig11682... *
>Cluster 2
0       15684nt, >SpecA_comp263440_c8_seq3... at -/99.98%
1       15672nt, >SpecB_comp263440_c8_seq5... at -/99.97%

I would like to find out how many clusters contain only one, two, or all three species (all three being the case in Cluster 0).

Thanks a lot!


The above script should serve your purpose!
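
Note that the script posted earlier reports the number of sequences per cluster, so the species question probably needs one more step. One possible sketch is below. It assumes, as in the example, that the species label is the part of the sequence name before the first underscore (SpecA, SpecB, SpecC) and that the name is the third whitespace-separated field of each member line. The function name species_histogram is made up for illustration.

```shell
#!/bin/sh
# species_histogram (hypothetical helper name): for each number of distinct
# species k, print "k TAB number of clusters that contain exactly k species".
# Assumption: species label = sequence name up to the first "_".
species_histogram() {
    awk '
        /^>Cluster/ {                       # new cluster: record the previous one
            if (started) hist[k]++
            started = 1
            k = 0
            split("", seen)                 # portable way to empty an array
            next
        }
        NF >= 3 {
            sp = $3
            sub(/^>/, "", sp)               # strip the leading ">"
            sub(/_.*/, "", sp)              # keep the text before the first "_"
            if (!(sp in seen)) { seen[sp] = 1; k++ }
        }
        END {
            if (started) hist[k]++          # record the last cluster
            for (c in hist) print c "\t" hist[c]
        }
    ' "$@" | sort -n
}
```

For the three-cluster example above this should report one cluster each with 1, 2, and 3 species. Names without an underscore would be treated as their own species, so the prefix rule needs adjusting if your sequence names are built differently.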


Is this supposed to run in bash? I get the following output...

./cdhit_parser.sh: line 2: =291: command not found

./cdhit_parser.sh: line 3: syntax error near unexpected token `('

./cdhit_parser.sh: line 3: `for($i=0;$i<$TotalClusters;$i++)'
