Question

Clustering PacBio CCS reads

7.0 years ago

lelle ▴ 830

I have a mock metabarcoding dataset that was sequenced with PacBio. A 4.5 kb stretch of DNA was amplified and sequenced with Circular Consensus Sequencing (CCS). This means the error rate is around 1%, but a lot of these errors are indels.
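
To put the 1% figure in perspective, here is a rough calculation of what it means for a 4.5 kb read. It is only a sketch: it assumes errors are independent and uniformly distributed along the read, which is a simplification for CCS data, and the variable names are illustrative.

```python
# Back-of-the-envelope numbers for the reads described above.
# Assumption (not from the post): errors are independent per base
# and occur at a uniform rate along the read.
read_length = 4500   # amplicon length in bases (from the post)
error_rate = 0.01    # roughly 1% per-base error after CCS (from the post)

# Expected number of erroneous positions in one read.
expected_errors = read_length * error_rate

# Probability that a single read contains no error at all.
p_error_free = (1 - error_rate) ** read_length

print(f"expected errors per read: {expected_errors:.0f}")
print(f"probability that a read is error-free: {p_error_free:.2e}")
```

Under these assumptions a read carries about 45 errors on average, so it is very unlikely that two reads of the same template are exactly identical.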

I want to cluster these sequences to create OTUs. What software would you use?

Clustering software I have tried so far, and why it did not work (well):

vsearch/usearch: Relies on finding starting points for clusters by sorting reads by abundance. My reads contain enough errors to make almost all of them singletons, so sorting by abundance gives no useful order.
cd-hit: Chooses the cluster starting point by longest sequence. This sometimes chooses the wrong starting point, creating two clusters where there should be one.
swarm: Only works for low error rates.
HPC-clust: Needs a multiple sequence alignment of all sequences. This is not feasible because of the high error rate and high biological variability.
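
The seed-order problem described for vsearch/usearch and cd-hit can be illustrated with a toy greedy centroid clustering. This is only a sketch and not the algorithm of any of those tools: the functions `edit_distance`, `identity` and `greedy_cluster`, the 20-base toy sequences and the 0.85 threshold are all made up for illustration, and identity is computed from a full-length (global) edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # gap in b
                            curr[j - 1] + 1,             # gap in a
                            prev[j - 1] + (ca != cb)))   # match or substitution
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Full-length identity, normalised by the longer sequence."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def greedy_cluster(ordered_reads, threshold):
    """Each read joins the first centroid it matches, else becomes a centroid."""
    clusters = {}  # centroid sequence -> list of member sequences
    for read in ordered_reads:
        for centroid in clusters:
            if identity(read, centroid) >= threshold:
                clusters[centroid].append(read)
                break
        else:
            clusters[read] = [read]
    return clusters


# Hypothetical toy data: one true template and two erroneous copies of it.
template = "ACGTTGCAAGCTTAGGCTAA"
with_insertion = template[:10] + "GG" + template[10:]  # 2 extra bases
with_deletion = template[:4] + template[6:]            # 2 missing bases
reads = [template, with_insertion, with_deletion]

# Seeding by length puts the read with the insertion first.
by_length = greedy_cluster(sorted(reads, key=len, reverse=True), 0.85)
# Seeding with the error-free template first.
by_template = greedy_cluster([template, with_insertion, with_deletion], 0.85)

print("clusters when seeding by length:", len(by_length))
print("clusters when seeding with the template:", len(by_template))
```

In this toy case the longest read is an erroneous one, and using it as the first seed pushes the read with the deletion below the threshold, so the same three reads end up in two clusters instead of one.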

PacBio clustering • 2.8k views

7.0 years ago by lelle ▴ 830

Vmatch works well for clustering, though I am not sure how it handles the indel problem. Since indels are the more common error type, you could lower the indel penalty and lower the identity threshold based on the distribution of indels.

7.0 years ago by Rohit ★ 1.5k
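
The idea of lowering the indel penalty can be sketched with a global edit distance in which indels cost less than substitutions. This is a generic illustration and not Vmatch's scoring scheme or parameters: the function `weighted_distance`, the cost values and the toy sequences are hypothetical.

```python
def weighted_distance(a: str, b: str, sub_cost: float = 1.0,
                      indel_cost: float = 1.0) -> float:
    """Global edit distance with separate substitution and indel costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                          # gap in b
                curr[j - 1] + indel_cost,                      # gap in a
                prev[j - 1] + (0.0 if ca == cb else sub_cost)  # (mis)match
            ))
        prev = curr
    return prev[-1]


# Hypothetical toy data: the read differs from the reference by
# one deleted base and one inserted base.
reference = "ACGTTGCAAGCTTAGGCTAA"
read = reference[:4] + reference[5:12] + "T" + reference[12:]

unit = weighted_distance(reference, read)                      # indels cost 1
relaxed = weighted_distance(reference, read, indel_cost=0.5)   # indels cost 0.5

print("distance with unit indel cost:", unit)
print("distance with reduced indel cost:", relaxed)
```

With the reduced indel cost the same pair of sequences scores as more similar, which is the effect one would want if most sequencing errors are indels. How far to lower the cost would have to be tuned against the observed error profile.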

Thanks for the suggestion. It seems the clustering would be based on local alignments/matches. I need something that takes into account the full sequences. I would also prefer to use free (as in freedom) or at least open source software.

7.0 years ago by lelle ▴ 830

Vmatch can cluster and keep only the longest sequence from each cluster (the -nonredundant option). The license is also free for academic use and easy to obtain.

7.0 years ago by Rohit ★ 1.5k

I don't think the -nonredundant option would address my problem. I want the sequences clustered by full-length similarity. Unless I misunderstand the documentation (I did not read all of it), vmatch searches for local matches and clusters by those. If I understand correctly, it might be possible to force full-length matches by setting the psmal parameter of the -dbcluster option to 100. But at that point it just feels like I am forcing the tool to do something it was not designed for.

7.0 years ago by lelle ▴ 830
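
The full-length requirement discussed here amounts to a coverage filter on local matches: a match only counts for clustering if it spans a minimum percentage of the shorter sequence. The sketch below shows that idea in general terms. It is an assumption about how such a filter could work, not Vmatch's implementation, and `LocalMatch`, `covers_shorter` and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LocalMatch:
    """A local match between sequences a and b (half-open coordinates)."""
    start_a: int
    end_a: int
    start_b: int
    end_b: int


def covers_shorter(match: LocalMatch, len_a: int, len_b: int,
                   min_percent: float) -> bool:
    """True if the match spans at least min_percent of the shorter sequence."""
    if len_a <= len_b:
        covered, shorter = match.end_a - match.start_a, len_a
    else:
        covered, shorter = match.end_b - match.start_b, len_b
    return 100.0 * covered / shorter >= min_percent


# Hypothetical example: a 4500-base read matched against a 4520-base read.
partial = LocalMatch(start_a=50, end_a=4450, start_b=60, end_b=4465)
full = LocalMatch(start_a=0, end_a=4500, start_b=10, end_b=4515)

print(covers_shorter(partial, 4500, 4520, 95))   # spans about 97.8% of the shorter read
print(covers_shorter(partial, 4500, 4520, 100))  # does not span the whole shorter read
print(covers_shorter(full, 4500, 4520, 100))     # spans the whole shorter read
```

A threshold of 100 only accepts matches that span the entire shorter sequence, which is the behaviour being asked for. With indel-rich reads, a slightly lower threshold may be more forgiving at the read ends.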

-dbcluster 100 would be the best option, as you said. The local matches are chained in later steps; global alignments are more forced than chained local matches.

7.0 years ago by Rohit ★ 1.5k