Substring dereplication of protein sequences
6.4 years ago
smiller ▴ 70

I would like to dereplicate a 3 GB fasta file of amino acid sequences, including the removal of shorter sequences that occur within longer ones (substring dereplication). The goal is twofold: to build a smaller database against which to search peptide mass spectra, and to identify abundant sequences in the file.
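To make the requirement concrete, here is a minimal Python sketch of what substring dereplication means. It is only an illustration: the function name `substring_dereplicate` and the toy records are hypothetical, and the pairwise comparison is far too slow for a 3 GB file.

```python
def substring_dereplicate(records):
    """Keep only sequences that are not contained in an already kept sequence.

    records: iterable of (header, sequence) pairs.
    Returns the kept (header, sequence) pairs, longest first.
    """
    kept = []
    # Process longest first, so any sequence that could contain the
    # current one has already been examined.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in kept_seq for _, kept_seq in kept):
            kept.append((header, seq))
    return kept


# Toy input (hypothetical sequences, for illustration only).
records = [
    ("p1", "MKTAYIAKQRQISFVK"),
    ("p2", "AYIAKQRQI"),          # substring of p1
    ("p3", "MKTAYIAKQRQISFVK"),   # exact duplicate of p1
    ("p4", "GSHMLEDPVA"),
]
result = substring_dereplicate(records)
print([header for header, _ in result])
```

Exact duplicates are handled by the same test, since a sequence is trivially a substring of an identical one.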

So far, I have explored prefix dereplication in vsearch (slightly different from what I ideally want) and substring dereplication in usearch, but neither is satisfactory. Prefix dereplication in vsearch does not support protein sequences. Substring dereplication in usearch requires v5.2, and the freely available build of usearch 5.2 has a 2 GB memory limit, which is insufficient for this file.

Does anyone know of a tool that will suit my needs? Thanks in advance.

dereplication proteomics • 2.5k views

I am not sure whether substring dereplication is part of it, but you could take a look at CD-HIT for this purpose.


cd-hit-dup fails with the message:

cd-hit-dup: cdhit-dup.cxx:193: int HashingDepth(int, int): Assertion `len >= min' failed.

This may be because I have sequences as short as 9 residues. Fundamentally, cd-hit-dup is geared toward longer nucleotide sequences, and it does not do any form of substring dereplication.


The PIR peptide search could be helpful, even if it does not answer your question.


genomax's comment led me to the exact solution I wanted. Instead of cd-hit-dup, use cd-hit, the main clustering program in the CD-HIT package. It clusters sequences, including shorter sequences contained in longer ones, and the sequence identity threshold can be set to 100% with -c 1. The following command writes two files: a dereplicated fasta file and a cluster file (<output fasta>.clstr) listing the sequences in each cluster.

./cd-hit -i <input fasta> -o <output fasta> -c 1 -t 1 -d 0
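Because the original goal also included identifying abundant sequences, the cluster file can be summarized by cluster size. The sketch below is a rough Python example, not part of CD-HIT itself. It assumes the usual .clstr layout, in which each cluster starts with a `>Cluster N` line, each member line contains `>name...`, and the representative line ends with `*`. The function name `parse_clstr` and the example text are hypothetical, so check them against your own output.

```python
import re

# Member lines are assumed to look like: "1<TAB>9aa, >p2... at 100.00%"
MEMBER = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Return a list of (representative, [member ids]) tuples, one per cluster."""
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster begins; the representative is not known yet.
            clusters.append([None, []])
            continue
        match = MEMBER.search(line)
        if match is None or not clusters:
            continue
        seq_id = match.group(1)
        clusters[-1][1].append(seq_id)
        if line.endswith("*"):
            # The trailing asterisk marks the cluster representative.
            clusters[-1][0] = seq_id
    return [(rep, members) for rep, members in clusters]


# Hypothetical cluster file content, for illustration only.
example = """>Cluster 0
0\t16aa, >p1... *
1\t9aa, >p2... at 100.00%
2\t16aa, >p3... at 100.00%
>Cluster 1
0\t10aa, >p4... *
""".splitlines()

clusters = parse_clstr(example)
# Largest clusters first: their representatives are the most replicated sequences.
by_size = sorted(clusters, key=lambda c: len(c[1]), reverse=True)
for rep, members in by_size:
    print(rep, len(members))
```

For a real run, pass an open file handle for the .clstr file to `parse_clstr` instead of the example lines.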
6.4 years ago
smiller ▴ 70

Use the cd-hit program from the CD-HIT package for this task, clustering at 100% sequence identity:

./cd-hit -i <input fasta> -o <output fasta> -c 1 -t 1 -d 0