Question

Tool: Sequence Dereplication and Database Curator Program (SDDC) v2.0


7.1 years ago

Eslam Samir ▴ 110

Sequence Dereplication and Database Curator Program (SDDC)

https://github.com/Eslam-Samir-Ragab/Sequence-database-curator/releases/tag/v2.0

This program dereplicates and/or filters nucleotide and/or protein databases, using a list of names or sequences (by exact match).
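
As a rough illustration of what exact-match dereplication means, here is a minimal Python sketch. It is not SDDC's actual code: the function name `dereplicate`, the (name, sequence) tuple layout, and the case-insensitive comparison are all assumptions made for this example.

```python
def dereplicate(records):
    """Keep the first record for each distinct sequence (exact match).

    `records` is an iterable of (name, sequence) tuples. This layout is an
    assumption for the sketch, not SDDC's internal format.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()  # assumption: upper and lower case count as identical
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


records = [("seq1", "MKTAYIAK"), ("seq2", "MKTAYIAK"), ("seq3", "MSLLTEVE")]
print(dereplicate(records))
```

Because the first occurrence of each sequence is kept as it is encountered, this sketch also preserves the original order of the input.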

Version 2.0 Updates:

You can filter sequences by keywords only (separated by commas), inclusively or exclusively, by adding the (-kw) argument to your normal command line.
You can keep your sequences in their original order after dereplication and/or sequence filtration by adding (-org_order) to your normal command line.

How to use:

Install Python 2.7 or Python 3 on your machine.
Install NumPy and Biopython.
Install the future module with pip (pip install future).
Click “Clone or download” > “Download ZIP”, then extract the downloaded file.
Run the file “sddc.py”:
- Windows: open it with python.exe
- Unix/Linux: first make it executable with the command chmod u+x sddc.py
- Mac: use the command python sddc.py
State your variables and press Enter.
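
Before running the program, it may help to confirm that the required modules are visible to your Python interpreter. The following is a small sketch for Python 3 only (it relies on `importlib.util`); the helper name `missing_modules` is made up for this example and is not part of SDDC.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names from `names` that Python cannot locate."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Import names differ from package names: Biopython is imported as "Bio".
required = ["numpy", "Bio", "future"]
print("Python version:", sys.version.split()[0])
print("Missing modules:", missing_modules(required))
```

An empty list means all three modules were found.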

The list of program options can be downloaded from here:

Examples

To dereplicate protein sequences, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep

To dereplicate protein sequences and preserve the original order of the sequences in the new file, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -org_order

To dereplicate protein sequences with a minimum length of 30 when the sequences are spread across multiple files, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -min_length 30 -multi
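
To make the idea concrete, the sketch below pools records from several FASTA inputs, drops sequences shorter than a minimum length, and keeps one copy of each remaining sequence. It is an illustration only, not SDDC's implementation: `read_fasta` and `dereplicate_handles` are hypothetical helpers, and it assumes that a minimum length means "discard anything shorter".

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate_handles(handles, min_length=0):
    """Pool records from several handles, skip short ones, keep unique sequences."""
    seen, kept = set(), []
    for handle in handles:
        for header, seq in read_fasta(handle):
            if len(seq) < min_length or seq in seen:
                continue
            seen.add(seq)
            kept.append((header, seq))
    return kept


# Two in-memory "files" stand in for real FASTA files on disk.
file_a = io.StringIO(">p1\nMKTAYIAKQR\n>p2\nMKT\n")
file_b = io.StringIO(">p3\nMKTAYIAKQR\n>p4\nMSLLTEVETY\n")
print(dereplicate_handles([file_a, file_b], min_length=5))
```

With real files, the `io.StringIO` objects would be replaced by handles returned from `open(...)`.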

To dereplicate nucleotide sequences with the optimum approach and a normal protein length of 300, use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode derep -optimum -prot_length 300

To filter protein sequences inclusively by name (i.e. to retrieve only the sequences whose names you have specified), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach inclusive

To filter protein sequences inclusively by keyword(s) (i.e. to retrieve only the sequences whose names contain the keywords you have specified, separated by commas), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach inclusive -kw

To filter protein sequences exclusively by name (i.e. to retrieve the sequences that are not listed in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach exclusive
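
The difference between the inclusive and exclusive approaches can be sketched in a few lines of Python. This is only an illustration of the logic, not SDDC's code; `filter_by_name` and the (name, sequence) tuples are assumptions for the example.

```python
def filter_by_name(records, names, approach="inclusive"):
    """Filter (name, sequence) records by exact name match.

    inclusive: keep only records whose name is in `names`.
    exclusive: keep only records whose name is NOT in `names`.
    """
    wanted = set(names)
    if approach == "inclusive":
        return [(n, s) for n, s in records if n in wanted]
    if approach == "exclusive":
        return [(n, s) for n, s in records if n not in wanted]
    raise ValueError("approach must be 'inclusive' or 'exclusive'")


records = [("protA", "MKTAYIAK"), ("protB", "MSLLTEVE"), ("protC", "MGSSHHHH")]
filter_names = ["protA", "protC"]
print(filter_by_name(records, filter_names, "inclusive"))
print(filter_by_name(records, filter_names, "exclusive"))
```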

To filter protein sequences exclusively by keyword(s) in their names (i.e. to retrieve the sequences whose names do not contain the keywords, separated by commas, listed in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach exclusive -kw
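
Keyword filtering differs from name filtering in that a keyword only needs to appear somewhere in the name. The sketch below illustrates that idea under stated assumptions: matching is a case-insensitive substring test, a record matches if any keyword is present, and the helper names `parse_keywords` and `filter_by_keyword` are hypothetical rather than taken from SDDC.

```python
def parse_keywords(text):
    """Split a comma-separated keyword string into cleaned, lowercased keywords."""
    return [kw.strip().lower() for kw in text.split(",") if kw.strip()]


def filter_by_keyword(records, keywords, approach="inclusive"):
    """Keep records whose name contains (inclusive) or lacks (exclusive) any keyword."""
    def has_keyword(name):
        lowered = name.lower()
        return any(kw in lowered for kw in keywords)

    if approach == "inclusive":
        return [(n, s) for n, s in records if has_keyword(n)]
    if approach == "exclusive":
        return [(n, s) for n, s in records if not has_keyword(n)]
    raise ValueError("approach must be 'inclusive' or 'exclusive'")


records = [
    ("beta-lactamase TEM-1", "MSIQHFRV"),
    ("hypothetical protein", "MKTAYIAK"),
    ("DNA gyrase subunit A", "MSDLAREI"),
]
keywords = parse_keywords("lactamase, gyrase")
print(filter_by_keyword(records, keywords, "inclusive"))
print(filter_by_keyword(records, keywords, "exclusive"))
```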

To filter nucleotide sequences by sequence (exclusive only), use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode filter -flt_by seq -flt_file (filter_file)
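
Exclusive filtering by sequence amounts to removing every record whose sequence exactly matches one in the filter list. A minimal sketch follows; `exclude_by_sequence` is a hypothetical name and the case-insensitive comparison is an assumption, so this should not be read as SDDC's own implementation.

```python
def exclude_by_sequence(records, filter_sequences):
    """Drop records whose sequence exactly matches any sequence in the filter list."""
    unwanted = {s.upper() for s in filter_sequences}  # assumption: ignore case
    return [(n, s) for n, s in records if s.upper() not in unwanted]


records = [("read1", "ACGTACGT"), ("read2", "GGCCTTAA"), ("read3", "TTTTAAAA")]
print(exclude_by_sequence(records, ["ggccttaa"]))
```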

proteomics genetics metagenomics genomics • 2.5k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 7.1 years ago by Eslam Samir ▴ 110


6.2 years ago

Eslam Samir ▴ 110

Thank you so much for your comment.

Regarding the option of counting sequences: the counts can be obtained, but only indirectly. Please check the SDDC Cheat sheet; this point is covered on page 2 (grey box).

Regarding partial (fragmented) sequences: I tested the program against a synthetic database and a real database, for both protein and nucleotide sequences, including partial (fragmented) sequences, and the sensitivity and specificity were both 100%.

Thanks again for your comment. I would be glad to know if anything else is needed to make this tool more convenient to use.

ADD COMMENT • link 6.2 years ago by Eslam Samir ▴ 110

score 1 · Accepted Answer · 2018-02-07


6.2 years ago

Eskimo ▴ 120

This is a really nice little program to play around with and I am finding lots of interesting uses for it.

However, I am wondering if you could modify it to include a count function so that the number of duplicates of each sequence could be included in the fasta header for a given sequence?

Can you also confirm that fragmented sequences (i.e. sequences that are shorter than a parent sequence but still retain 100% identical match) are removed?

ADD COMMENT • link 6.2 years ago by Eskimo ▴ 120
