Sequence filtration Program


Tool: Sequence filtration tool

1


7.1 years ago

Eslam Samir ▴ 110


Sequence database curator

https://github.com/Eslam-Samir-Ragab/Sequence-database-curator

This program filters a nucleotide or protein database against a list of names or sequences (by exact match).

Input:

File containing all the sequences in FASTA format.

Processing:

It removes the sequences you specify from your database.

Output:

One file, with a name of your choice.

Options:

The program works on either protein (-p) or nucleotide (-n) databases.
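The processing step described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the program's actual code (which relies on Biopython). The function names `read_fasta` and `remove_matches` are hypothetical, and the sketch assumes a sequence's "name" is the whole header line after the ">" character.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def remove_matches(records, targets, mode="name"):
    """Yield the records whose name (mode="name") or sequence (mode="seq")
    is NOT an exact match for any entry in targets."""
    targets = set(targets)
    for name, seq in records:
        key = name if mode == "name" else seq
        if key not in targets:
            yield name, seq


# Hypothetical three-record database used only for illustration.
fasta = """>seq1
ACGT
ACGT
>seq2
TTTT
>seq3
GGCC
""".splitlines()

kept_by_name = list(remove_matches(read_fasta(fasta), {"seq2"}, mode="name"))
kept_by_seq = list(remove_matches(read_fasta(fasta), {"TTTT", "GGCC"}, mode="seq"))
```

Holding the filter entries in a set keeps each exact-match lookup cheap, which matters when the filter list is long.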

How to use:

Install Python 2.7 or Python 3 on your machine.
Install NumPy and Biopython.
Install the future module with pip.
Click “Clone or download” > “Download ZIP”, then extract the downloaded file.
Run the file “sequence_filteration.py”:
- Windows: open it with python.exe
- Unix/Linux: make it executable with the command chmod u+x sequence_filteration.py
- Mac: use the command python sequence_filteration.py
State your variables and press Enter.

The program's options are summarized in the README file.

Examples

If you want to process nucleotide sequences, use the following command:

python sequence_filteration.py -in (input_file) -n -out (output_file) -filter (filter_file) -flt_mode seq

If you want to process protein sequences with the optimum length approach, use the following command:

python sequence_filteration.py -in (input_file) -p -out (output_file) -filter (filter_file) -flt_mode seq

If you want to process nucleotide sequences against a file containing a list of exact names, use the following command:

python sequence_filteration.py -in (input_file) -n -out (output_file) -filter (filter_file) -flt_mode name

Related:

Sequence Database curator tool

fasta sequence filter • 2.8k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 7.1 years ago by Eslam Samir ▴ 110

3


While this is a basic tool, you have made huge improvements in your coding style and layout in less than a month. I'm seriously deeply impressed. I hope you're enjoying it too because there's probably still a long way to go. But yeah, you obviously have a lot of passion for this, and like I said your coding style is great. Please keep it up! :)

My only advice would be to try and find a more interesting problem to solve for your next one. I don't know what your situation at work is like, but if you can find new problems to solve by asking the people around you what tools they would want in an ideal world, then that could be the perfect seed for an idea for your next tool :)

ADD REPLY • link 7.1 years ago by John 13k

3


@Eslam, you did a great job of describing the installation and usage of your tool, complete with a diagram. I would suggest, though, that the diagram seems a bit unrelated to names, which is a little confusing. Also, it would help if you could describe in more detail the meaning of the flags - for example, does "-filter" specify an output file or an input file? Formatting-wise, this is an excellent post.

I suggest you modify the tool slightly to allow users to invert the selection and filter either inclusively or exclusively of the list, or else, provide two output streams, one for sequences matching the filter, and one for sequences not matching the filter. I find that in practice splitting based on name tends to be useful.
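As a rough sketch of the two-output-stream idea, assuming records are already available as (name, sequence) pairs: the function name `split_by_name` and the sample data are hypothetical, and nothing here is taken from the tool itself.

```python
def split_by_name(records, names):
    """Split records into (matched, unmatched) lists by exact name."""
    names = set(names)
    matched, unmatched = [], []
    for name, seq in records:
        # Route each record to exactly one of the two output lists.
        (matched if name in names else unmatched).append((name, seq))
    return matched, unmatched


# Hypothetical records and filter list, for illustration only.
records = [("seq1", "ACGT"), ("seq2", "TTTT"), ("seq3", "GGCC")]
matched, unmatched = split_by_name(records, ["seq2", "seq3"])
```

Writing `unmatched` to the output file gives exclusive filtering, writing `matched` gives the inverted (inclusive) selection, and writing both to separate files gives the two streams.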

ADD REPLY • link 7.1 years ago by Brian Bushnell 20k

0


Dr @Brian, thanks for pointing out some of the defects of the previous tool. Here is the next release: Sequence Dereplication and Database Curator Program (SDDC). I hope it will be satisfactory for this job, so that I can move on to more realistic problems :)

ADD REPLY • link 7.1 years ago by Eslam Samir ▴ 110

0


Hi Eslam,

I recommend that you update your original post - right now it says:

if you want to process a nucleotide sequences with a file containing list of exact names use the following command

python sequence_filteration.py -in (input_file) -n -out (output_file) -filter (filter_file) -flt_mode name

...which is great - I love command lines that are easy to follow, rather than regexes and similar. For example:

bushnell@mc1629:/global/projectb/sandbox/gaag/bbtools/nt$ grep
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
bushnell@mc1629:/global/projectb/sandbox/gaag/bbtools/nt$ grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c

Regexp selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp        PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN for matching
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
      --include=FILE_PATTERN  search only files that match FILE_PATTERN
      --exclude=FILE_PATTERN  skip files and directories matching FILE_PATTERN
      --exclude-from=FILE   skip files matching any file pattern from FILE
      --exclude-dir=PATTERN  directories that match PATTERN will be skipped.
  -L, --files-without-match  print only names of FILEs containing no match
  -l, --files-with-matches  print only names of FILEs containing matches
  -c, --count               print only a count of matching lines per FILE
  -T, --initial-tab         make tabs line up (if needed)
  -Z, --null                print 0 byte after FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context=NUM         print NUM lines of output context
  -NUM                      same as --context=NUM
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN is `always', `never', or `auto'
  -U, --binary              do not strip CR characters at EOL (MSDOS)
  -u, --unix-byte-offsets   report offsets as if CRs were not there (MSDOS)

`egrep' means `grep -E'.  `fgrep' means `grep -F'.
Direct invocation as either `egrep' or `fgrep' is deprecated.
With no FILE, or when FILE is -, read standard input.  If less than two FILEs
are given, assume -h.  Exit status is 0 if any line was selected, 1 otherwise;
if any error occurs and -q was not given, the exit status is 2.

Report bugs to: bug-grep@gnu.org
GNU Grep home page: <http://www.gnu.org/software/grep/>
General help using GNU software: <http://www.gnu.org/gethelp/>

So, that's a lot of information. I still have no idea what "grep [OPTION]... PATTERN [FILE]..." means, because it's some custom language understood by the developer, and not the users. Why are there ellipses? What do the brackets mean? Why does OPTION have brackets but not PATTERN? Basically, with this kind of help, without any examples of the kind of things a typical user might type, it is impossible to use without extensive trial and error. Which basically negates the purpose of documentation.

So - to make your tool friendly, I suggest you keep the documentation up to date with the current feature-set, and also write documentation such that anyone who reads it will be able to use the tool correctly - with a goal of using it correctly on the first try, even for people that are not Linux-proficient.

Note - I deleted random lines from grep's output to fit biostars' line limit.
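One possible way to keep help text and worked examples in sync with the code is to generate both from the argument parser. The sketch below uses Python's argparse and mirrors the flags shown in the original post. It is an assumption about how such a parser could look, not the tool's actual implementation, and the file names (db.fasta, kept.fasta, names.txt, seqs.fasta) are made up.

```python
import argparse

# Worked examples printed at the end of the -h output.
EXAMPLES = """\
Examples:
  Remove nucleotide sequences whose names are listed in names.txt:
    python sequence_filteration.py -in db.fasta -n -out kept.fasta -filter names.txt -flt_mode name

  Remove protein sequences that exactly match a sequence in seqs.fasta:
    python sequence_filteration.py -in db.fasta -p -out kept.fasta -filter seqs.fasta -flt_mode seq
"""


def build_parser():
    parser = argparse.ArgumentParser(
        prog="sequence_filteration.py",
        description="Remove sequences from a FASTA database by exact name or sequence match.",
        epilog=EXAMPLES,
        # Keep the line breaks in EXAMPLES instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    # "in" is a Python keyword, so the value is stored under dest="in_file".
    parser.add_argument("-in", dest="in_file", required=True, metavar="FASTA",
                        help="INPUT: FASTA file containing the database to filter")
    parser.add_argument("-out", dest="out_file", required=True, metavar="FASTA",
                        help="OUTPUT: file to write the remaining sequences to")
    parser.add_argument("-filter", dest="filter_file", required=True, metavar="FILE",
                        help="INPUT: file listing the names or sequences to remove")
    parser.add_argument("-flt_mode", choices=["name", "seq"], required=True,
                        help="match the filter entries against names or sequences")
    # Exactly one of -n / -p must be given.
    db_type = parser.add_mutually_exclusive_group(required=True)
    db_type.add_argument("-n", dest="db_type", action="store_const", const="n",
                         help="the database holds nucleotide sequences")
    db_type.add_argument("-p", dest="db_type", action="store_const", const="p",
                         help="the database holds protein sequences")
    return parser


args = build_parser().parse_args(
    ["-in", "db.fasta", "-n", "-out", "kept.fasta", "-filter", "names.txt", "-flt_mode", "name"]
)
help_text = build_parser().format_help()
```

Each help string states whether a flag is an input or an output, and the epilog shows complete commands a first-time user can copy.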

ADD REPLY • link 7.1 years ago by Brian Bushnell 20k

1


Dr @John, thanks a lot for all of this precious advice. I'm now working on something that I hope will have a better impact than this simple idea :)

ADD REPLY • link 7.1 years ago by Eslam Samir ▴ 110