Delete fasta sequence with a pattern "unassigned peptidases"
3
0
Entering edit mode
7.1 years ago
ayaou • 0

Hello,

I'm searching for a "awk command" to delete all sequences with the pattern "unassigned peptidases" from a file. Here is an example of my data :

>MER0206023 - subfamily S1C unassigned peptidases (Abiotrophia defectiva) [S01.UPC]#S01C#{peptidase unit: 129-408}
GSGIIVDNDDKSLFIITNNHVIEGAKHLMVAFEDGTTAKGEVRGTAAYTDLALVEVKLSE
LDKKVINKIKVAKIGDSDGLKVGQMVMAIGNALGYGQSLTVGYVSARDRIVTVNDITMKL
IQTDAAINPGNSGGALLNLNGEVVGINSVKFSSRAIEGMGYAIPMATVKPLINELKSSKH
LTDTERGYLGIFYREIDDSTHEAFNLPYGLYISDVAKNGGAEKAGLLKGDIIIGLNDNET
LKSDAINSIILGKRKGDKVKVTFYRYENGEYVKHEVTVTL
sequence • 1.5k views
ADD COMMENT
2
Entering edit mode
7.1 years ago
 awk '/^>/ {OK=index($0,"unassigned peptidases")==0;} {if(OK) print;}' input.fa
ADD COMMENT
0
Entering edit mode

can you explain the command ?

ADD REPLY
2
Entering edit mode
  1. /^>/ {OK=index($0,"unassigned peptidases")==0;}, for a line starting with > (FASTA header line), if it does not contain unassigned peptidases, mark OK as true , and print this line ({if(OK) print;}).
  2. for next lines not starting with > (sequence line), if OK is true, print these lines.
  3. go to Step 1.
ADD REPLY
0
Entering edit mode

Thanks Pierre for showing the power of shell commands. I'm just curious can seqkit reach the speed of awk, but the result makes me surprised and doubtful:

Tests are performed on a fasta file of 2.7G on my laptop with i5 2-cores/4-threads, SSD.

$ seqkit stat dataset_A.fa -a
file          format  type  num_seqs        sum_len  min_len   avg_len    max_len  sum_gap      N50    L50
dataset_A.fa  FASTA   DNA     67,748  2,807,643,808       56  41,442.5  5,976,145        0  210,659  2,477

awk version:

$ time awk '/^>/ {OK=index($0,"whole genome")==0;} {if(OK) print;}' dataset_A.fa > t.awk      
awk '/^>/ {OK=index($0,"whole genome")==0;} {if(OK) print;}' dataset_A.fa >   20.27s user 0.49s system 99% cpu 20.840 tota

seqkit version.

$ time seqkit grep -n -r -p 'whole genome' -v dataset_A.fa > t.seqkit            
seqkit grep -n -r -p 'whole genome' -v dataset_A.fa > t.seqkit  1.87s user 0.54s system 99% cpu 2.409 total

The results are the same:

$ md5sum t.awk t.seqkit 
c592747347ea4ea2e624a03d3360c01a  t.awk
c592747347ea4ea2e624a03d3360c01a  t.seqkit
$ seqkit stat t.awk 
file   format  type  num_seqs      sum_len  min_len      avg_len    max_len
t.awk  FASTA   DNA         54  131,799,401    3,624  2,440,729.6  5,976,145
$ ls -lh t.awk
-rw-rw-r--. 1 shenwei shenwei 128M Mar 16 14:45 t.awk

awk is 10x slower??? And I also tried run su -c 'free && sync && echo 3 > /proc/sys/vm/drop_caches && free' before run every tests to drop page cache.

ADD REPLY
2
Entering edit mode

awk is 10x slower?

of course. awk is a scripting language . seqkit is compiled and specialized for the job.

ADD REPLY
0
Entering edit mode
7.1 years ago

using seqkit

seqkit grep -n -r -p 'unassigned peptidases' -v seqs.fa > new.fa
ADD COMMENT
0
Entering edit mode
7.1 years ago
$ pip install pyfaidx
$ faidx sequences.fa --regex '.*unassigned peptidases.*' --invert-match > no_peptidases.fa

You can find more usage for faidx here: https://github.com/mdshw5/pyfaidx#faidx

ADD COMMENT
0
Entering edit mode

You forget the option -f:

-f, --full-names      output full names including description. default:
                        False

Even using it, it still does not work, since faidx uses FASTA index file but descriptions are not in it.

PS: I'm using pyfaidx==0.4.8.2

Test:

$ faidx t.fa
>seq
actg
$ cat t.fa.fai 
seq     4       11      4       5
$ faidx t.fa --regex s
>seq
actg
$ faidx t.fa --regex o
$ faidx t.fa --regex o -f
$

# ----

rm t.fa.fai
$ faidx t.fa -f --regex s
>seq hello
actg
$ faidx t.fa -f --regex o
$
ADD REPLY

Login before adding your answer.

Traffic: 2067 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6