Question

Remove entire sequences that include specific letters

5.3 years ago

Jason ▴ 10

Hi all,

I need help with a shell command or a Perl script.

I have 400 sequences in a FASTA file, and some of them contain X, B, or Z.

I want to remove every sequence that contains X, B, or Z from the file.

I found this shell command, but it only deletes those letters from the sequences:

sed '/^[^>]/s/[XBZ]//g' input_file.fasta > output_file.fasta

My goal is to remove the whole sequence (header and residues) whenever it includes one of these letters.

Each of my sequences is on a single line.

For example:

>sp|Q9M7X9|CITRX_ARATH Thioredoxin-like protein CITRX, chloroplastic OS=Arabidopsis thaliana OX=3702 GN=CITRX PE=1 SV=1
MALVQSRTFPHLNTPLSPILSSLHAPSSLFIXREIRPVAAPXXSSTAGNLPFSPLTRPRKLLCPPPRGKFVREDYLVKKLSAQELQELVKGDRKVPLIVDFYATWCGPCILMAQELEMLAVEYESNAIIVKVDTDDEYEFARDMQVRGLPTLFFISPDPSKDAIRTEGLIPLQMMHDIIDNEM
>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV
>sp|Q99MD6|TRXR3_MOUSE Thioredoxin reductase 3 OS=Mus musculus OX=10090 GN=Txnrd3 PE=1 SV=3
MEKPPSPPPPPRAQTSPGLGKVGVLPNRRLGAVRGGLMSBBRRARLASPGTSRPSSEAREELRRRLRDLIEGNRVMIFSKSYCPHSTRVKELFSSLGVVYNILELDQVDDGASVQEVLTEISNQKTVPNIFV

The result should drop every sequence that contains X, B, or Z:

>sp|P22217|TRX1_YEAST Thioredoxin-1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX1 PE=1 SV=3 
MVTQFKTASEFDSAIAQDKLVVVDFYATWCGPCKMIAPMIEKFSEQYPQADFYKLDVDELGDVAQKNEVSAMPTLLLFKNGKEVAKVVGANPAAIKQAIAANA
>sp|P22803|TRX2_YEAST Thioredoxin-2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=TRX2 PE=1 SV=3 
MVTQLKSASEYDSALASGDKLVVVDFFATWCGPCKMIAPMIEKFAEQYSDAAFYKLDVDEVSDVAQKAEVSSMPTLIFYKGGKEVTRVVGANPAAIKQAIASNV

sequencing • 1.3k views

updated 5.3 years ago by shenwei356 8.4k • written 5.3 years ago by Jason ▴ 10

Will you remove sequences containing J?

    A   Ala Alanine
    B   Asx Aspartic acid or asparagine [2]
    C   Cys Cysteine
    D   Asp Aspartic acid
    E   Glu Glutamic acid
    F   Phe Phenylalanine
    G   Gly Glycine
    H   His Histidine
    I   Ile Isoleucine
    J   Xle Isoleucine or leucine [4]
    K   Lys Lysine
    L   Leu Leucine
    M   Met Methionine
    N   Asn Asparagine
    O   Pyl Pyrrolysine [6]
    P   Pro Proline
    Q   Gln Glutamine
    R   Arg Arginine
    S   Ser Serine
    T   Thr Threonine
    U   Sec Selenocysteine [5,6]
    V   Val Valine
    W   Trp Tryptophan
    Y   Tyr Tyrosine
    Z   Glx Glutamine or glutamic acid [2]
    X   Xaa Unknown amino acid
    .       Gap
    *       End
References:
    1. http://www.bioinformatics.org/sms/iupac.html
    2. http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
    3. http://www.bioinformatics.org/sms2/iupac.html
    4. http://www.matrixscience.com/blog/non-standard-amino-acid-residues.html
    5. http://www.sbcs.qmul.ac.uk/iupac/AminoAcid/A2021.html#AA21
    6. https://en.wikipedia.org/wiki/Amino_acid

5.3 years ago by shenwei356 8.4k
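
To see which of the ambiguous or rare codes from the table actually occur in a file before deciding what to drop, a rough tally can help. The following is only a sketch: `count_ambiguous` is a hypothetical helper name, the letter set B, J, O, U, X, Z is taken from the table above, and `grep -o` is assumed to be available (it is in GNU and BSD grep, but POSIX does not require it).

```shell
# Hypothetical helper: tally ambiguous or rare residue letters in a FASTA file.
# Reads the named files, or standard input when no file is given.
count_ambiguous() {
    cat "$@" |
        grep -v '^>' |          # ignore header lines
        grep -o '[BJOUXZ]' |    # emit each flagged letter on its own line
        sort |
        uniq -c                 # count how often each letter occurs
}
```

Usage: count_ambiguous input_file.fasta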

score 1 · Answer 1 · 2019-01-01

You don't need a substitution; just search for a match in each sequence and skip the record if one is found:

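#!/usr/bin/perl

use strict;
use warnings;

$/ = "\n>"; # Read FASTA records in blocks, one record per read
while (<>) {
    chomp;      # drop the trailing "\n>" record separator
    s/^>//;     # drop the leading ">" of the first record only
    s/\s+\z//;  # drop the trailing newline of the last record
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    next if ($seq =~ m/[ZXB]/); # skip sequences with Z, X or B
    print ">$_\n";
}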

Usage: perl removeSeqs.pl < FASTA_IN > FASTA_OUT
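
Because the question states that every sequence sits on a single line, the same filter can also be sketched in plain awk. This is only an illustration under that assumption (one header line followed by exactly one sequence line); `filter_xbz` is a made-up function name, not part of any tool mentioned in this thread.

```shell
# Hypothetical helper: keep only records whose sequence line has no X, B or Z.
# Assumes two lines per record: a ">" header, then the whole sequence.
filter_xbz() {
    awk '
        /^>/ { header = $0; next }              # remember the header, print nothing yet
        NF && !/[XBZ]/ { print header; print }  # clean sequence: print header and sequence
    ' "$@"
}
```

Usage: filter_xbz input_file.fasta > output_file.fasta

Header lines are never tested against the pattern, so an X, B or Z in a sequence name does not cause a record to be dropped. For FASTA files with wrapped sequences, the Perl script above is the safer choice.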

score 0 · Answer 2 · 2019-01-02

5.3 years ago

shenwei356 8.4k

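Try seqkit grep (usage).

seqkit grep -i -s -r -p '[zxb]' -v input_file.fasta > output_file.fasta

# The same command with long flags, reading from standard input:
# cat test.fa | seqkit grep --ignore-case --by-seq --use-regexp --pattern '[zxb]' --invert-match

# Or with one pattern per letter instead of a regular expression:
# seqkit grep -i -s -p z -p x -p b -v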
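
Whichever command is used, it may be worth confirming that the filtered file really is free of the flagged letters. The check below is a small sketch and not a seqkit feature; `check_clean` is a hypothetical name, and it treats upper- and lower-case letters alike.

```shell
# Hypothetical check: exit status 0 if no sequence line contains X, B or Z
# (either case), exit status 1 as soon as one is found.
check_clean() {
    awk '
        !/^>/ && /[XBZxbz]/ { bad = 1; exit }   # flagged residue on a sequence line
        END { exit bad + 0 }                    # 0 = clean, 1 = flagged residue found
    ' "$@"
}
```

Usage: check_clean output_file.fasta && echo "no X, B or Z left"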
