Biostar Beta. Not for public use.
How to extract only sequences that contain a specific percentage of Cysteines (C)
0
Entering edit mode
12 months ago
kamel • 10

Hello Biostar
I have a multifasta file that contains protein sequences (of different sizes), I want to extract only those sequences that contain at least 5% of cysteines (C).
Can you give me a little script so that I can extract these sequences
Thank you in advance

ADD COMMENTlink
1
Entering edit mode

See this updated post:

https://web.archive.org/web/20071027112709/http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/

Let's use Python:

make your sequences looks like 2 strings each:

> header
1-line sequence

def c_part_in_string(s): 
    countC = 0

    for i in range (0, len(s)):
        if s[i] == "C":
            countC += 1
    return countC/len(s)

    for line in my_list:
        if line.startswith('>'):
            print(line)
        else:
            if c_part_in_string(line) > 0.05: 
                print(line)

This is just an idea.

ADD REPLYlink
0
Entering edit mode

Sorry, we cannot "give you a little script". Please tell us what you've tried so far and where you're facing difficulties, and we can help you with specifics.

ADD REPLYlink
0
Entering edit mode

I can quantify the number of C (cystein) by grep but in each sequence (ID). I have several protein sequences in a single multifasta file, I am looking for an idea to quantify the number of C in each sequence at a time, for extract the sequences which contains 5% cysteine.

ADD REPLYlink
2
Entering edit mode
11 months ago
JC 7.9k
Mexico

Definitively you need some programming skills to do it by yourself. Here is some Perl code to do this:

#!/usr/bin/perl

use strict;
use warnings;
my $minC = 5; # Filter sequence below this percentage

$/ = "\n>"; # Read Fasta sequences in blocks
while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_); # Separate sequence id from the sequence lines
    my $seq = uc (join ("", @seq)); # Linearize the sequence in a single string, also using upper-case letters
    my $len = length $seq; # Get sequence size
    my $numC = $seq =~ tr/C/C/; # Quick way to count chars in a string
    my $perC = 100 * $numC / $len; # Calc percentage of Cs
    if ($perC >= $minC) { # Filter
         print ">$_";
    }
}

Usage: perl filterByC.pl < Fasta_In > Fasta_out

ADD COMMENTlink
0
Entering edit mode

Thanks for the help @JC

ADD REPLYlink
2
Entering edit mode
3 months ago
Germany

With awk:

$ awk -v FS="\n" -v RS=">" '$0 { seq=$0; $1="";  c=gsub("C", "_", $0); l=gsub(/[A-Z_]/, ".", $0); if(c/l>0.05) printf ">"seq}' input.fa
  • We set > as the record separator and \n as a the field separator.
  • If our record is not empty we save the whole record to seq.
  • We remove the id because we don't want to count the C there.
  • Then we count the replacement made from C to _ to get the number of C.
  • To get the total number of amino acid we count the replacements of each letter and a _.
  • If the fraction is greater 0.05 we print out the original sequence.
ADD COMMENTlink
0
Entering edit mode

Thanks for the help @finswimmer this is very useful

ADD REPLYlink
0
Entering edit mode
11 months ago
India

with seqkit and awk:

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=10 {print ">"$1,$2}'
>a
atgccc
>c
ac
>d
c

$ seqkit fx2tab -B "C" test.txt | awk -v OFS="\n" '$3>=100 {print ">"$1,$2}'
>d
c

input:

$cat test.txt 
>a
atgccc
>b
atc
>c
ac
>d
c
ADD COMMENTlink
0
Entering edit mode

Thanks for the help

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1