How to get a FASTA file where you filter by length to exclude sortest and longest 10%
0
0
Entering edit mode
5.2 years ago
namarino • 0

Hello,

I have a pretty big data FASTA file with protein sequences. I would like to filter it by length so that it excludes sequences that are at the ends (ie: Cut shortest 10% and the longest 10%). I've looked through the forums, but only see ones where people filter by specific length

So far, I've tried:

awk '{/>/&&++a||b+=length()}END{print b/a}' uniprot_input.fasta

From Mean Length Of Fasta Sequences

// The average length is 216.817

However, I'm really new at using the command line, awk, and handling files. Thanks in advance for your help!

alignment sequence fasta linux awk • 3.0k views
ADD COMMENT
0
Entering edit mode

Why don't you try to write it yourself for practice. First, go through the file to get a summary of the lengths. Then calculate what the shortest 10% and longest 90% are and feed this in a new awk command with appropriate if statements, only selecting sequences above the 10% and below the 90%. I am sure you can get this done, and it will help you solve these things in the future.

ADD REPLY
0
Entering edit mode

you can use seqkit to filter sequences by min and maximum length... namarino

ADD REPLY
0
Entering edit mode

True, but how do you calculate 10th and 90th percentile of lengths?

ADD REPLY
1
Entering edit mode

It's simple: sorting by length, calculating q10 and q90, and retrieving seqs in this range.

# Step 1
#  Getting total number
n=$(seqkit stats test.fa -T | sed 1d | cut -f 4)

# Step 2, option A
# "seqkit sort -l" needs unique headers.
# "seqkit range" retrieve records in a range.
seqkit rename test.fa | seqkit sort -l \
    | seqkit range -r $(($n/10+1)):$(($n*9/10)) > result.fa

# Step 2, option B. Use this ↓↓↓
# another version keeping original FASTA header
seqkit fx2tab -l test.fa | sort -k4,4n | seqkit tab2fx \
    | seqkit range -r $(($n/10+1)):$(($n*9/10)) > result.fa

Tests:

  1. Generating test dataset: 100000 seqs in length of 1 to 100bp

    $ cat gen-rand-seqs.py 
    #!/usr/bin/env python
    import numpy as np
    
    for n in ( x * 100 for x in np.random.normal(0.5, 0.1, 100000) ):
        print(">{}\n{}".format(int(n), "A"*int(n)))
    
    #normal distribution
    python gen-rand-seqs.py > test.fa
    
    # uniform distribution
    perl -e 'for (1..100000){$n=rand(100)+1; printf(">%d\n%s\n", $n, "A" x $n);}' > test.fa
    
    # plot length distribution
    seqkit fx2tab -l test.fa | cut -f 4 | csvtk plot hist -H --x-min 0 --x-max 100 --bins 50  > 1.png
    
    # summary
    seqkit stats test.fa
    file     format  type  num_seqs    sum_len  min_len  avg_len  max_len
    test.fa  FASTA   DNA    100,000  4,953,319        8     49.5       99
    
  2. Calculating q10 and q90, then choosing target range:

    n=$(grep -c '^>' test.fa)
    
    echo $(($n/10+1)):$(($n*9/10))
    10001:90000
    
  3. Sorting and retrieving

    seqkit fx2tab -l test.fa | sort -k4,4n | seqkit tab2fx \
        | seqkit range -r $(($n/10+1)):$(($n*9/10)) > result.fa
    
    # length distribution
    seqkit fx2tab -l result.fa | cut -f 4 | csvtk plot hist -H --x-min 0 --x-max 100 --bins 40 > 2.png
    
  4. Length distribution (uniform distribution)

    Before

    After

  5. Length distribution (normal distribution)

    Before http://i66.tinypic.com/2r6l2kh.png , After http://i65.tinypic.com/5oawpw.png

ADD REPLY
0
Entering edit mode

Thank you! I hadn't been much aware of seqkit, but will definitely start implementing it.

ADD REPLY
1
Entering edit mode

seems datamash can with perc operation. https://www.gnu.org/software/datamash/manual/datamash.html

ADD REPLY
0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLY
0
Entering edit mode

Thank you! It was my first post, so I will definitely apply this in the future

ADD REPLY

Login before adding your answer.

Traffic: 2057 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6