Biostar Beta.
Binning fasta sequences by size?
0
18 months ago
stacy734 • 40

Hi everyone,

I have a large fasta file and need to bin it by size: everything 200 nt and up in one file, and everything 199 nt or smaller in another. I have found some useful scripts that remove the smaller sequences, but those sequences are simply discarded, and I need to keep them in a separate file.

Can anyone suggest a perl script (or anything similar) that will sort out sequences into two files by size cutoff?

Thanks in advance for any advice.

Stacy

0

Thanks to you both!

0

If an answer was helpful you should upvote it; if the answer resolved your question you should mark it as accepted.

4
20 months ago
st.ph.n ♦ 2.5k
Philadelphia, PA

Linearize the FASTA first if you have multiple lines of sequence under each header; the script assumes one sequence line per record.

#!/usr/bin/env python

import sys
# open the two output files for writing
lt200 = open(sys.argv[2], 'w')
gt200 = open(sys.argv[3], 'w')

with open(sys.argv[1], 'r') as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            seq = next(f).strip()  # assumes one sequence line per record
            if len(seq) >= 200:
                gt200.write(header + '\n' + seq + '\n')
            else:
                lt200.write(header + '\n' + seq + '\n')

lt200.close()
gt200.close()

Save as bin_by_len.py and run as: python bin_by_len.py input.fasta lt200.fasta gt200.fasta
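If you'd rather skip the linearization step, here is a sketch of a variant that tolerates multi-line (wrapped) FASTA records by accumulating sequence lines until the next header. Same arguments as above; the function name and structure are my own choice, not from the original script:

```python
#!/usr/bin/env python
# Sketch: bin FASTA records by length, tolerating wrapped (multi-line)
# sequences, so no prior linearization is needed.
import sys

def bin_by_length(in_path, lt_path, gt_path, cutoff=200):
    """Write records shorter than `cutoff` to lt_path, the rest to gt_path."""
    with open(in_path) as f, open(lt_path, 'w') as lt, open(gt_path, 'w') as gt:
        header, chunks = None, []

        def flush():
            # write out the record accumulated so far, if any
            if header is not None:
                seq = ''.join(chunks)
                out = gt if len(seq) >= cutoff else lt
                out.write(header + '\n' + seq + '\n')

        for line in f:
            line = line.strip()
            if line.startswith('>'):
                flush()               # finish the previous record
                header, chunks = line, []
            elif line:
                chunks.append(line)   # sequence may span several lines
        flush()                       # don't forget the last record

if __name__ == '__main__' and len(sys.argv) == 4:
    bin_by_length(sys.argv[1], sys.argv[2], sys.argv[3])
```

Run the same way: python bin_by_len_multiline.py input.fasta lt200.fasta gt200.fasta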

3
4 weeks ago
genomax 68k
United States

Use reformat.sh from BBMap suite.

reformat.sh in=your_file.fa minlen=200 out=more_than_200.fa
reformat.sh in=your_file.fa maxlen=199 out=less_than_200.fa
2
19 months ago
France/Nantes/Institut du Thorax - INSE…

Linearize and dispatch with awk:

rm -f file1.fa file2.fa
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa   |\
awk -F '\t' '{out=length($2)<200?"file1.fa":"file2.fa";printf("%s\n%s\n",$1,$2) >> out;}'
1
18 months ago
Cambridge, MA

The faidx utility in pyfaidx has this capability built-in:

$ pip install pyfaidx
$ faidx --size-range 1,199 file.fa > smalls.fa
$ faidx --size-range 200,200000000 file.fa > bigs.fa

An advantage of this approach is that sequence lengths are read from the .fai index file rather than recomputed on each pass, so splitting into many bins will be much faster.
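The same idea can be scripted directly: a .fai index is just a tab-separated table whose second column is the sequence length, so cutoffs can be checked without reading the sequences at all. A minimal sketch (the function name and file paths are placeholders of my own):

```python
# Sketch: read sequence lengths straight from a samtools/pyfaidx-style
# .fai index. Each line is tab-separated:
#   name, length, byte offset, bases per line, bytes per line
def lengths_from_fai(fai_path):
    """Return {sequence_name: length} parsed from a .fai index file."""
    lengths = {}
    with open(fai_path) as fh:
        for line in fh:
            fields = line.rstrip('\n').split('\t')
            lengths[fields[0]] = int(fields[1])
    return lengths

# Example: names of sequences at least 200 nt long
# big = [name for name, n in lengths_from_fai('file.fa.fai').items() if n >= 200]
```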

