Binning fasta sequences by size?
4
6.3 years ago
stacy734 ▴ 40

Hi everyone,

I have a large fasta file and need to bin it by size: everything 200 nt and up in one file, and everything 199 or smaller in another. I have found some useful scripts that will remove the smaller sequences, but they are discarded and I need to keep them in a separate file.

Can anyone suggest a perl script (or anything similar) that will sort out sequences into two files by size cutoff?

Thanks in advance for any advice.

Stacy

fasta • 3.2k views

Thanks to you both!


If an answer was helpful you should upvote it; if the answer resolved your question you should mark it as accepted.

4
6.3 years ago
st.ph.n ★ 2.7k

Linearize FASTA if you have multiple lines of sequences between each header.

#!/usr/bin/env python

import sys

# open the two output files for writing
lt200 = open(sys.argv[2], 'w')
gt200 = open(sys.argv[3], 'w')

with open(sys.argv[1], 'r') as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            seq = next(f).strip()
            if len(seq) >= 200:
                gt200.write(header + '\n' + seq + '\n')
            else:
                lt200.write(header + '\n' + seq + '\n')

lt200.close()
gt200.close()

Save as bin_by_len.py and run as: python bin_by_len.py input.fasta lt200.fasta gt200.fasta
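If you would rather skip the linearization step, here is a sketch of a multi-line-safe variant that accumulates wrapped sequence lines before measuring length (bin_fasta and its cutoff parameter are illustrative names, not part of the script above):

```python
#!/usr/bin/env python
"""Bin FASTA records by length; handles multi-line (wrapped) sequences."""
import sys

def bin_fasta(infile, lt_file, gt_file, cutoff=200):
    with open(infile) as f, open(lt_file, 'w') as lt, open(gt_file, 'w') as gt:
        header, chunks = None, []

        def flush():
            # write the record accumulated so far, if any
            if header is None:
                return
            seq = ''.join(chunks)
            out = gt if len(seq) >= cutoff else lt
            out.write(header + '\n' + seq + '\n')

        for line in f:
            line = line.strip()
            if line.startswith('>'):
                flush()
                header, chunks = line, []
            elif line:
                chunks.append(line)
        flush()  # last record

if __name__ == '__main__':
    bin_fasta(sys.argv[1], sys.argv[2], sys.argv[3])
```

Usage is the same as above: python bin_by_len.py input.fasta lt200.fasta gt200.fasta.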

3
6.3 years ago
GenoMax 141k

Use reformat.sh from the BBMap suite.

reformat.sh in=your_file.fa minlen=200 out=more_than_200.fa
reformat.sh in=your_file.fa maxlen=199 out=less_than_200.fa

(minlen and maxlen are inclusive, so maxlen=199 keeps the second file to 199 nt and under; maxlen=200 would put 200-nt sequences in both files.)
2
6.3 years ago

Linearize and dispatch with awk:

rm -f file1.fa file2.fa
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa   |\
awk -F '\t' '{out=length($2)<200?"file1.fa":"file2.fa";printf("%s\n%s\n",$1,$2) >> out;}'

(The first awk keeps the ">" in the header field, so the second awk prints $1 as-is rather than prepending another ">".)
1
6.3 years ago

The faidx utility in pyfaidx has this capability built-in:

$ pip install pyfaidx
$ faidx --size-range 1,199 file.fa > smalls.fa
$ faidx --size-range 200,200000000 file.fa > bigs.fa

An advantage of this approach is that the sequence sizes are read from the .fai index file each time instead of being recomputed, so for lots of bins this will be much faster.
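For scripted use, the .fai index itself is just a tab-separated table (name, length, byte offset, bases per line, bytes per line), so the lengths can be pulled out directly without any FASTA parsing. A minimal sketch (read_fai_lengths is a hypothetical helper name, and "file.fa.fai" a placeholder path):

```python
def read_fai_lengths(fai_path):
    """Return {sequence_name: length} from a samtools/pyfaidx .fai index."""
    lengths = {}
    with open(fai_path) as fh:
        for line in fh:
            # .fai columns: name, length, offset, linebases, linewidth
            fields = line.rstrip('\n').split('\t')
            lengths[fields[0]] = int(fields[1])
    return lengths
```

You could then decide each record's bin with a dict lookup, e.g. lengths[name] >= 200, instead of re-reading the sequences.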

