Question

Command-line tool to split a genome FASTA into equal chunks?

5.6 years ago

gtrwst9 • 0

Say I have the file for accession LS483306.1 which is one big sequence starting with

>LS483306.1 xyz
AGCT...

and want to have one file with N sequences of size X, looking like this (example X=2000):

>LS483306.1:1-2000 xyz
AGCT...
>LS483306.1:2001-4000 xyz
GCTA...
>LS483306.1:4001-6000 xyz
CTGA...

and so on.

Is there a ready-made command-line tool for this? Which? I could write a BioPython script but I would like something faster.

software conversion fasta • 1.7k views

updated 4.5 years ago by Biostar 20 • written 5.6 years ago by gtrwst9 • 0

"BioPython script but I would like something faster"

Then write it in plain Python.
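
For what it's worth, a plain-Python version could look roughly like the sketch below. It assumes the whole sequence fits in memory (fine for a bacterial genome) and that the header has the form "ID description". The function names (read_fasta, split_record, split_fasta) and the 60-column line wrap are just illustrative choices, not taken from any existing tool.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None and line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_record(header, seq, size):
    """Yield (new_header, chunk) pairs with 1-based inclusive coordinates."""
    seq_id, _, desc = header.partition(" ")
    for start in range(0, len(seq), size):
        chunk = seq[start:start + size]  # the last chunk may be shorter
        new_header = f"{seq_id}:{start + 1}-{start + len(chunk)}"
        if desc:
            new_header += f" {desc}"
        yield new_header, chunk


def split_fasta(in_handle, out_handle, size, width=60):
    """Write every record of in_handle to out_handle in chunks of `size` bases."""
    for header, seq in read_fasta(in_handle):
        for new_header, chunk in split_record(header, seq, size):
            out_handle.write(f">{new_header}\n")
            # Wrap sequence lines at `width` columns (an arbitrary choice).
            for i in range(0, len(chunk), width):
                out_handle.write(chunk[i:i + width] + "\n")
```

To use it, open the input and output files and call something like split_fasta(infile, outfile, size=2000). The last chunk of each sequence is simply whatever is left over, so it can be shorter than X.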

5.6 years ago by piet ★ 1.8k

score 3 · Accepted Answer · 2018-09-08

Use shred.sh from the BBMap suite.

Usage:  shred.sh in=<file> out=<file> length=<number> minlength=<number> overlap=<number>


in=<file>     Input sequences.
out=<file>    Destination of output shreds.
length=500    Desired length of shreds.
minlength=1   Shortest allowed shred.  The last shred of each input sequence may be shorter than desired length.
overlap=0     Amount of overlap between successive reads.
reads=-1      If nonnegative, stop after this many input sequences.
equal=f       Shred each sequence into subsequences of equal size of at most 'length', instead of a fixed size.
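
To make the options above more concrete, here is a small Python sketch of how I read length, minlength, overlap and equal in terms of shred coordinates. This is only one interpretation of the help text, not BBMap's actual code. The function names are made up, and the real tool may place boundaries or name the shreds differently.

```python
import math


def fixed_shreds(seq_len, length, minlength=1, overlap=0):
    """1-based inclusive (start, end) intervals for fixed-size shreds.

    Assumption: successive shreds start `length - overlap` bases apart, and a
    final shred shorter than `minlength` is dropped.
    """
    if overlap >= length:
        raise ValueError("overlap must be smaller than length")
    step = length - overlap
    intervals = []
    start = 0
    while start < seq_len:
        end = min(start + length, seq_len)
        if end - start >= minlength:
            intervals.append((start + 1, end))
        if end == seq_len:
            break
        start += step
    return intervals


def equal_shreds(seq_len, length):
    """Near-equal intervals, each at most `length` long (one reading of equal=t)."""
    if seq_len <= 0:
        return []
    n = math.ceil(seq_len / length)  # fewest pieces that respect the cap
    bounds = [i * seq_len // n for i in range(n + 1)]
    return [(bounds[i] + 1, bounds[i + 1]) for i in range(n)]
```

Under this reading, a 5000 bp sequence with length=2000 gives shreds of 2000, 2000 and 1000 bp in fixed mode, and three shreds of about 1667 bp in equal mode.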