Question: How To Split A Multiple Fasta
14

How can I split a multi-FASTA file into separate files of roughly equal, user-specified size? Is there a tool for that? The tool must not split an individual FASTA entry across files.

Gvj

ADD COMMENTlink 9.5 years ago Gvj • 440 • updated 3.2 years ago Brian Bushnell 16k
14

I suggest the fastasplitn command. It works like a charm for me. pyfasta has similar functionality:

# split a fasta file into 6 new files of relatively even size:
pyfasta split -n 6 original.fasta
ADD COMMENTlink 9.5 years ago Aaronquinlan 11k • updated 15 months ago RamRS 21k
8

The most efficient one I know of is a GenomeThreader tool (EDIT: it is actually part of GenomeTools):

If you want to constrain the number of output files (here 60):

gt splitfasta -numfiles 60 seqs.fasta

If you want to constrain the size in MB (here 20) of each output file:

gt splitfasta -targetsize 20 seqs.fasta
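
If GenomeTools is not installed, the same idea (cap each output file at a target size, but only break at record boundaries) can be sketched in a few lines of Python. This is only an illustration and not how gt splitfasta is implemented; the function names, the byte-based target and the prefix.N output naming are my own assumptions.

```python
def read_records(path):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)


def split_by_size(path, target_bytes, prefix):
    """Write records to prefix.1, prefix.2, ... and start a new file when
    adding the next record would push the current file past target_bytes.
    A record is never cut, so a single oversized record gets its own file."""
    outputs = []
    out = None
    written = 0
    for record in read_records(path):
        size = len(record.encode())
        if out is None or (written > 0 and written + size > target_bytes):
            if out is not None:
                out.close()
            name = f"{prefix}.{len(outputs) + 1}"
            outputs.append(name)
            out = open(name, "w")
            written = 0
        out.write(record)
        written += size
    if out is not None:
        out.close()
    return outputs
```

For example, split_by_size("seqs.fasta", 20 * 1024 * 1024, "seqs.part") would aim for files of about 20 MB each.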
ADD COMMENTlink 8.1 years ago Manu Prestat 3.9k • updated 15 months ago RamRS 21k
0

The gt program is actually part of the GenomeTools library. The GenomeThreader tool (gth on the command line) is developed by the same group but is used for spliced alignment.

ADD REPLYlink 8.1 years ago
Daniel Standage
3.9k
0

You are right ;-) I have edited the answer.

ADD REPLYlink 8.1 years ago
Manu Prestat
3.9k
0

Six years later: thanks Manu, that one is fast!

ADD REPLYlink 2.0 years ago
jnesme
• 0
7

There is an alternative to Brent's pyfasta: faSplit, from Kent's src.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

   faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files

   faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks

   faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.

   faSplit byname scaffolds.fa outRoot/
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

   faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each,
at gap boundaries if possible.  If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.
ADD COMMENTlink 9.5 years ago Haibao Tang 3.0k • updated 15 months ago RamRS 21k
0

How can I get faSplit to run? I have downloaded the jksrc folder and I can see ksrc/src/utils/faSplit, but when I run make, it doesn't produce any executable.

ADD REPLYlink 9.5 years ago
Gvj
• 440
2

If you are on a Linux machine, the precompiled binaries can be found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

ADD REPLYlink 5.9 years ago
Vivek
♦ 2.3k
0

Make sure that it compiled successfully; the binaries can then be found in ${HOME}/bin/${MACHTYPE}/. Make sure that folder exists before you compile (see the README in the package).

ADD REPLYlink 9.5 years ago
Haibao Tang
3.0k
7

I actually wrote one of these yesterday. Here's the code. It's rough, but hopefully understandable - it uses BioPerl. The idea is that you'll specify how many sequences you want per file e.g. ./splitfasta.pl allseqs.fa splitseqs 100 - it'll create splitseqs.1,2,3 etc each with 100 sequences in them.

#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;

# usage: ./splitfasta.pl <input fasta> <output prefix> <sequences per file>
my $from = shift;
my $toprefix = shift;
my $seqs = shift;

my $in  = Bio::SeqIO->new(-file => $from);

my $count = 0;
my $fcount = 1;
my $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format=>'fasta');
while (my $seq = $in->next_seq) {
        # start a new output file after every $seqs sequences, but not before the first one
        if ($count > 0 && $count % $seqs == 0) {
                $fcount++;
                $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format=>'fasta');
        }
        $out->write_seq($seq);
        $count++;
}
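
For anyone without BioPerl, the same "N sequences per file" idea can be sketched in plain Python. This is an assumption-laden illustration rather than a port of the script above: the function names are mine, and it treats every line starting with ">" as the start of a record.

```python
def read_records(path):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)


def split_by_count(path, per_file, prefix):
    """Write per_file records to each of prefix.1, prefix.2, ...
    The last file holds whatever is left over."""
    outputs = []
    out = None
    for index, record in enumerate(read_records(path)):
        # Open the next output file at records 0, per_file, 2 * per_file, ...
        if index % per_file == 0:
            if out is not None:
                out.close()
            name = f"{prefix}.{index // per_file + 1}"
            outputs.append(name)
            out = open(name, "w")
        out.write(record)
    if out is not None:
        out.close()
    return outputs
```

Calling split_by_count("allseqs.fa", 100, "splitseqs") mirrors the command-line example above.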
ADD COMMENTlink 9.5 years ago Rob • 140 • updated 15 months ago RamRS 21k
0

To get the first file populated, the condition needs a $count > 0 guard:

if ($count > 0 && $count % $seqs == 0) {
ADD REPLYlink 2.7 years ago
Stephane Plaisance
• 380
• updated 15 months ago
RamRS
21k
5

You can also use a dodgy little bash script in a pinch e.g.:

csplit -z myfile.fas '/>/' '{*}'

http://41j.com/blog/2011/01/split-fasta-file-into-files-with-one-contig-per-file/

ADD COMMENTlink 5.0 years ago new • 50 • updated 15 months ago RamRS 21k
4

You could just use csplit in Linux. You can specify how large the files are and use a simple regex to specify what should be at the start of each new file ('>').

http://www.computerhope.com/unix/usplit.htm

ADD COMMENTlink 8.1 years ago Kieren Lythgow • 110
1

I knew about this one, and it can be very convenient when I want to do this on a workstation other than mine, where genometools is not installed. But I never figured out how to get more than one sequence per output file... Can you show an example?

ADD REPLYlink 8.1 years ago
Manu Prestat
3.9k
0

I can show an example. These commands worked for me: csplit panTro5.fa /\>chr.*/ {*} followed by for a in x*; do echo $a; mv $a $(head -1 $a).txt; done; As you said, very convenient on machines other than yours.
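
The csplit-then-rename approach (one file per record, named after its header) can also be sketched in Python for machines where shell quoting gets awkward. This is a rough sketch under my own assumptions: the function name is hypothetical, the file name is taken from the first word of the header with unsafe characters replaced by "_", and records that share an ID will overwrite each other.

```python
import os
import re


def split_per_record(path, out_dir):
    """Write each FASTA record to its own file, out_dir/<record id>.fa."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                words = line[1:].split()
                record_id = words[0] if words else "unnamed"
                # Keep only characters that are safe in a file name.
                safe = re.sub(r"[^A-Za-z0-9._-]", "_", record_id)
                name = os.path.join(out_dir, safe + ".fa")
                written.append(name)
                out = open(name, "w")
            # Lines before the first header are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```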

ADD REPLYlink 3.0 years ago
Biomonika (Noolean)
3.1k
3

fastasplit in the exonerate package's utilities (bottom of the page) does exactly this.

ADD COMMENTlink 9.5 years ago hurfdurf • 460 • updated 15 months ago RamRS 21k
3

I use fasta_splitter for this purpose. It's good too!

ADD COMMENTlink 5.4 years ago Prakki Rama ♦ 2.3k
2

Another option, from the BBMap package:

partition.sh in=file.fasta out=part%.fasta ways=5

This is multithreaded and very fast. It works on fastq also.

ADD COMMENTlink 3.2 years ago Brian Bushnell 16k
1
#!/usr/bin/perl -w
my $usage= <<EOF;

Split a FASTA file into num files in zigzag order.
usage: perl $0 x.fa num
Warning: output is appended to hehe.1 ... hehe.num, so run this in a new, empty directory.
                                                          Du Kang 2017-1-17

EOF
# read sequences into %seq, keyed by the first word of each header
open SEQ, '<', $ARGV[0] or die $usage;
while (<SEQ>) {
        chomp;
        if (/^>/) {
                s/^>//;
                @_=split;
                $name=$_[0];
        }else{
                $seq{$name}.=$_;
        }
}
close SEQ;

#rank according seq length
foreach $name (keys %seq){
        $length{$name}=length $seq{$name};
}

@id=sort {$length{$a} <=> $length{$b}} keys %length;

#zigzag out each seq
$filenum=$ARGV[1] or die $usage;
$n=0;
$flag=0;
foreach $name (@id) {
        if ($n>=$flag) {
                if ($n==$filenum) {
                        $flag=$flag+2;
                }else{
                        $flag=$n;
                        $n++;
                }
        }else{
                if ($n==1) {
                        $flag=$flag-2;
                }else{
                        $flag=$n;
                        $n--;
                }
        }
        open OUT, ">>hehe.$n" or die $!;
        print OUT ">$name\n$seq{$name}\n";
        close OUT;
}

It ranks the sequences by length, then dispatches them in zigzag order so that the resulting files are almost even in size.
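
To make the zigzag order concrete, here is a small Python sketch of just the assignment step. It is my own restatement of the idea, not a translation of the Perl above: the function name and the dict-of-lengths input are assumptions, and bins are numbered from 0 rather than 1.

```python
def zigzag_bins(lengths, n_bins):
    """Sort sequence names by length, then deal them out in snake order:
    bins 0, 1, ..., n-1, then n-1, ..., 1, 0, then 0, 1, ... again."""
    order = sorted(lengths, key=lengths.get)
    bins = [[] for _ in range(n_bins)]
    for i, name in enumerate(order):
        pos = i % (2 * n_bins)
        # First half of each cycle walks forward, second half walks back.
        idx = pos if pos < n_bins else 2 * n_bins - 1 - pos
        bins[idx].append(name)
    return bins
```

With six sequences of lengths 1 to 6 and three bins, the shortest is paired with the longest, the second shortest with the second longest, and so on, so every bin ends up with a total length of 7.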

ADD COMMENTlink 3.2 years ago dukecomeback • 40 • updated 15 months ago RamRS 21k
0
/**
 * This tool chops a FASTA file into a given number of parts, each holding roughly the same number of sequences.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.lang3.StringUtils;

//import java.util.List;

/**
 * @author Arpit
 * 
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        System.out.println(outFileName);
        // Let a read failure propagate instead of continuing with a null buffer.
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));

        String allLines = new String(allBytes, StandardCharsets.UTF_8);
        // Mark each record start with '~' so the text can be split into records.
        // This assumes '~' never occurs in the file itself.
        String cheatString = allLines.replace(">", "~>");
        String[] splitLines = StringUtils.split(cheatString, "~");
        int startIndex = 0;
        int stopIndex = 0;

        for (int j = 0; j < numOfFiles; j++) {
            if (j == (numOfFiles - 1)) {
                // The last file takes whatever records remain.
                stopIndex = splitLines.length;
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            // try-with-resources closes the writer even if a write fails.
            try (FileWriter fw = new FileWriter(outFileName.concat("_")
                    .concat(Integer.toString(j)).concat(".fasta"))) {
                for (int i = startIndex; i < stopIndex; i++) {
                    fw.write(splitLines[i]);
                }
            }
            startIndex = stopIndex;
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta",5);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}
ADD COMMENTlink 5.5 years ago arpit.singh1203 • 60 • updated 15 months ago RamRS 21k
