Question

How to split a FASTA file into separate files by chromosome (as given in the header)

3

8.2 years ago

ethanagbaker ▴ 30

I have 1000 FASTA files containing simulated reads, and I want to split each of them into separate files (one per chromosome) for further analysis. The header lines look like this: >chr1:startpos:endpos

I wrote this code (https://gist.github.com/ethanagbaker/6e40c58127b7ca8b9242) in Python, and it works, but it is very slow (i.e., it has been running for more than 24 hours on a cluster). This seems like it should be a really quick thing to do. Is there a better way to do this?

Thanks!

fasta split python genome sequence • 17k views

ADD COMMENT • link updated 8.2 years ago by Matt Shirley 10k • written 8.2 years ago by ethanagbaker ▴ 30

2

Maybe you'd like to try the solutions in these posts: "How To Split A Multiple Fasta" and "how to convert a long fasta-file into many separate single fasta sequences".

ADD REPLY • link 8.2 years ago by iraun 6.2k

1

Try faSplit from Jim Kent at UCSC. The link is for the Linux binary, but an OS X version is available, along with the source.

ADD REPLY • link 8.2 years ago by GenoMax 141k

3

8.2 years ago

Andrzej Zielezinski 11k

There are three reasons why your Python code runs so slowly:

1. You unnecessarily open and close each output file millions of times. Opening and closing files is expensive. Most likely, you'll reduce the running time greatly if you open all the files for writing at the beginning. I suggest keeping these files in a dictionary keyed by file name and chromosome: d = {(name, chrom): open(pwd + name + "_" + chrom + ".fa", 'w'), ...}. Then, in the for loop that iterates over the FASTA records, look up the handle with output = d[(name, chrom)] and write the record to that file with output.write(record.format('fasta')).
2. You're calling SeqIO.write(...) many times instead of once. Instead, use output.write(record.format('fasta')).
3. Parsing FASTA with Biopython checks for some input errors, and this takes time. It does more than you need in your case.
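The points above can be sketched without Biopython. This is only a minimal illustration, not a drop-in replacement for the gist: the function name split_fasta_by_chrom, the out_dir argument, and the "<input name>_<chromosome>.fa" naming are assumptions made for the example. It also assumes every header has the form >chr1:startpos:endpos and that the number of distinct chromosomes is small enough to keep one file handle open for each.

```python
import os
from contextlib import ExitStack


def split_fasta_by_chrom(fasta_path, out_dir):
    """Split one FASTA file into one output file per chromosome.

    Assumes headers look like '>chr1:startpos:endpos', so the chromosome
    is the text between '>' and the first ':'.
    Returns the sorted list of chromosome names that were seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(fasta_path))[0]
    handles = {}  # chromosome name -> output file, opened only once
    out = None    # handle for the record currently being copied
    # ExitStack closes every output file when the block exits.
    with ExitStack() as stack, open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                chrom = line[1:].split(":", 1)[0].strip()
                out = handles.get(chrom)
                if out is None:
                    path = os.path.join(out_dir, f"{base}_{chrom}.fa")
                    out = stack.enter_context(open(path, "w"))
                    handles[chrom] = out
            if out is not None:
                # Header and sequence lines go to the current chromosome's file.
                out.write(line)
    return sorted(handles)
```

To process all 1000 inputs, one could loop over glob.glob("reads*.fasta") and call the function once per file. Each input is read in a single pass, and each output file is opened once per input rather than once per record.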

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by Andrzej Zielezinski 11k

0

8.2 years ago

novice ★ 1.1k

Hi Ethan,

Try this:

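#!/usr/bin/perl
use strict;
use warnings;

my $fh;
while (<>) {
  # On a header line, (re)open the file named after the chromosome
  # (the text between '>' and the first ':') in append mode.
  if (/\A>(.+?):/) {
    open $fh, '>>', $1 or die "Cannot open $1: $!";
  }
  # Copy header and sequence lines to the current chromosome's file.
  print {$fh} $_ if $fh;
}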

Use it with a glob to include all your read files, like:

$ perl chr_split.pl reads*.fasta

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by novice ★ 1.1k

Ram · Accepted Answer · 2016-01-23

11

8.2 years ago

Matt Shirley 10k

$ pip install pyfaidx
$ faidx -x sequences.fa

Or you can use the Fasta class and write your own script to do the same thing. Your code is slow because it is opening a bunch of files in a loop, and then opening (the same files?) and reading them completely to get the sequences. Indexed file access will be orders of magnitude faster.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by Matt Shirley 10k