Splitting a fasta file based on the number of headers.

I have a fasta file that has more than 2.7 million headers. I want to break it into chunks.

>gene1
ACTG...

>gene2
ATTT...

...

>gene2,700,000
GCAC...

The way I do it is:

grep -n "^>" my.fasta > headersofmy.fasta

This gives me the positional information of the headers.

1:>gene1
4:>gene2
11:>gene3
...
n:>gene2,700,000

I then use the positional information to grab a set number of genes:

awk 'NR>=position1 && NR<=position2' my.fasta > set1.fasta
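
For example, with made-up positions (say the first 500,000 records end at line 3,000,000; these numbers are illustrative, not from my actual file):

awk 'NR>=1 && NR<=3000000' my.fasta > set1.fasta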

I do this a couple of times to break my initial huge fasta file into smaller files, each with a set number of headers.

I first broke it into chunks of 500,000 headers, then into chunks of 100,000.

I feel there is a smarter way to do this if I want to break it into even smaller chunks based on the number of headers. I've seen other ways to split a fasta file, but they split based on file size or k-mer size.

Any suggestions on how to approach this?

bash python fasta

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.


Hello sicat.paolo20,

There are multiple answers posted below. If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted. You can accept more than one answer if they work.


Sorry, I was occupied with another issue and forgot to check my account again.


Hello,

seqkit split has these two options for it:

-p, --by-part int        split sequences into N parts
-s, --by-size int        split sequences into multi parts with N sequences
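
For example, to split the file into chunks of 100,000 sequences each (by default seqkit writes the pieces into a my.fasta.split/ directory):

seqkit split -s 100000 my.fasta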

fin swimmer


Sorry for the late reply. Thanks for this; it worked for me. It just took a while to install it.


Hello again,

have a look at conda. I described how to use it in the first part of this tutorial. After doing it, you won't have any install issues anymore ;)
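
For example, seqkit is packaged in the bioconda channel, so installing it is a one-liner:

conda install -c bioconda seqkit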

fin swimmer


If your fasta file is line 1 header, line 2 sequence, line 3 header...

You can use the Unix split command. If you want 100,000 sequences in each sub-file (1 sequence = 2 lines):

split -l 200000 filename

Note that this will not work if your sequences span multiple lines.
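
If they do, one common workaround (my sketch, not part of the original answer) is to collapse each record onto a single line first, then split:

awk '/^>/ { if (seq) print seq; print; seq=""; next } { seq = seq $0 } END { if (seq) print seq }' my.fasta > linear.fasta
split -l 200000 linear.fasta chunk_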


Thanks for the idea. I forgot that I could simply do this to the fasta file (convert each sequence into a single line). Cheers!


use faSplit from kentUtils (https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64 for 64-bit Linux binaries)
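
A sketch of its sequence mode, which splits by record count (the chunk count and output prefix are illustrative; 27 chunks of a 2.7 million record file gives roughly 100,000 records each):

faSplit sequence my.fasta 27 chunk

This produces numbered chunk*.fa output files.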


OP: Remember to chmod a+x faSplit after you download the binary.


Thanks. I haven't tried this yet but will look into it. Cheers!


The following will split a multiline FASTA file into chunks of SPLIT records each, by treating > as the awk record separator:

$ export SPLIT=1234
$ awk -v s="${SPLIT}" 'BEGIN { RS=">"; k=0; i=0; ks=sprintf("%06d",k) }
  { if (length($0)>0) { printf("%d,%s\n",i,$1); printf(">%s",$0) >> "split."ks".fa"; i++ }
    if (i==s) { close("split."ks".fa"); i=0; k++; ks=sprintf("%06d",k) } }' seqs.fa

The output files are written to split.000000.fa, split.000001.fa, split.000002.fa, etc.

Each file contains SPLIT records, except for the last file, which can contain from 1 to SPLIT records, depending on whether SPLIT divides the number of records in seqs.fa evenly.


Thanks for the bash solution. I have been starting to like using bash to solve issues like this, though I think it will take a while for me to understand your one-liner.

