Splitting a fasta file based on the number of headers.

I have a fasta file that has more than 2.7 million headers. I want to break it into chunks.

>gene1
ACTG...

>gene2
ATTT...

...

>gene2,700,000
GCAC...

The way I do it is:

grep -n "^>" my.fasta > headersofmy.fasta

This gives me the positional information of the headers.

1:>gene1
4:>gene2
11:>gene3
...
n:>gene2,700,000

I then use the positional information to grab a set number of genes:

awk 'NR>=position1 && NR<=position2' my.fasta > set1.fasta
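
For example, with made-up positions (say the first 500,000 records end at line 3,000,000; these numbers are illustrative, not from my actual file):

awk 'NR>=1 && NR<=3000000' my.fasta > set1.fasta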

I do this a couple of times to break my initial huge fasta file into smaller files, each with a set number of headers.

I first broke it into chunks of 500,000 headers, then into chunks of 100,000.

I feel there is a smarter way to do this if I want to break it into even smaller chunks based on the number of headers. I've seen other ways to split a fasta file, but they split based on file size or k-mer size.

Any suggestions on how to approach this?

bash python fasta

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.


Hello sicat.paolo20,

There are multiple answers posted below. If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted. You can accept more than one answer if they work.


Sorry, I was occupied with another issue and forgot to check my account again.


Hello,

seqkit split has these two options for it:

-p, --by-part int        split sequences into N parts
-s, --by-size int        split sequences into multi parts with N sequences
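
For example, to split the file into chunks of 100,000 sequences each (by default seqkit writes the pieces into a my.fasta.split/ directory):

seqkit split -s 100000 my.fasta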

fin swimmer


Sorry for the late reply. Thanks for this; it worked for me. It just took a while to install it.


Hello again,

have a look at conda. I described how to use it in the first part of this tutorial. After doing it, you won't have any install issues anymore ;)
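
For example, seqkit is packaged in the bioconda channel, so installing it is a one-liner:

conda install -c bioconda seqkit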

fin swimmer


If your fasta file is line 1 header, line 2 sequence, line 3 header...

You can use the Unix split command. If you want 100,000 sequences in each sub-file (1 sequence = 2 lines):

split -l 200000 filename

Note that this will not work if your sequences span multiple lines.
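
If they do, one common workaround (my sketch, not part of the original answer) is to collapse each record onto a single line first, then split:

awk '/^>/ { if (seq) print seq; print; seq=""; next } { seq = seq $0 } END { if (seq) print seq }' my.fasta > linear.fasta
split -l 200000 linear.fasta chunk_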


Thanks for the idea. I forgot that I could simply do this to the fasta file (convert each sequence into a single line). Cheers!


use faSplit from kentUtils (https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64 for 64-bit Linux binaries)
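
A sketch of its sequence mode, which splits by record count (the chunk count and output prefix are illustrative; 27 chunks of a 2.7 million record file gives roughly 100,000 records each):

faSplit sequence my.fasta 27 chunk

This produces numbered chunk*.fa output files.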


OP: Remember to chmod a+x faSplit after you download the binary.


Thanks. I haven't tried this yet but will look into it. Cheers!


The following will split a multiline FASTA file into chunks of SPLIT records each, by treating > as the awk record separator:

$ export SPLIT=1234
$ awk -v s="${SPLIT}" 'BEGIN { RS=">"; k=0; i=0; ks=sprintf("%06d",k) }
  { if (length($0)>0) { printf("%d,%s\n",i,$1); printf(">%s",$0) >> "split."ks".fa"; i++ }
    if (i==s) { close("split."ks".fa"); i=0; k++; ks=sprintf("%06d",k) } }' seqs.fa

The output files are written to split.000000.fa, split.000001.fa, split.000002.fa, etc.

Each file contains SPLIT records, except for the last file, which can contain from 1 to SPLIT records, depending on whether SPLIT divides the number of records in seqs.fa evenly.


Thanks for the bash solution. I have been starting to like using bash to solve issues like this, though I think it will take a while for me to understand your one-liner.

