Randomize the order of sequences in a FASTA file
8.7 years ago
aberry814 ▴ 80

I am looking for a simple script or tool that will randomize the order of sequences in a very large FASTA file (8 GB). Note that I do not want to touch the sequence data within any given record, only the order of entire sequences.

After randomizing, I would also like something like "head" that takes the top X sequences from the randomized file.

I will be comparing the full randomized file against random subsets of it, so while a single script that accomplishes both tasks seems practical, two separate scripts would be more useful and are probably easier to write.

Thanks for the help


Something like this?

$ grep '^>' file.faa | sed 's/^>//; s/ .*//' > ids.txt
#the number of ids equals the number of sequences, N; the sed strips the ">"
#and any description, since faSomeRecords expects bare record names
$ shuf ids.txt > random_ids.txt
#use faSomeRecords to extract the sequences (usage: in.fa listFile out.fa)
$ ./faSomeRecords file.faa random_ids.txt final_sequences.faa
#to keep only the top X sequences, run `head -n X` on random_ids.txt / directly give `-n X` to the `shuf` command

faSomeRecords is one of the UCSC Kent command-line utilities.
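
If you'd rather skip the intermediate ID files, here is a one-pass sketch of the same idea using only standard tools (it assumes GNU shuf, holds the whole file in memory just like plain shuf, and unwraps multi-line sequences; the sequence data itself is untouched):

#flatten each record to one tab-separated line so shuf shuffles whole records,
#keep the top 1000 (an example X; drop the head stage to shuffle everything),
#then restore the FASTA layout
$ awk '/^>/ {printf("%s%s\t", NR>1 ? "\n" : "", $0); next} {printf("%s", $0)} END {print ""}' file.faa | shuf | head -n 1000 | tr '\t' '\n' > top1000_shuffled.faa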


Alternatively, for the headers file:

grep "^>" file.fasta | sort -R > headers.txt

shuf is probably faster, though.

Neither shuf nor sort -R works on OS X, mind you; not without installing GNU coreutils anyway.


GNU sort -R orders lines by a random hash of their contents, so identical headers end up grouped together; that's why I thought shuf would be more effective at randomizing.


The comment by 5heikki about it not working refers to the command not being available on a Mac, rather than to one being more effective than the other. In place of shuf, you can do this on a Mac:

perl -MList::Util=shuffle -e 'print shuffle <>'
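
For example, plugging it into the header pipeline from the answer above (the file name and the 1000 are placeholders):

grep '^>' file.faa | sed 's/^>//; s/ .*//' | perl -MList::Util=shuffle -e 'print shuffle <>' | head -n 1000 > random_ids.txt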

I'm not going to comment further because I think this question has been thoroughly answered already, especially in the links I provided in my other comment.


I like this, thanks. Simple and easily modified to fit the specific situation.


If you search the forum, you will find a number of helpful posts (look in the right-hand column for related threads), e.g., Resampling Fastq Sequences Without Replacement and Choosing Random Set Of Seqs From Larger Set. Just FYI, adding to existing discussions is preferred over creating duplicate posts, because that makes it easier to find answers.

8.7 years ago
thackl ★ 3.0k

Have a look at seq-shuf; it's a simple Perl script built on bash, sed, and shuffle. It is reasonably fast, handles files of essentially arbitrary size, and also reads from stdin.

# FASTA
cat A.fa [B.fa ..] | seq-shuf > AB-shuf.fa
# FASTQ
cat A.fq [B.fq ..] | seq-shuf > AB-shuf.fq
# FASTQ paired end
interleave A_[12].fq .. | seq-shuf --pe | interleaved-split 1> A-shuf_1.fq 2> A-shuf_2.fq

EDIT:

| file   | desc                                        |
|--------+---------------------------------------------|
| sr.fq  | short reads of 100 bp length in FASTQ       |
| gen.fa | generic FASTA sequences, 100-10k bp, N50 2k |
| asm.fa | long FASTA sequences, 100-300k, N50 100k    |

|        |          | 200 MB   |          | 2 GB     |          | 200 GB   |          |
|        |          | time [s] | mem [GB] | time [s] | mem [GB] | time [s] | mem [GB] |
|--------+----------+----------+----------+----------+----------+----------+----------|
| sr.fq  | seq-shuf |      1.5 |      0.1 |       21 |      0.1 |     2199 |      0.1 |
|        | bbmap    |        3 |      0.6 |       40 |      4.2 |        - |        - |
| gen.fa | seq-shuf |        1 |      0.1 |        9 |      0.1 |     1360 |      0.1 |
|        | bbmap    |      1.5 |      0.4 |       13 |      4.0 |        - |        - |
| asm.fa | seq-shuf |      0.3 |      0.1 |        5 |      0.1 |      876 |      0.1 |
|        | bbmap    |      1.3 |      0.4 |       14 |      3.8 |        - |        - |

Don't get me wrong, the bbmap package is great, I actually use it a lot, thanks @Brian. But I got curious about the performance of my simple script; I did put quite some thought into it, after all. And as you can see above, it performs quite well: it's faster than bbmap and at the same time has a fixed memory footprint. In fact, it can handle entire NGS read files in reasonable time with very little RAM.


That's pretty impressive speed! Clearly I need to up my game :) Though I doubt I can beat the performance of the Linux built-ins.


Well, the difference in speed isn't that great (less than 2x), and bbmap offers more flexibility, etc., so I think you are fine :). And I have to admit I got lucky there: my main focus was the memory footprint. I use the built-ins mostly out of convenience/laziness and was just hoping they'd be fast.

8.7 years ago

The BBMap package has a pair of programs that will do this:

shuffle.sh in=file.fa out=shuffled.fa
reformat.sh in=shuffled.fa out=topTen.fa reads=10

...which gives you the first 10 sequences of the randomized file. But shuffle loads the entire file into memory, so for very large files it is easier to subsample like this:

reformat.sh in=file.fa out=sampled.fa samplereadstarget=10

...which gives you 10 random sequences, or

reformat.sh in=file.fa out=sampled.fa samplerate=0.01

...which gives you 1% of the sequences at random. Neither of those loads the file into memory. Reformat also has a samplebasestarget flag if you want a specific number of bases rather than a number of sequences.
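
For instance, a sketch using that last flag (the flag name comes from the description above; the target value is just an example):

reformat.sh in=file.fa out=sampled.fa samplebasestarget=50000000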


BBMap looks great! I'm currently having trouble with the makefile, but this looks like it will handily cover everything I need.


BBMap is already compiled and runs in Java; you don't need to do a make. Just unzip it and then run the shell scripts.
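
A sketch of the setup (the archive name and version are illustrative):

tar -xzf BBMap_38.90.tar.gz
bbmap/shuffle.sh in=file.fa out=shuffled.fa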


Great hint @Brian, thanks :)

8.7 years ago

Because GNU shuf chokes with memory-exhausted errors on large inputs, I wrote sample, a tool for shuffling and sampling very large generic text files.

This tool can be used to shuffle FASTA files, as well as BED, FASTQ, and other text-based genomic data sets.

Unlike shuf and its alternatives, sample does not load the entire file into memory; instead it finds the positions where lines (or groups of lines) start and samples from those offsets. Memory usage is 8 bytes per offset times the number of records: for an 8 GB FASTA with, say, 10 million sequences, that is only about 80 MB, far less than holding the sequences themselves in memory, and no separate index file is required.

More information about sample is available on Github.

For example, to generate a uniformly random, shuffled sample (without replacement) of size N from a two-line FASTA file (where header and sequence alternate on separate lines) called input.fa, you can run the following command:

$ sample --sample-size=N --lines-per-offset=2 input.fa > sample.fa

Replace N with the desired number of FASTA sequences.

Run sample --help for more options (sampling with replacement, etc.).

8.7 years ago

pyfaidx is a good fit here:

The sequence identifiers and offsets are read from an index file, and the Fasta object does name-based lookups. That means you can get optimal performance (never holding more than a line of sequence in memory) while still using Python functions that operate on dictionary-like objects.
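
A minimal sketch of that pattern (it assumes pyfaidx is installed; the file name and the subset size of 1000 are placeholders): shuffle the record names taken from the index, then write out the first N records.

python3 - input.fa 1000 > sample.fa <<'EOF'
import random, sys
from pyfaidx import Fasta

fasta = Fasta(sys.argv[1])    # builds or reuses the input.fa.fai index
names = list(fasta.keys())    # record IDs come from the index, not the data
random.shuffle(names)         # shuffle the IDs only; sequences stay on disk
for name in names[:int(sys.argv[2])]:
    print(f">{name}")
    print(fasta[name])        # a FastaRecord prints as its full sequence
EOF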

