Randomize the order of sequences in a FASTA file
8.7 years ago
aberry814 ▴ 80

I am looking for a simple script or tool that will randomize the order of sequences in a very large FASTA file (8 GB). Note that I do not want to touch the sequence data within any given record, only the order of entire sequences.

After randomizing, I would also like something like "head" that takes the top X sequences from the randomized file.

I will be comparing the full randomized file against random subsets of it, so while a single script that accomplishes both tasks seems practical, two separate scripts would be more useful and are probably easier to write.

Thanks for the help


Something like this?

$ grep '^>' file.faa | sed 's/^>//; s/ .*//' > ids.txt
#the number of ids equals the number of sequences, N; the sed strips the ">"
#and any description, since faSomeRecords expects bare record names
$ shuf ids.txt > random_ids.txt
#use faSomeRecords to extract the sequences (usage: in.fa listFile out.fa)
$ ./faSomeRecords file.faa random_ids.txt final_sequences.faa
#to keep only the top X sequences, run `head -n X` on random_ids.txt / directly give `-n X` to the `shuf` command

faSomeRecords is one of the UCSC Kent command-line utilities.
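
If you'd rather skip the intermediate ID files, here is a one-pass sketch of the same idea using only standard tools (it assumes GNU shuf, holds the whole file in memory just like plain shuf, and unwraps multi-line sequences; the sequence data itself is untouched):

#flatten each record to one tab-separated line so shuf shuffles whole records,
#keep the top 1000 (an example X; drop the head stage to shuffle everything),
#then restore the FASTA layout
$ awk '/^>/ {printf("%s%s\t", NR>1 ? "\n" : "", $0); next} {printf("%s", $0)} END {print ""}' file.faa | shuf | head -n 1000 | tr '\t' '\n' > top1000_shuffled.faa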


Alternatively, for the headers file:

grep "^>" file.fasta | sort -R > headers.txt

shuf is probably faster, though.

Neither shuf nor sort -R works on OS X, mind you; not without installing GNU coreutils anyway.


GNU sort -R orders lines by a random hash of their contents, so identical headers end up grouped together; that's why I thought shuf would be more effective at randomizing.


The comment by 5heikki about it not working refers to the command not being available on a Mac, rather than to one being more effective than the other. In place of shuf, you can do this on a Mac:

perl -MList::Util=shuffle -e 'print shuffle <>'
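
For example, plugging it into the header pipeline from the answer above (the file name and the 1000 are placeholders):

grep '^>' file.faa | sed 's/^>//; s/ .*//' | perl -MList::Util=shuffle -e 'print shuffle <>' | head -n 1000 > random_ids.txt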

I'm not going to comment further because I think this question has been thoroughly answered already, especially in the links I provided in my other comment.


I like this, thanks. Simple and easily modified to fit the specific situation.


If you search the forum, you will find a number of helpful posts (look in the right-hand column for related threads), e.g., Resampling Fastq Sequences Without Replacement and Choosing Random Set Of Seqs From Larger Set. Just FYI, adding to existing discussions is preferred over creating duplicate posts, because that makes it easier to find answers.

8.7 years ago
thackl ★ 3.0k

Have a look at seq-shuf; it's a simple Perl script built on bash, sed, and shuffle. It is reasonably fast, handles files of essentially arbitrary size, and also reads from stdin.

# FASTA
cat A.fa [B.fa ..] | seq-shuf > AB-shuf.fa
# FASTQ
cat A.fq [B.fq ..] | seq-shuf > AB-shuf.fq
# FASTQ paired end
interleave A_[12].fq .. | seq-shuf --pe | interleaved-split 1> A-shuf_1.fq 2> A-shuf_2.fq

EDIT:

| file   | desc                                        |
|--------+---------------------------------------------|
| sr.fq  | short reads of 100 bp length in FASTQ       |
| gen.fa | generic FASTA sequences, 100-10k bp, N50 2k |
| asm.fa | long FASTA sequences, 100-300k, N50 100k    |

|        |          | 200 MB   |          | 2 GB     |          | 200 GB   |          |
|        |          | time [s] | mem [GB] | time [s] | mem [GB] | time [s] | mem [GB] |
|--------+----------+----------+----------+----------+----------+----------+----------|
| sr.fq  | seq-shuf |      1.5 |      0.1 |       21 |      0.1 |     2199 |      0.1 |
|        | bbmap    |        3 |      0.6 |       40 |      4.2 |        - |        - |
| gen.fa | seq-shuf |        1 |      0.1 |        9 |      0.1 |     1360 |      0.1 |
|        | bbmap    |      1.5 |      0.4 |       13 |      4.0 |        - |        - |
| asm.fa | seq-shuf |      0.3 |      0.1 |        5 |      0.1 |      876 |      0.1 |
|        | bbmap    |      1.3 |      0.4 |       14 |      3.8 |        - |        - |

Don't get me wrong, the bbmap package is great, I actually use it a lot, thanks @Brian. But I got curious about the performance of my simple script; I did put quite some thought into it, after all. And as you can see above, it performs quite well: it's faster than bbmap and at the same time has a fixed memory footprint. In fact, it can handle entire NGS read files in reasonable time with very little RAM.


That's pretty impressive speed! Clearly I need to up my game :) Though I doubt I can beat the performance of the Linux built-ins.


Well, the difference in speed isn't that great (less than 2x), and bbmap offers more flexibility, etc., so I think you are fine :). And I have to admit I got lucky there: my main focus was the memory footprint. I use the built-ins mostly out of convenience/laziness and was just hoping they'd be fast.

8.7 years ago

The BBMap package has a pair of programs that will do this:

shuffle.sh in=file.fa out=shuffled.fa
reformat.sh in=shuffled.fa out=topTen.fa reads=10

...which gives you the first 10 sequences of the randomized file. But shuffle loads the entire file into memory, so for very large files it is easier to subsample like this:

reformat.sh in=file.fa out=sampled.fa samplereadstarget=10

...which gives you 10 random sequences, or

reformat.sh in=file.fa out=sampled.fa samplerate=0.01

...which gives you 1% of the sequences at random. Neither of those loads the file into memory. Reformat also has a samplebasestarget flag if you want a specific number of bases rather than a number of sequences.
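
For instance, a sketch using that last flag (the flag name comes from the description above; the target value is just an example):

reformat.sh in=file.fa out=sampled.fa samplebasestarget=50000000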


BBMap looks great! I'm currently having trouble with the makefile, but this looks like it will handily cover everything I need.


BBMap is already compiled and runs in Java; you don't need to do a make. Just unzip it and then run the shell scripts.
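
A sketch of the setup (the archive name and version are illustrative):

tar -xzf BBMap_38.90.tar.gz
bbmap/shuffle.sh in=file.fa out=shuffled.fa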


Great hint @Brian, thanks :)

8.7 years ago

Because GNU shuf chokes with memory-exhausted errors on large inputs, I wrote sample, a tool for shuffling and sampling very large generic text files.

This tool can be used to shuffle FASTA files, as well as BED, FASTQ, and other text-based genomic data sets.

Unlike shuf and its alternatives, sample does not load the entire file into memory; instead it finds the positions where lines (or groups of lines) start and samples from those offsets. Memory usage is 8 bytes per offset times the number of records: for an 8 GB FASTA with, say, 10 million sequences, that is only about 80 MB, far less than holding the sequences themselves in memory, and no separate index file is required.

More information about sample is available on Github.

For example, to generate a uniformly random, shuffled sample (without replacement) of size N from a two-line FASTA file (where header and sequence alternate on separate lines) called input.fa, you can run the following command:

$ sample --sample-size=N --lines-per-offset=2 input.fa > sample.fa

Replace N with the desired number of FASTA sequences.

Run sample --help for more options (sampling with replacement, etc.).

8.7 years ago

pyfaidx is a good fit here:

The sequence identifiers and offsets are read from an index file, and the Fasta object does name-based lookups. That means you can get optimal performance (never holding more than a line of sequence in memory) while still using Python functions that operate on dictionary-like objects.
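
A minimal sketch of that pattern (it assumes pyfaidx is installed; the file name and the subset size of 1000 are placeholders): shuffle the record names taken from the index, then write out the first N records.

python3 - input.fa 1000 > sample.fa <<'EOF'
import random, sys
from pyfaidx import Fasta

fasta = Fasta(sys.argv[1])    # builds or reuses the input.fa.fai index
names = list(fasta.keys())    # record IDs come from the index, not the data
random.shuffle(names)         # shuffle the IDs only; sequences stay on disk
for name in names[:int(sys.argv[2])]:
    print(f">{name}")
    print(fasta[name])        # a FastaRecord prints as its full sequence
EOF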

