Question

Choosing Random Set Of Seqs From Larger Set

13

Entering edit mode

12.1 years ago

razor ▴ 190

Do you know any quick and dirty solution for choosing a random subset of sequences (with or without replacement) from a larger set in a fasta file?

Something like:

command -size 10 -in lotsofseqs.fasta -out sample.fasta

random fasta • 17k views

ADD COMMENT • link updated 23 months ago by Ram 43k • written 12.1 years ago by razor ▴ 190

Ram · Answer 1 · 2012-03-20

24

Entering edit mode

12.1 years ago

Pierre Lindenbaum 161k

linearize your sequences, generate a random order with shuf and print the 'N' first lines.

cat input.fasta |\
awk '/^>/ { if(i>0) printf("\n"); i++; printf("%s\t",$0); next;} {printf("%s",$0);} END { printf("\n");}' |\
shuf |\
head -n 2 |\
awk '{printf("%s\n%s\n",$1,$2)}'

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 12.1 years ago by Pierre Lindenbaum 161k

4

Entering edit mode

I know someone is going to give this shuf solution. :) A problem with it is that shuf loads all the data into memory, which makes it not suitable for very large files. If we only downsample a small file from large one, my solution is much better. In addition, if the input parameter is a fraction rather than a fixed number, hashing will be even better.

ADD REPLY • link 12.1 years ago by lh3 33k

0

Entering edit mode

shuf uses reservoir sampling (bounded memory) since v8.22

ADD REPLY • link 10.0 years ago by pixelbeat • 0

2

Entering edit mode

I used it for one of my datasets and I've found that to make it more kind of "accurate" it would be better to add BEGIN{FS="\t"} in the last awk:

...

head -n 2 |\
awk 'BEGIN{FS="\t"}{printf("%s\n%s\n",$1,$2)}'

That way if the fasta header is longer, with at least one space, like:

>MC3.MG1.AS1.GP1.C1.G2 OG=COG1564|Taxon=bacteria|StartStop=11|ClusterSize=83
ATGAAAACAGAAAGCAAATATTGTCTCATTGTCACAGGTGGCTCGGTCGATCAGAATTTGCTCGCGACCGTTTATGA

after shuffling, it won't become:

>MC3.MG80.AS1.GP1.C14688.G3
OG=NOG80451|Taxon=bacteria|StartStop=11|ClusterSize=2
>MC3.MG235.AS1.GP1.C49859.G6...

Hope it helps

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by biouser ▴ 30

1

Entering edit mode

shuf is not native to Macs.

ADD REPLY • link 12.1 years ago by Martin A Hansen 3.0k

1

Entering edit mode

maasha, if you want to install them, and use brew, then brew install xz coreutils will get you shuf (defined as gshuf)

ADD REPLY • link 12.1 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

that's it! no dependencies, bio packages, just basic commands. (somehow i never used shuf before.)

ADD REPLY • link 12.1 years ago by razor ▴ 190

0

Entering edit mode

In place of shuf, you can do this on a Mac:

perl -MList::Util -e 'print List::Util::shuffle <>'

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 12.1 years ago by SES 8.6k

0

Entering edit mode

A slight variation of the linearization with awk, explained in more detail: https://riptutorial.com/bioinformatics/example/14692/linearize-a-fasta-sequence-with-awk

DISCLAIMER: This code and its explanation is elaborated by someone else. I just add it as supplementary information to further understand the best answer provided to the question

ADD REPLY • link 23 months ago by JMMM ▴ 10

Ram · Answer 2 · 2012-03-20

7

Entering edit mode

12.1 years ago

lh3 33k

I used to provide an answer at seqanswers.

You need bioawk for it to work. With bioawk, you can:

awk -c fastx -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz

This solution is based on reservoir sampling. It is optimal in terms of both time and space complexity and works with an arbitrarily large input file. Most other solutions need to read all data into memory.

Of course you can also use perl/python to re-implement the awk line. I use bioawk only because it has a built-in fasta/fastq parser.

ADD COMMENT • link updated 23 months ago by Ram 43k • written 12.1 years ago by lh3 33k

1

Entering edit mode

This is without replacement.

ADD REPLY • link 12.1 years ago by lh3 33k

1

Entering edit mode

No, this is not true. Reservoir sampling may seem too simplistic, but it guarantees uniform randomness. You can find the proof via google. It is theoretically the best solution for sampling a fixed number of records without replacement. You even do not need to know the size of reservior.

ADD REPLY • link 12.1 years ago by lh3 33k

1

Entering edit mode

The first time I read about reservoir sampling is when I saw that seqanswers post. By intuition, I think there must be a solution working on a data stream - those who work on endless time series must be faced with such a situation all the time. Nonetheless, I did not know the answer, so I googled and found reservior sampling.

ADD REPLY • link 12.1 years ago by lh3 33k

0

Entering edit mode

The thing about reservoir sampling is that there is 'decreasing probability', once the k first elements have been filled, that a new sequence will be retained, thus making sequences at the end of the file less likely to be taken and sequences from the beginning of the file more likely to be replaced. How close to random selection will the resulting list of sequences really be? I provide a solution that works with gigantic files, at the cost of having to scan the file twice and doing some extra trimming but ensuring that everybody gets equal chances.

ADD REPLY • link 12.1 years ago by Eric Normandeau 11k

0

Entering edit mode

I'll look it up, it is a very interesting solution.

ADD REPLY • link 12.1 years ago by Eric Normandeau 11k

0

Entering edit mode

I'm curious, did you learn about reservoir sampling in a book about algorithms or when searching for a solution to a specific problem you had? (Just trying to decide whether to spend time to read through a book on algo ;)

ADD REPLY • link 12.1 years ago by Eric Normandeau 11k

score 6 · Answer 3 · 2012-03-20

This page has lots of useful sequence utilities, including a script named random_sequence_reads.pl (zip or tarball to download). It does sampling with/without replacement and seems to work, based on my quick test.

Note: the test.sh script samples all sequences from the input file which is confusing and not useful - edit it to change the parameters.

Ram · Answer 4 · 2012-03-20

3

Entering edit mode

12.1 years ago

Martin A Hansen 3.0k

This can be done with Biopieces like this:

read_fasta -i lotsofseqs.fasta | random_records -n 10 | write_fasta -o sample.fasta -x

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 12.1 years ago by Martin A Hansen 3.0k

0

Entering edit mode

will definitely install this later, after installing all the prereqs, external tools, etc.

ADD REPLY • link 12.1 years ago by razor ▴ 190

Ram · Answer 5 · 2012-03-20

Here is something quick and dirty in python:

import sys,random
from Bio import SeqIO

num = int(sys.argv[2])

outArray = [record for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta') if random.randint(0,1)]
while len(outArray)!= num:
    newArray = []
    for record in outArray:
        if len(newArray) == num:
            break
        if random.randint(0,1):
            newArray.append(record)
    outArray = newArray

for record in outArray:
    print ">" + record.id
    print record.seq

Use by:

python myScript.py in.fasta 10 > subset10.fa

It basically iteratively randomly takes a subset (around half) until it gets to the number of sequences you want. So if you have 1000 sequences, it will take randomly around 500 of that, then 250 of that, 125....so on the so forth until it gets to the number if you want.

This script will probably only work well if you have a large amount of sequences (1000+) and the subset you want is well below half of the total amount.

Ram · Answer 6 · 2012-03-23

A little late to the fray, but anyway...

Here's a solution for sampling fasta without replacement.

It uses bash functions, and GNU coreutils' shuf command, but does not suffer the memory requirements of keeping the sequences in memory since it shuffles only the byte offsets into the file where the sequences begin.

function fastaSubsample () {
    ## PURPOSE: choose [HTML] random records from a (presumably
    ## well-formed) fasta file in [HTML].  The output sequences
    ## are in same order as input (due to the `sort -n`)
    n="$1"
    path="$2"
    shift 2
    fastaBreaks ${path} | shuf -n ${n} | sort -n | fastaRecordsAt ${path}
}

function fastaBreaks () {
    ## PURPOSE: output the byte positions of each fasta record on STDIN
    perl -ne 'BEGIN{$told=0}; print($told.qq{\n}) if m/^>/;$told=tell();' "$@"
}

function fastaRecordsAt () {
    ## PURPOSE: extract the fasta records from# [HTML] starting at
    ## each byte offset appearing on STDIN.
    path="$1"
    shift 1
    perl -anse 'BEGIN{open(PATH,$path)}; chomp;  $told = $_; seek(PATH,$told,0); print(scalar(readline(PATH))); while (($l=readline(PATH)) && ($l !~ m/^>/)) {print($l)} ' -- -path=$path
}

Ram · Answer 7 · 2012-06-15

Here is another Python solution, useful for processing multiple file simultaneously (must be in same dir):

import random
import glob
inFiles = glob.glob('*.fas')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))


for i in range(len(inFiles)):

        for k in range(3):
                fNames = []
                fSeqs = []
                outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fasta', 'wt'))
                for line in open(inFiles[i]):
                        if (line[0] == '>'):
                                fNames.append(line)
                        else:
                                fSeqs.append(line)
                curr = (len(outFiles)-1)
                for j in range(num):
                        a = random.randint(0, (len(fNames)-1))
                        outFiles[curr].write(fNames.pop(a))
                        outFiles[curr].write(fSeqs.pop(a))
raw_input("Done.")

Basically you just enter how large the subsample should be, and it will give you three random subsamplings (without replacement). Of course if you only need one then only use one. If the file has less sequences than request this script will die.

I found this useful in a 16S tagged pyrosequencing project I've been working on.

score 2 · Answer 8 · 2014-03-29

I offer one solution here to sample without replacement, which does not require any subsampling and uses roughly 24 bytes per FASTA sequence.

The tradeoff is that one has to read through the file twice, the first time to read in byte offsets and a second and subsequent time for each random sample. But it isn't necessary to read in any of the FASTA set into memory, just line offsets.

Ram · Answer 9 · 2014-04-10

You can use reservoir sampling which is O(N) and does it in one pass (quite handy for millions of Illumina paired-end reads). I have written one-liners to do that in awk:

score 1 · Answer 10 · 2018-11-15

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", len(seq)

randseq = int(sys.argv[1])                          #Number of random sequences desired
print "Number of random sequences desired = ", randseq

if randseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseq = []
outlist = []
print "Randomly chosen output sequences:"

for i in range(randseq):
  choose = random.randint(1,totseq-1)               #Choose a random sequence record number
  for j in range(len(outrandseq)):                  #Test to see if the random sequence record number has already been chosen
    if choose == outrandseq[j]:
      choose = random.randint(1,totseq-1)           #Choose a new random sequence record number if the current one has already been chosen
  outrandseq.append(choose)
  print choose
  outseq = seq[choose]
  outlist.append(outseq)                            #Append seq record to output list

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()

score 1 · Answer 11 · 2018-11-16

import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine. 
# For very large sequence sets it may be too slow -- I just have not tried.

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

numrandseq = int(sys.argv[1])                       #Number of random sequences desired
print "Number of random sequences desired = ", numrandseq

if numrandseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseqset = []
i = 1
for i in range(numrandseq):                         #Create a list of random sequence record numbers for output
  choice = random.randint(1,totseq)
  outrandseqset.append(choice)

i = 1
j = 1
duplicate = 1
while duplicate:                                    #Make sure no sequences are duplicated in the list
    duplicate = 0
    for i in range(numrandseq):
      for j in range(i+1, numrandseq):
        if outrandseqset[i] == outrandseqset[j]:
            outrandseqset[j] = random.randint(1,totseq)
            duplicate = 1


i = 1
print "Randomly chosen output sequences:"
for i in range(numrandseq):
  print outrandseqset[i]

outlist = []
i = 1
for i in range(numrandseq):                         #Create the list of seq records to be written to the output file
  seqnum = outrandseqset[i]
  outseq = seq[seqnum]
  outlist.append(outseq)

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()