How To Randomly Sample A Subset Of Lines From A BED File
7
4
10.5 years ago
bede.portz ▴ 540

How do I randomly sample lines from a bed file? Specifically, I want to make a smaller bed file of ChIP-seq reads from a larger one, while maintaining the relative proportion of lines from each chromosome.

Thanks,

Bede

bed chip-seq • 15k views
ADD COMMENT
11
10.5 years ago

To preserve the relative proportion of elements per chromosome, I think you will need to do a bit more work.

Overall procedure:

  1. Split the BED input into per-chromosome files
  2. Count the number of lines in each per-chromosome file
  3. Calculate the proportion of lines in each per-chromosome file, based on the total number of lines in the input BED
  4. Sample from each per-chromosome file separately, based on these proportions

Using BEDOPS sort-bed, first sort your BED data:

$ sort-bed elements.unsorted.bed > elements.bed

Split it with bedextract and count lines. The bedextract tool very efficiently splits BED files into per-chromosome files, and we put per-chromosome line counts into a text file called counts.txt:

$ for chr in `bedextract --list-chr elements.bed`; do \
    bedextract $chr elements.bed > elements.$chr.bed; \
    count=`wc -l elements.$chr.bed | cut -d' ' -f1`; \
    echo -e "$chr\t$count"; \
done > counts.txt

Let's assume that we ultimately want to sample a total of 1000 elements from the input BED file. You can change this value to whatever you need.

We next calculate a table of per-chromosome proportions and how many elements you will pull from each chromosome:

$ sampleSize=1000
$ sum=`cut -f2 counts.txt | perl -nle '$sum += $_ } END { print $sum' -`
$ awk -v sum=$sum -v sampleSize=$sampleSize '{print $0"\t"($2/sum)"\t"int(sampleSize*$2/sum)}' counts.txt > proportions.txt
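
If the Perl one-liner looks cryptic, an equivalent awk version of the sum (just an alternative, not part of the original recipe) is:

$ sum=`awk '{ s += $2 } END { print s }' counts.txt`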

Finally, do a random sample from each of the per-chromosome BED files, choosing a sample size for each chromosome that reflects its proportion in the larger input:

$ awk '{ \
    perChrName=$1; \
    perChrN=$4; \
    cmd="shuf elements."perChrName".bed | head -n "perChrN; \
    system(cmd); \
}' proportions.txt \
| sort-bed - \
> sample.bed

The output is a sort-bed-sorted BED file called sample.bed. This file contains roughly 1000 rows (close, but not necessarily exact, because the int() call in the awk statement that calculates proportions truncates each per-chromosome allocation). The number of samples from each chromosome will therefore roughly reflect the distribution of elements in the larger input BED file.
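
As a small optional check (not part of the recipe above), you can sum the fourth column of proportions.txt to see exactly how many elements the truncated allocations will produce:

$ awk -F'\t' '{ total += $4 } END { print total }' proportions.txt

If the total falls a few elements short of your target, you could add one element to the allocations of the largest chromosomes until it matches.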

ADD COMMENT
0

This is a more thorough solution that seemingly ensures each chromosome is proportionally represented. However, it seems to me that if shuf is completely random, then data from each chromosome will be distributed randomly through the resulting file, so taking head -n somenumber should give me an approximately representative sampling?

ADD REPLY
1

Being completely random and maintaining your proportions are incompatible goals. After all, a single random sampling is just as likely to return reads entirely from chromosome 4 as it is to be distributed across all chromosomes.

ADD REPLY
1

Point taken, I will try to be more precise, and less colloquial, when seeking help in the future.

ADD REPLY
0

But this is not a single random sampling. If the OP wants to subset 10,000 lines, then he is performing 10,000 random samplings out of the original population without replacement. Assuming a good distribution of reads across chromosomes in the original population (the BED file), the probability of sampling 10,000 times and only collecting reads from chromosome 4 is vanishingly small.

ADD REPLY
0

You're not wrong, but if you really want to maintain the proportions with high fidelity, then you have to randomly sample an appropriate number from each chromosome. Whether random sampling from a shuffled file is 'good enough' will depend on the specific application.
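
If you want to see how far a plain shuf/head draw strays for your particular data, a quick empirical check (just a sketch) is to compare the per-chromosome counts of the full file and of a subset:

$ cut -f1 input.bed | sort | uniq -c
$ shuf input.bed | head -n 1000 | cut -f1 | sort | uniq -c

For a large, well-distributed input the two breakdowns will usually look similar, but only the stratified approach above guarantees the proportions.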

ADD REPLY
0

I attempted to use this particular approach, but for some reason it only pulled intervals from chr1-3, X, and Y, even though the BED file contains all of the chromosomes. Any reason you can see as to why this might occur? Thanks!

ADD REPLY
0

Not sure. Is everything sorted?

ADD REPLY
0

Actually, I have figured that out and applied it successfully, except for one thing: I input a list of about 800K peaks and am looking for the 1000 outlined in your code, but I get about 15K. Here is how I ran it:

for chr in `bedextract --list-chr BNST_merged500kPeaks_allEB.bed`; do \
    bedextract $chr BNST_merged500kPeaks_allEB.bed > BNST_merged500kPeaks_allEB.$chr.bed; \
    count=`wc -l BNST_merged500kPeaks_allEB.$chr.bed | cut -d' ' -f1`; \
    echo -e "$chr\t$count"; done \ 
> BNST_merged500kPeaks_allEB_counts.txt

sampleSize=1000
sum=`cut -f2 counts.txt | perl -nle '$sum += $_ } END { print $sum' -`
awk -v sum=$sum -v sampleSize=$sampleSize '{print $0"\t"($2/sum)"\t"int(sampleSize*$2/sum)}' BNST_merged500kPeaks_allEB_counts.txt > BNST_merged500kPeaks_allEB_proportions.txt

awk '{ \
    perChrName=$1; \
    perChrN=$4; \
    cmd="shuf BNST_merged500kPeaks_allEB."perChrName".bed | head -n "perChrN; \
    system(cmd); \
}' BNST_merged500kPeaks_allEB_proportions.txt \
| sort-bed - \
> BNST_merged500kPeaks_allEB_downsample.bed

I'm wondering if the following entries in the counts file might have a bearing on this. Is there a way to remove the Un and random scaffolds at this stage of the analysis and only focus on the main mouse chromosomes? The counts file includes:

chr1_GL456210_random    1
chr1_GL456211_random    5
chr1_GL456212_random    3
chr1_GL456221_random    1
chr2    59105
chr3    49119
chr4    47234
chr4_GL456216_random    14
chr4_JH584295_random    1
chr5    47250
chr5_JH584299_random    3
chr6    46638
chr7    40543
chr8    42224
chr9    41203
chrUn_GL456239  11
chrUn_GL456359  5
chrUn_GL456360  1
chrUn_GL456366  8
ADD REPLY
0

You could use grep to remove the unplaced (chrUn) and random scaffolds:

$ grep -v -e "^chr.*_.*" in.txt > out.txt

This assumes GNU grep, so make changes as needed for OS X, etc.
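
A portable alternative (just a suggestion) is to filter on the first field with awk, since the unplaced and random scaffolds all carry an underscore in their names:

$ awk '$1 !~ /_/' in.txt > out.txt

The same filter works on the counts file or on the BED file itself.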

ADD REPLY
0

Thanks! One final thing on this: I'm wondering how to subsample intervals of similar lengths, let's say an average of 350 bp?

ADD REPLY
2

Just spitballing:

You could use a normal distribution model in R to generate a pool of interval lengths with a mean length of 350 bases, with a SD of 50:

lengths <- rnorm(n=1234, mean=350, sd=50)

Or you could use whatever distribution best models your random intervals.

Generate a pool of start positions for a given chromosome (say, chr1 from hg38):

starts <- floor(runif(n=1234, min=0, max=(248956422 - 350 - 1)))

The floor() call keeps the start coordinates integral, as BED requires. Subtracting 350 from the max ensures that a random element does not extend past the edge of chr1 (bounds). Subtracting another 1 keeps the interval zero-indexed, half-open.

For each start position, you add the random length, converted to an integer value:

stops <- starts + unlist(lapply(lengths, as.integer))

Then make a vector of chromosomes:

chrs <- rep('chr1', 1234)

And bundle the sample into a data frame:

df <- data.frame(chrs, starts, stops)

From there you could export this to a text file or GenomicRanges object (maybe?).

This would need a little bit of work to package into a function for each chromosome and its assembly's bounds, but hopefully this gives an idea of what you could perhaps do.
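
As an aside, if fixed-length intervals would be good enough, something like bedtools random can do this at the shell level. This is only a sketch; it assumes a two-column chromosome-sizes file (here called mm10.genome) and a bedtools version that supports these options:

$ bedtools random -g mm10.genome -l 350 -n 1000 | sort-bed - > random350.bed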

ADD REPLY
0

I guess I could also take the aggregate BED file representing all chromosomes before downsampling, first subsample intervals of about 350 bp in length, and then use your approach from the start of this thread to subsample those evenly across all chromosomes? Is there an easy way of taking just a BED file and pulling out intervals that average 350 bp in length?

ADD REPLY
1

The bedmap tool has options to report the sizes of reference and/or mapped elements, which you could use with awk to report elements of length 350:

$ bedmap --echo --echo-ref-size --delim '\t' elements.bed | awk '($NF == 350)' > subsample.bed

Or elements within some window around 350:

$ bedmap --echo --echo-ref-size --delim '\t' elements.bed | awk '(($NF > 300) && ($NF < 400))' > subsample.bed

Etc.

ADD REPLY
0

Perfect, thanks! Second option is what I am looking for in this instance.

ADD REPLY
0

One other quick thing: is it possible in this scenario to match the read counts of the subsampled ~350 bp peaks to the target peak set? I'm basically working with about 10K target peaks at varying read depths, and I would like to take the downsampled 10K peaks at ~350 bp across all chromosomes and match their read depth to the target. Thanks in advance for any advice!

ADD REPLY
1

Rejection sampling might work. You would generate an element of the desired average length, measure its reads, and either keep or discard it, depending on whether that measurement falls within the bounds of the read statistics for your target peak set. Keep sampling until you have as many elements matching your criteria as you need for your experiment. https://en.m.wikipedia.org/wiki/Rejection_sampling

ADD REPLY
0

Do you know of a package or tool that assists in rejection sampling or would you script it out yourself? Thanks!

ADD REPLY
0

I would just script it. It's the equivalent of a do...while loop: keep taking samples that meet your criteria and ignore the rest until you hit some predefined limit. A rough sketch in Python:

validSamples = []  # list of samples that meet the criteria
N = 1234           # desired number of samples

def drawSample():
    # placeholder: generate one candidate element (e.g., a random ~350 bp interval)
    raise NotImplementedError

def sampleMeetsCriteria(sample):
    # placeholder: return True when the element's read count falls within your target bounds
    raise NotImplementedError

# draw samples until you get enough that meet the criteria
while True:
    sample = drawSample()
    if sampleMeetsCriteria(sample):
        validSamples.append(sample)
    if len(validSamples) >= N:
        break

# do stuff with the list of valid samples...

You basically need to write a sampleMeetsCriteria function. Here, that might mean checking that an element has a read count in line with what you see in your target peaks.
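
One way to implement that check at the shell level (a sketch, assuming BEDOPS bedmap and sorted inputs, with candidates.bed and reads.bed as hypothetical file names) is to annotate each candidate element with its overlapping read count and keep only those inside the bounds you measured on the target peaks:

$ bedmap --echo --count --delim '\t' candidates.bed reads.bed \
    | awk -v lo=20 -v hi=100 '($NF >= lo) && ($NF <= hi)' \
    > matched.bed

Here lo and hi are hypothetical bounds taken from the read-count distribution of your target peak set.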

ADD REPLY
5
10.5 years ago
Gabriel R. ★ 2.9k

cat input.bed | shuf | head -n [number of lines you want]

ADD COMMENT
1

This won't maintain the per-chromosome representation (the same relative number of lines from each chromosome, whether or not they are shuffled).

That would require cut on the first column piped to sort -u to fetch the unique chromosomes, and then a combination of grep and head for each chromosome to generate the output file, which could then be randomized using shuf. There might be a better solution, as the file can be huge (> 50 million lines).
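
A rough sketch of that idea (using awk for an exact match on the first column instead of grep, with an arbitrary per-chromosome count N, so every chromosome is represented equally rather than proportionally):

$ N=100
$ for chr in `cut -f1 input.bed | sort -u`; do \
    awk -v c="$chr" '$1 == c' input.bed | shuf | head -n "$N"; \
done | sort -k1,1 -k2,2n > subset.bed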

ADD REPLY
1

Based on the comments, it seems that a simple solution is:

$ cat input.bed | shuf > shufinput.bed
$ cat shufinput.bed | head -n [number of lines you want] > randomsubset.bed

This solution leaves the sorted order of the lines in input.bed untouched, while also generating a random subset of data that should still represent each chromosome proportionally in randomsubset.bed.

However, trying this on a test set of data, the ratios weren't maintained.

ADD REPLY
1

Based on how your question is specifically worded, I still don't think this is what you want; but if it is, Irsan is correct that this is a useless use of cat. Just operate on the file directly and pipe the results from one command to the next:

$ shuf input.bed | head -n 1234 > sampleFromEntireSet.bed

As you note, this will not preserve the relative proportions of elements, just like permuting a deck of sorted playing cards will destroy any ordering of suits and ranks. If you want to preserve ratios (i.e., get a randomly sampled subset of ranks for a given suit) then you'll need to permute subsets (i.e., first split the deck of cards by suit, before sampling for a subset of ranks within a suit).

ADD REPLY
0

Thanks, SamuelL

ADD REPLY
1
10.5 years ago

You could also just grab every nth line (which is less random, but will maintain your proportions).

# grabs every 100th line (GNU sed)
sed -n '0~100p' input.bed > output.bed
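
To aim for a particular subset size rather than a fixed step, you can derive the step from the file's line count first (a sketch; assumes GNU sed and an input with at least as many lines as the target):

target=1000
lines=`wc -l < input.bed`
step=$(( lines / target ))
sed -n "0~${step}p" input.bed > output.bed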
ADD COMMENT
1
3.1 years ago
cjgunase ▴ 50

bedtools sample -i input.bed -n 1000

will randomly sample 1000 rows from input.bed
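
If you need the draw to be reproducible, newer bedtools versions also accept an integer seed (check bedtools sample -h for your version):

bedtools sample -i input.bed -n 1000 -seed 42 > subset.bed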

ADD COMMENT
0
10.5 years ago

I wanted a one-liner for this tricky little problem, so I consulted Stack Overflow, resulting in:

Try using awk (gawk, since this relies on asort):

#!/bin/bash
numRow=3
awk 'n[$1]<'$numRow' {a[$1]=a[$1]"\n"$0; n[$1]++} END { asort(a,b); for (x in b) print b[x] }' file

Output:

chr1    3003204 3003454 *   37  +
chr1    3003235 3003485 *   37  +
chr1    3003148 3003152 *   37  -

chr11   71863609    71863647    *   37  +
chr11   71864025    71864275    *   37  +
chr11   71864058    71864308    *   37  -

chrY    90828920    90829170    *   23  -
chrY    90829096    90829346    *   23  +
chrY    90828924    90829174    *   23  -

If a randomized order of lines is required inside the groups, then just add a little use of shuf:

#!/bin/bash
numRow=3
shuf file | awk 'n[$1]<'$numRow' {a[$1]=a[$1]"\n"$0; n[$1]++} END { asort(a,b); for (x in b) print b[x] }'

A very fast and efficient solution.

Time taken for a file of 10 million reads:

real    0m2.923s
user    0m2.859s
sys    0m0.063s

Ref: http://stackoverflow.com/questions/19690954/shuffling-a-large-text-file-without-with-group-order-maintained

Credits: jkshah

Cheers

ADD COMMENT
0
9.9 years ago

I wrote a tool called sample (Github site) that uses reservoir sampling to pull an ordered or randomly-shuffled uniformly random sample from an input text file delimited by newline characters. It can sample with or without replacement.

When sampling without replacement, this application offers a few advantages over common shuf-based approaches:

  • It performs roughly 2.25-2.75x faster than shuf in informal tests on OS X and Linux hosts.
  • It uses much less memory than the usual reservoir sampling approach that stores a pool of sampled elements; instead, sample stores the start positions of sampled lines (roughly 8 bytes per line).
  • Using less memory also gives sample two advantages:

    1. Avoid "shuf: memory exhausted" errors for whole-genome-scale input files.
    2. Allow a larger pool of samples, before running out of system memory. For instance, a 2 GB allocation would allow a sample size up to ~268M random elements.

The sample tool stores a pool of line positions and makes two passes through the input file. One pass generates the sample of random positions, while the second pass uses those positions to print the sample to standard output. To minimize the expense of this second pass, we use mmap routines to gain random access to data in the (regular) input file on both passes.

The benefit that mmap provided was significant. For comparison purposes, we also add a --cstdio option to test the performance of the use of standard C I/O routines (fseek(), etc.); predictably, this performed worse than the mmap-based approach in all tests, but timing results were about identical with gshuf on OS X and still an average 1.5x improvement over shuf under Linux.

The sample tool can be used to sample from any text file delimited by single newline characters (BED, SAM, VCF, etc.). Also, using the --lines-per-offset option allows sampling and shuffling repeated multiples of newline character-delimited lines, useful for sampling from (for example) FASTQ files, where records are split across four lines.

By adding the --preserve-order option, the output sample preserves the input order. For example, when sampling from an input BED file that has been sorted by BEDOPS sort-bed - which applies a lexicographical sort on chromosome names and a numerical sort on start and stop coordinates - the sample will also have the same ordering applied, with a relatively small O(k log k) penalty for a sample of size k.

By omitting the sample size parameter, the sample tool will shuffle the entire file. This tool can be used to shuffle files that shuf has trouble with; however, in this use case it currently operates slower than shuf (where shuf can still be used). We recommend use of shuf for shuffling an entire file, or specifying the sample size (up to the line count, if known ahead of time), when possible.

One downside at this time is that sample does not process a standard input stream; the input must be a regular file.
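
For reference, here are hypothetical invocations pieced together from the options described above (check sample --help for the exact syntax in your build):

$ sample -k 1000 --preserve-order elements.bed > sample.bed
$ sample -k 100000 --lines-per-offset 4 reads.fastq > sample.fastq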

ADD COMMENT
0

Hi Alex

I would like to use reservoir sample to randomly sample 10000 SNPs from a VCF file and produce a new VCF file with those SNPs. Can reservoir sample do this? Also, could you please provide some installation instructions?

Thanks!

Monique

ADD REPLY
0

I think the following should work:

$ git clone https://github.com/alexpreynolds/sample.git
...
$ cd sample
$ make
...
$ ./sample -k 10000 /path/to/snps.vcf > /path/to/snps_subset.vcf

I've added some features recently, but the current code should work.

Note that I changed the default behavior (as described above) from sample-in-order to sample-and-shuffle. If you need to preserve order, add the --preserve-order option.

ADD REPLY
0

This is perfect, thanks so much for the quick response!

Perhaps this is beyond the capabilities of reservoir-shuffle, but is it possible to subsample the sites for all individuals in the VCF file?

e.g., let's say I have a VCF file that contains 100 individuals and 1 million SNP sites. I would like a new file with all 100 individuals and 10000 SNP sites that have been randomly sampled from the larger file.

ADD REPLY
0

This is just a generic sample tool. You might use VCFTools to split your parent file up by individual, then draw a random sample from each subset.

ADD REPLY
0
4.8 years ago
ATpoint 81k

Adapted this code snippet from Heng Li's answer on random sampling of FASTA files using bioawk; see http://seqanswers.com/forums/showthread.php?t=16845

bioawk -c bed -v k=INT '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' in.bed

where INT in k=INT is an integer indicating the number of random lines to extract.

ADD COMMENT
