Extract data from a large bed file
el24 ▴ 40 · 4.1 years ago

Hi all!

I have a big .bed file, about 100 GB, and I want to extract specific columns from it in a timely manner. I was wondering if there is a tool that works well for huge BED files (the input could also be bed.gz). Could you please tell me if you know an excellent tool (or Python package) that can help me generate my desired output?

Here is an example of the head of the data:

chr1    10006   10018   M6176_1.02-NR2F1    0.00117 +   sequence=taaccctaaccc
chr1    10006   10020   M6432_1.02-PPARD    0.00034 +   sequence=taaccctaacccta
chr1    10008   10030   M6456_1.02-RREB1    0.00014 -   sequence=GGGTTAGGGTTAGGGTTAGGGT

Now imagine I build my output from the exact values of the 1st, 2nd, 3rd, 5th, and 6th columns, plus the TF name from the 4th column. For the first line of my data, I would like the output to be:

chr1 10006 10018 NR2F1 0.00117 +

I know I can extract it with awk using the command below, but I hope there are faster ways to do it:

awk '$5<0.01 {gsub(/.*-/,"",$4); print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}' file.bed

My second question: could you give me some recommendations for filtering on a column's value, for example keeping a row only when the 5th column is less than 0.00001?
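
For instance, I imagine something like this in Python, though I don't know whether it is the fastest option on a 100 GB file (the file name here is just for illustration):

import sys

with open('file.bed') as r:
    for line in r:
        fields = line.split('\t')
        if float(fields[4]) < 0.00001:   # keep rows with very small 5th-column values
            sys.stdout.write(line)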

Thank you very much!

Tags: bed, alignment, genome

Since you are interested in low-level text processing, I'd say that awk or basic Python would be the way to go. Higher-level parsers provide useful shortcuts, but usually at the cost of speed.
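
For example, a higher-level route through pandas might look like this sketch (my illustration, not benchmarked; chunked reading keeps memory bounded even for a 100 GB file, but the per-chunk overhead usually makes it slower than a plain awk or Python loop):

import sys

import pandas as pd

# Stream the file in 1M-line chunks; columns get integer labels 0..6.
for chunk in pd.read_csv(sys.argv[1], sep='\t', header=None, chunksize=1_000_000):
    kept = chunk[chunk[4] < 0.01].copy()       # filter on the 5th column
    kept[3] = kept[3].str.split('-').str[-1]   # keep the TF name after the hyphen
    kept.iloc[:, :6].to_csv(sys.stdout, sep='\t', header=False, index=False)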

wm ▴ 560 · 4.1 years ago

I ran a simple test, and it turns out Python is a bit faster than awk for your question (before running it, I expected awk to be faster).

For the test, I generated two files (3 million lines, ~200 MB; 30 million lines, ~2 GB) based on your example, with a few changes so that every 5th-column value is below 0.01.

The test files:

$ wc -l test_3m_rows.bed test_30m_rows.bed
   3000000 test_3m_rows.bed
  30000000 test_30m_rows.bed
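
A short script along these lines can generate such files (a sketch; make_test.py and the generation details are my assumptions, not the original setup):

$ cat make_test.py
#!/usr/bin/env python3
# Cycle through the three example records and write random scores
# below 0.01, so every generated row passes the $5 < 0.01 filter.
import random
import sys

templates = [
    ("chr1", "10006", "10018", "M6176_1.02-NR2F1", "+", "sequence=taaccctaaccc"),
    ("chr1", "10006", "10020", "M6432_1.02-PPARD", "+", "sequence=taaccctaacccta"),
    ("chr1", "10008", "10030", "M6456_1.02-RREB1", "-", "sequence=GGGTTAGGGTTAGGGTTAGGGT"),
]

n = int(sys.argv[1])  # number of lines to generate, e.g. 3000000
for i in range(n):
    chrom, start, end, name, strand, seq = templates[i % 3]
    score = f"{random.uniform(0, 0.0099):.5f}"  # stays below 0.01 after rounding
    print(f"{chrom}\t{start}\t{end}\t{name}\t{score}\t{strand}\t{seq}")

$ python make_test.py 3000000 > test_3m_rows.bed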

Awk version:

time awk '$5<0.01{split($4,a,"-"); print $1"\t"$2"\t"$3"\t"a[2]"\t"$5"\t"$6}' test_3m_rows.bed > out1.bed

real    0m5.135s
user    0m5.012s
sys     0m0.120s

time awk '$5<0.01{split($4,a,"-"); print $1"\t"$2"\t"$3"\t"a[2]"\t"$5"\t"$6}' test_30m_rows.bed > out2.bed

real    0m53.181s
user    0m49.875s
sys     0m1.244s

Python version:

$ time python foo.py test_3m_rows.bed > out3.bed

real    0m4.416s
user    0m4.195s
sys     0m0.220s

$ time python foo.py test_30m_rows.bed > out4.bed

real    0m46.769s
user    0m43.935s
sys     0m2.396s

The output of each command:

$ wc -l out*.bed
   3000000 out1.bed
  30000000 out2.bed
   3000000 out3.bed
  30000000 out4.bed

Here is the Python script:

$ cat foo.py
#!/usr/bin/env python3
# Keep rows whose 5th column (score) is below 0.01, strip the motif ID
# prefix from the 4th column, and print the first six columns.

import sys

with open(sys.argv[1]) as r:
    for line in r:
        fields = line.split('\t')
        if float(fields[4]) >= 0.01:
            continue
        # 'M6176_1.02-NR2F1' -> 'NR2F1'
        fields[3] = fields[3].split('-')[1]
        print('\t'.join(fields[:6]))
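
Since your file may also be bed.gz, the same loop can read gzipped input directly via the standard gzip module (an assumed variation that I have not benchmarked):

import gzip
import sys

# 'rt' decodes the gzip stream as text, so iteration yields str lines
with gzip.open(sys.argv[1], 'rt') as r:
    for line in r:
        fields = line.split('\t')
        if float(fields[4]) >= 0.01:
            continue
        fields[3] = fields[3].split('-')[1]
        print('\t'.join(fields[:6]))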

Thank you very much, this is very helpful!

