GNU Parallel Block Issues
9.4 years ago
salamayg ▴ 30

Hello,

I am using GNU parallel to speed up my BLAST jobs. I have seen the example outlined in the following post (Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them) and used the command:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -db db.fa -query - > results

I am noticing that in the BLAST output generated, sequences are missing (~30 from 5000), and if I run parallel and just examine the blocks that are generated, it seems that parallel loses a certain number of records (fasta records) each time it creates a new block. It doesn't seem like the block is breaking at the correct place. Does anyone have any clue as to why this is happening? Any help is appreciated.

Thank you.

parallel blast • 3.8k views
9.4 years ago
ole.tange ★ 4.4k

To see whether GNU Parallel is to blame, try using 'cat' instead of blastp:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe cat > results

If you get all the sequences, then GNU Parallel is not losing them.
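
A quick way to check whether all sequences arrived is to count the header lines in both files, since a count does not depend on the order in which blocks were written. The following is only a minimal sketch: the function name `count_records` and the files `input.fasta` and `output.fasta` are hypothetical stand-ins for the real input and the `results` file, and the tiny demo data is made up for illustration.

```shell
#!/bin/sh
# Count FASTA records (lines that start with '>') in a file.
count_records() {
    grep -c '^>' "$1"
}

# Demo data only: hypothetical stand-ins for the real input and output files.
tmp=$(mktemp -d)
printf '>r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n' > "$tmp/input.fasta"
printf '>r1\nACGT\n>r3\nTTAA\n' > "$tmp/output.fasta"

in_n=$(count_records "$tmp/input.fasta")
out_n=$(count_records "$tmp/output.fasta")
echo "input: $in_n records, output: $out_n records"

# A mismatch means records were dropped or merged somewhere in the pipeline.
if [ "$in_n" -ne "$out_n" ]; then
    echo "record counts differ"
fi
rm -rf "$tmp"
```

With real data, the two `printf` lines would be dropped and the function pointed at the original FASTA file and the output of the 'cat' test.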


Yes, when I do that it is missing sequences. If I wc the original input and the parallel output, it goes from 10000 to 9588. In the parallel output, the first line starts as:

>CTTCACTAGCT
>r9
sequence

whereas in the original file, it started as:

>r1
sequence
>r2
sequence

etc.

I think there are other instances of missing sequences (at each point where a new block is made), but it is hard to find them without going through 10000 lines manually. Do you have any ideas on how I could troubleshoot this? Parallel is extremely useful to me, but with this issue I cannot use it.

edit: I have found another instance where there is a skip in the read numbers and where the header is altered.

>r187.1 |SOURCES={GI=330827700,fw,273802-273903}|ERRORS={27:C,30:C,32:T,62:A,96:A,99:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
CGGGATCGTGCGGGTGGCTCTGTGCATCCTCGTTGGTTTGAGCGGGGGATGAGTCTGCCGTCAGTGCAGTGGGCCAGAGCAACACCCCGCCAAGCAACAAG
>CE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACAGGTCGTGGTTGTGCAGCCCGCCAGCAGCAATTCGAGAGTCATGGGACGCCCCACTATGATGGACGCTCCCACCACGCCAGCATGCAGACCGTGCATCT
>r195.2 |SOURCES={GI=330827700,bw,2584285-2584386}|ERRORS={36:G,40:A,42:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACCCGCAGCCGACCAAATTGCAGCTGCTGGGGAACGGTGCAGATGGCTACCGGTTGGGTTGATGCTTGCGGCTGCTGGCGAAACTGCACCCGAAGTCGGGC
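
Rather than scanning 10000 lines by hand, the dropped and altered records can be located by comparing the sorted header lines of the two files. This is a sketch under assumptions, not a definitive diagnostic: `missing_headers`, `input.fasta`, and `output.fasta` are hypothetical names, and the demo data only imitates the truncated `>CE_1` header seen above.

```shell
#!/bin/sh
# Print header lines that occur in the first FASTA file but not in the second.
# LC_ALL=C keeps sort and comm in agreement about the collation order.
missing_headers() {
    a=$(mktemp)
    b=$(mktemp)
    grep '^>' "$1" | LC_ALL=C sort > "$a"
    grep '^>' "$2" | LC_ALL=C sort > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Demo data only: the output lacks r2 and carries a truncated header instead.
tmp=$(mktemp -d)
printf '>r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n' > "$tmp/input.fasta"
printf '>r1\nACGT\n>CE_1\nGGCC\n>r3\nTTAA\n' > "$tmp/output.fasta"

# Headers that were lost on the way to the output.
missing=$(missing_headers "$tmp/input.fasta" "$tmp/output.fasta")
# Headers in the output that never existed in the input (altered ones).
altered=$(missing_headers "$tmp/output.fasta" "$tmp/input.fasta")

echo "missing from output: $missing"
echo "altered in output:   $altered"
rm -rf "$tmp"
```

Swapping the argument order is what separates lost headers from mangled ones; with the real files, each lost header should sit near a block boundary if the splitting is at fault.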

If that is true, you have found a bug. Can you make an example available for download? Quoting it here is unfortunately not enough as \n may be quoted wrongly.


Sure. I hope this is an appropriate download: http://ge.tt/6d3f9g62/v/0?c?c

I also tried parallel piped to cat on a different fasta file with simpler headers (to see whether the headers of the original file were the problem), but it still did the same thing.

Also, the exact command I use is:

cat test.fna | parallel -k --gnu --block 100k --recstart '>' --pipe cat > results

It gives exactly the same on 3 of my systems:

$ md5sum test.fna results 
cc27ec20250c65fdbcc0e23fa132eb83  test.fna
cc27ec20250c65fdbcc0e23fa132eb83  results

So what is hitting you is something on your local system. This changes the bug from simple fix to harder debugging, and that should not be done on Biostars.org. Post to bug-parallel@gnu.org and follow "REPORTING BUGS" in 'man parallel'.


Okay thank you very much.


Did you ever find a resolution to this issue? I have experienced the same issue with GNU Parallel 20160422.
