Question: GNU Parallel Block Issues

Hello,

I am using GNU parallel to speed up my BLAST jobs. I have seen the example outlined in the following post (https://www.biostars.org/p/63816/) and used the command:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -db db.fa -query - > results

In the BLAST output, some sequences are missing (about 30 out of 5000). When I run parallel and examine only the blocks it generates, it appears to lose a number of FASTA records each time it starts a new block, as if the block were not being split at the correct place. Does anyone have any idea why this is happening? Any help is appreciated.

Thank you.

5.2 years ago salamayg • 20 • updated 5.2 years ago ole.tange ♦ 3.4k

To see whether parallel is to blame, try using 'cat' instead of blastp:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe cat > results

If you get all the sequences, then GNU Parallel is not losing them.
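One possible way to check the result of that test is sketched below. It is only an illustration: the function name check_fasta_roundtrip is hypothetical, and it assumes every FASTA header line starts with '>'. It compares the number of records in the two files and then checks whether they are byte-identical.

```shell
# Hypothetical helper (not part of GNU Parallel): compare a FASTA file with
# the copy that came out of the 'parallel ... --pipe cat' test.
check_fasta_roundtrip() {
    in_file=$1
    out_file=$2
    # Count header lines; assumes each record starts with '>' in column 1.
    in_count=$(grep -c '^>' "$in_file" || true)
    out_count=$(grep -c '^>' "$out_file" || true)
    echo "records: input=$in_count output=$out_count"
    # cmp -s prints nothing and exits 0 only if the files are byte-identical.
    if cmp -s "$in_file" "$out_file"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}

# Example usage (commented out; requires GNU Parallel and your own file):
# cat 1gb.fasta | parallel -k --block 100k --recstart '>' --pipe cat > results
# check_fasta_roundtrip 1gb.fasta results
```

Without -k the blocks may be written in a different order than they were read, so the files can differ byte for byte even when no record is lost; in that case the record counts are the more meaningful comparison.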

5.2 years ago ole.tange ♦ 3.4k

Yes, when I do that the output is missing sequences. Running wc on the original input and on the parallel output gives 10000 lines versus 9588. The parallel output begins with:

CTTCACTAGCT

r9

sequence

whereas in the original file, it started as:

r1

sequence

r2

sequence

etc.

I think there are other instances of missing sequences (at each point where a new block is made), but they are hard to find without going through 10000 lines manually. Do you have any ideas how I could troubleshoot this? Parallel is extremely useful to me, but with this issue I cannot use it.
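Rather than scanning the lines by hand, the missing records could be located by comparing header lines. The following is a rough sketch, not a tested recipe for this dataset: the function name missing_headers is hypothetical, and it assumes header lines start with '>'. It lists every header that is present in the input but absent from the output.

```shell
# Hypothetical helper: print FASTA headers found in the input file
# but not in the output file.
missing_headers() {
    in_file=$1
    out_file=$2
    tmp_dir=$(mktemp -d)
    # Keep only header lines, sorted so that comm can compare them.
    grep '^>' "$in_file" | sort > "$tmp_dir/in.txt"
    grep '^>' "$out_file" | sort > "$tmp_dir/out.txt"
    # comm -23 prints the lines that occur only in the first file.
    comm -23 "$tmp_dir/in.txt" "$tmp_dir/out.txt"
    rm -rf "$tmp_dir"
}

# Example usage (commented out; file names are placeholders):
# missing_headers test.fna results
```

A header that was truncated at a block boundary no longer starts with '>' in the output, so its original form shows up in this list alongside any records that were dropped entirely.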

edit: I have found another instance where there is a skip in the read numbers and where the header is altered.

r187.1 |SOURCES={GI=330827700,fw,273802-273903}|ERRORS={27:C,30:C,32:T,62:A,96:A,99:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
CGGGATCGTGCGGGTGGCTCTGTGCATCCTCGTTGGTTTGAGCGGGGGATGAGTCTGCCGTCAGTGCAGTGGGCCAGAGCAACACCCCGCCAAGCAACAAG

CE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACAGGTCGTGGTTGTGCAGCCCGCCAGCAGCAATTCGAGAGTCATGGGACGCCCCACTATGATGGACGCTCCCACCACGCCAGCATGCAGACCGTGCATCT

r195.2 |SOURCES={GI=330827700,bw,2584285-2584386}|ERRORS={36:G,40:A,42:A}|SOURCE_1="Aeromonas veronii B565 chromosome" (db14d9defaae9617cf9b20eb8bb2b46eefae8000)
ACCCGCAGCCGACCAAATTGCAGCTGCTGGGGAACGGTGCAGATGGCTACCGGTTGGGTTGATGCTTGCGGCTGCTGGCGAAACTGCACCCGAAGTCGGGC

5.2 years ago salamayg • 20

If that is true, you have found a bug. Can you make an example available for download? Quoting it here is unfortunately not enough as \n may be quoted wrongly.

5.2 years ago ole.tange ♦ 3.4k

Sure. I hope this is an appropriate download: http://ge.tt/6d3f9g62/v/0?c?c

I also tried piping through parallel to cat with a different fasta file that has simpler headers (to see whether the headers of the original file were the issue), but it still did the same thing.

Also, the exact command I use is:

cat test.fna | parallel -k --gnu --block 100k --recstart '>' --pipe cat > results

5.2 years ago salamayg • 20

The output is exactly the same as the input on 3 of my systems:

$ md5sum test.fna results 
cc27ec20250c65fdbcc0e23fa132eb83  test.fna
cc27ec20250c65fdbcc0e23fa132eb83  results

So what is hitting you is something on your local system. This changes the bug from a simple fix to harder debugging, which should not be done on Biostars.org. Post to bug-parallel@gnu.org and follow "REPORTING BUGS" in 'man parallel'.

5.2 years ago ole.tange ♦ 3.4k

Okay, thank you very much.

5.2 years ago salamayg • 20

Did you ever find a resolution to this issue? I have experienced the same issue with GNU parallel 20160422.

3.7 years ago danielfortin86 • 0
