Automating the merging of multiple fastq files into one fastq file
17 months ago
zhou_1228 • 0

I have six fastq files (three forward and three reverse) for every sample in a 96-well plate from an NGS run. As the first step of SNP calling, I need to convert these six files into two files (one forward and one reverse fastq file) for each sample. I am trying to write a shell script that automatically merges the three forward (or three reverse) files into one for many samples. Below is the script I wrote to automatically convert one fastq file to one bwa file for many samples; what I need instead is a script that converts three files into one. Thank you.

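for fq in ~/NGS/*.fastq
do
    echo "working with file $fq"

    # Strip the directory and the .fastq extension to get the base name
    base=$(basename "$fq" .fastq)
    echo "base name is $base"

    bwa=~/results/bwa/${base}.bwa

    # Align with 4 threads against the GMbwaidx index
    bwa aln -t 4 GMbwaidx "$fq" > "$bwa"
done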

My six files for one sample look like this:

142_P001_WB01_S1751_L008_R1_001.fastq
143_P001_WB01_S13_L001_R1_001.fastq
143_P001_WB01_S13_L002_R1_001.fastq

142_P001_WB01_S1751_L008_R2_001.fastq
143_P001_WB01_S13_L001_R2_001.fastq
143_P001_WB01_S13_L002_R2_001.fastq

Hello zhou_1228,

142_P001_WB01_S1751_L008_R1_001.fastq
143_P001_WB01_S13_L001_R1_001.fastq

And how do you know that these files belong to the same sample? Which part of the filename gives that information?

fin swimmer


P001_WB01 represents plate 1, well B1.
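
Since the plate and well fields identify the sample, a small helper can pull that key out of each file name. This is only a sketch: sample_key is a made-up name, and the pattern assumes a three-digit plate number, a row letter A to H (a 96-well plate) and a two-digit column, as in the file names listed in the question.

```shell
# Sketch: extract the plate_well key (e.g. P001_WB01) from a fastq file name.
# Assumption: names contain _P<3 digits>_W<row A-H><2 digits>_ as in the question.
sample_key() {
    # Drop any directory part, then keep only the captured plate_well field.
    # Prints nothing if the name does not contain such a field.
    basename "$1" | sed -n 's/.*_\(P[0-9]\{3\}_W[A-H][0-9]\{2\}\)_.*/\1/p'
}

sample_key 142_P001_WB01_S1751_L008_R1_001.fastq   # prints P001_WB01
sample_key 143_P001_WB01_S13_L002_R2_001.fastq     # prints P001_WB01
```

Piping the keys of all files through sort -u would give the list of distinct samples present in a directory.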


convert one fastq file to one bwa file

There is no bwa file format. bwa mem writes its alignments in SAM format, while bwa aln writes an intermediate .sai file that bwa samse or bwa sampe then converts to SAM. For SAM output, I would write:

bwa=~/results/bwa/${base}.sam

15 months ago
Malcolm.Cook ♦ 1.0k
kansas, usa

Install and use GNU Parallel.

Then use the following model. Remove --dry when you're ready to run:

nPlates=3
nWells=4
parallel -k --dry 'bwa aln -t 4 GMbwaidx <(cat NGS/*{1}_{2}*.fastq) > {1}_{2}.sam' :::: <( seq -f 'P%03g' ${nPlates} ) <(seq -f 'WB%02g' ${nWells} )

With --dry, parallel only prints the commands it would run:

bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB01*.fastq) > P001_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB02*.fastq) > P001_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB03*.fastq) > P001_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB04*.fastq) > P001_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB01*.fastq) > P002_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB02*.fastq) > P002_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB03*.fastq) > P002_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB04*.fastq) > P002_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB01*.fastq) > P003_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB02*.fastq) > P003_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB03*.fastq) > P003_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB04*.fastq) > P003_WB04.sam

Notes:

The above approach

  • depends upon bwa's ability to stream input
  • works with any number of fastq files per plate_well combination.
  • does not need parallel's ability to run multiple jobs at once, since bwa is already given -t 4 threads; parallel starts one job per CPU core by default, so add -j 1 to run one alignment at a time
  • assumes your shell is bash, and depends upon its capability for Process Substitution
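
To see what Process Substitution does without needing bwa or real data, here is a minimal sketch. The two tiny fastq files and the count_lines function are made up for the demonstration; count_lines simply stands in for any program that expects a file name argument.

```shell
#!/usr/bin/env bash
# Sketch: <(command) expands to a path that streams the command's output,
# so a program expecting a file name can read several files as one.
# The two tiny fastq files and count_lines are made up for this demo.
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmp/lane1_R1.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$tmp/lane2_R1.fastq"

# Stand-in for bwa: takes a file name argument and counts its lines.
count_lines() { awk 'END { print NR }' "$1"; }

# The merged stream is never written to disk.
merged_lines=$(count_lines <(cat "$tmp"/lane*_R1.fastq))
echo "merged stream has $merged_lines lines"   # 2 files x 4 lines = 8
rm -r "$tmp"
```

The same syntax fails in shells without this feature (plain sh, for example), which is why the note about bash matters.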

From my understanding, your model runs two commands, cat and bwa, together. But in my case, for each sample, I first need to merge the three R1.fastq files into one -F.fastq and the three R2.fastq files into one -R.fastq, separately, and then run the command "bwa mem GMbwaidx -F.fastq -R.fastq > *.sam" to generate the sam file. Do you have any suggestions for running these two steps automatically?


Sure. The approach is the same; you just need two calls to cat, using slightly different file wildcarding (aka globbing) in each. Also, I now realize your well identifier has a row and a column component. In this updated example, for brevity, I limit it to the first three rows, A through C, and the first two zero-padded columns, 01 and 02:

plate=$(seq -f 'P%03g' 3)
row=$(echo {A..C})
col=$(seq -f '%02g' 2 )
parallel -k --dry 'bwa mem GMbwaidx <(cat NGS/*{1}_W{2}{3}*_R1_*.fastq) <(cat NGS/*{1}_W{2}{3}*_R2_*.fastq) > {1}_W{2}{3}.sam' ::: $plate ::: $row ::: $col

With --dry, parallel only prints the commands it would run:

bwa mem GMbwaidx <(cat NGS/*P001_WA01*_R1_*.fastq) <(cat NGS/*P001_WA01*_R2_*.fastq) > P001_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WA02*_R1_*.fastq) <(cat NGS/*P001_WA02*_R2_*.fastq) > P001_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB01*_R1_*.fastq) <(cat NGS/*P001_WB01*_R2_*.fastq) > P001_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB02*_R1_*.fastq) <(cat NGS/*P001_WB02*_R2_*.fastq) > P001_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC01*_R1_*.fastq) <(cat NGS/*P001_WC01*_R2_*.fastq) > P001_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC02*_R1_*.fastq) <(cat NGS/*P001_WC02*_R2_*.fastq) > P001_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA01*_R1_*.fastq) <(cat NGS/*P002_WA01*_R2_*.fastq) > P002_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA02*_R1_*.fastq) <(cat NGS/*P002_WA02*_R2_*.fastq) > P002_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB01*_R1_*.fastq) <(cat NGS/*P002_WB01*_R2_*.fastq) > P002_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB02*_R1_*.fastq) <(cat NGS/*P002_WB02*_R2_*.fastq) > P002_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC01*_R1_*.fastq) <(cat NGS/*P002_WC01*_R2_*.fastq) > P002_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC02*_R1_*.fastq) <(cat NGS/*P002_WC02*_R2_*.fastq) > P002_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA01*_R1_*.fastq) <(cat NGS/*P003_WA01*_R2_*.fastq) > P003_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA02*_R1_*.fastq) <(cat NGS/*P003_WA02*_R2_*.fastq) > P003_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB01*_R1_*.fastq) <(cat NGS/*P003_WB01*_R2_*.fastq) > P003_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB02*_R1_*.fastq) <(cat NGS/*P003_WB02*_R2_*.fastq) > P003_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC01*_R1_*.fastq) <(cat NGS/*P003_WC01*_R2_*.fastq) > P003_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC02*_R1_*.fastq) <(cat NGS/*P003_WC02*_R2_*.fastq) > P003_WC02.sam
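
If the merged files should exist on disk, as in the two-step workflow asked about, the same plate, row and column loop can be written in plain bash without GNU Parallel. This is a sketch under assumptions: merge_sample, merge_all, the directory arguments and the -F/-R output names are illustrative, and the bwa mem command is only printed, not run, in the spirit of --dry.

```shell
#!/usr/bin/env bash
# Sketch only: merge_sample, merge_all, the directory arguments and the
# -F/-R output names are illustrative assumptions, not from the answer above.

# Concatenate one sample's R1 lane files into <sample>-F.fastq and its
# R2 lane files into <sample>-R.fastq. Returns 1 if either set is missing.
merge_sample() {
    local in_dir=$1 out_dir=$2 sample=$3
    local direction suffix files
    for direction in R1 R2; do
        # Globs expand in sorted order, so lanes are concatenated in the same
        # order for R1 and R2 when the names differ only in that field.
        files=("$in_dir"/*"${sample}"_*_"${direction}"_*.fastq)
        [ -e "${files[0]}" ] || return 1   # pattern matched no file
        if [ "$direction" = R1 ]; then suffix=F; else suffix=R; fi
        cat "${files[@]}" > "$out_dir/${sample}-${suffix}.fastq"
    done
}

# Loop over the same plate, row and column values as the example above and
# print (not run) the bwa mem command for every sample that has data.
merge_all() {
    local in_dir=$1 out_dir=$2 plate row col sample
    mkdir -p "$out_dir"
    for plate in P001 P002 P003; do
        for row in A B C; do
            for col in 01 02; do
                sample="${plate}_W${row}${col}"
                if merge_sample "$in_dir" "$out_dir" "$sample"; then
                    echo "bwa mem GMbwaidx $out_dir/${sample}-F.fastq $out_dir/${sample}-R.fastq > ${sample}.sam"
                fi
            done
        done
    done
}
```

Calling merge_all NGS merged would write the merged files under merged/ and list one bwa mem command per sample found. For a full 96-well plate the row list would run A through H and the column list 01 through 12. Keeping the lane order identical for R1 and R2 matters because bwa mem pairs reads by their position in the two files.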

As rewritten, the approach


Thank you so much for your reply. I found that there are many GNU parallel packages available for download. My OS is Linux Mint 18.1, so which one should I download? Thank you.


I cannot help much more than to say: install the latest version of GNU parallel that is packaged for your operating system distribution.

You can probably install it with:

sudo apt-get install parallel

but it is best to follow your OS documentation on installing software,

or, for hints, see https://www.gnu.org/software/parallel/
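
After installing, a quick check like the following can confirm that the parallel on your PATH is the GNU one. It is a sketch: have_tool is a made-up helper name, and the check relies on GNU parallel starting its --version output with the words "GNU parallel".

```shell
# Sketch: check that a command exists, then that it identifies as GNU parallel.
# have_tool is a hypothetical helper name used only for this example.
have_tool() { command -v "$1" >/dev/null 2>&1; }

if have_tool parallel && parallel --version 2>/dev/null | grep -q '^GNU parallel'; then
    echo "GNU parallel is installed"
else
    echo "GNU parallel not found on PATH" >&2
fi
```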


I got it. Thank you so much.


Great - glad to help - please upvote and accept the answer!
