Question: paired-end sequence alignment
0
Entering edit mode
3 months ago
evelyn • 30

I am trying to run bwa for multiple paired end fastq files, located in a folder named sample_data. I am using this code to get .sorted.bam files to a new folder called output:

total_files=`ls ./sample_data/*.fastq | wc -l`
arr=( $(ls ./sample_data/*.fastq) )
for ((i=0; i<$total_files; i+=2))
{
sample_name=`echo ${arr[$i]} | awk -F "_" '{print $1}'`
echo "[mapping running for] $sample_name"
printf "\n"
bwa mem -t 12 ref.fa ${arr[$i]} ${arr[$i+1]} | samtools sort -o output/$sample_name.sorted.bam

But the echo says

[mapping running for] ./sample

And got only one sorted.bam file named sample.sorted.bam It is not giving the output files named as input files. Any suggestion is appreciated.

ADD COMMENTlink 3 months ago evelyn • 30
1
Entering edit mode
3 months ago
SMK ♦ 1.3k
Ghent, Belgium

Hi evelyn,

When $i=0, ${arr[$i]} is ./sample_data/a1_R1.fastq. So awk -F "_" will divide ${arr[$i]} into ./sample, data/a1, and R1.fastq (separated by _). You can change from:

sample_name=`echo ${arr[$i]} | awk -F "_" '{print $1}'`

to

sample_name=`basename ${arr[$i]} | awk -F "_" '{print $1}'`

Which pipes a1_R1.fastq to awk instead of ./sample_data/a1_R1.fastq. You can also use basename to get a1 directly, like this:

for R1 in sample_data/*_R1.fastq; do
  sample_name=$(basename $R1 _R1.fastq)
  R2=$(echo $R1 | sed 's/_R1/_R2/')
  echo "[mapping running for] $sample_name"
  bwa mem -t 12 ref.fa $R1 $R2 | samtools sort -o output/$sample_name.sorted.bam
done

And suggestions on Command Substitution and Loops. :-)

ADD COMMENTlink 3 months ago SMK ♦ 1.3k
Entering edit mode
0

It worked, thank you!

ADD REPLYlink 3 months ago
evelyn
• 30
Entering edit mode
0

I am trying to add parallel in the array command like this:

parallel -j2 'bwa mem -t 12 ref.fa {$i} {$i+1} | samtools sort -o output/$sample_name.sorted.bam ::: ${arr[$i]} ::: ${arr[$i+1]}'

But it does not run.

ADD REPLYlink 3 months ago
evelyn
• 30
Entering edit mode
0

Hi evelyn,

parallel is in the next discussion. :-) I think ATpoint meant something like (modifying from the link he provided):

function BWA {
  INDEX=$1
  BASENAME=$2
  bwa mem -t 12 "${INDEX}" "${BASENAME}"_R1.fastq "${BASENAME}"_R2.fastq | samtools sort -o "${BASENAME}".sorted.bam
}; export -f BWA

ls sample_data/*_R1.fastq | awk -F "_R1.fastq" '{print $1}' | parallel "BWA ref.fa {}"

And for example, if you have 24 threads that can be used, you can add -j 2 following parallel. That will map two sets of reads simultaneously (number of jobs at the same time), 12 threads for each.

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Hi,

Thank you! Actually I am trying to run parallel using the same array as above,

total_files=`ls ./sample_data/*.fastq | wc -l`
arr=( $(ls ./sample_data/*.fastq) )
for ((i=0; i<$total_files; i+=2))
{
sample_name=`basename ${arr[$i]} | awk -F "_" '{print $1}'`
echo "[mapping running for] $sample_name"
printf "\n"
parallel 'bwa mem -t 12 ref.fa {$i} {$i+1} | samtools sort -o output/$sample_name.sorted.bam ::: ${arr[$i]} ::: ${arr[$i+1]}'

I did not want to change the entire approach.

ADD REPLYlink 3 months ago
evelyn
• 30
Entering edit mode
0

If you're still looping the array, what's parallel for?

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

Looping takes the multiple inputs at once but does not run them all simultaneously. So it gives sorted bam file from bwa for one input paired end file and then next.

ADD REPLYlink 3 months ago
evelyn
• 30
Entering edit mode
1

I see. Sorry for not considering the possibility.

You can use the loop to output the commands to a file instead of executing them, for example, do:

echo "bwa mem -t 12 ref.fa ${arr[$i]} ${arr[$i+1]} | samtools sort -o output/$sample_name.sorted.bam" >> bwa.cmds

In your case, bwa.cmds should be something like:

bwa mem -t 12 ref.fa ./sample_data/a1_R1.fastq ./sample_data/a1_R2.fastq | samtools sort -o output/a1.sorted.bam
bwa mem -t 12 ref.fa ./sample_data/a_R1.fastq ./sample_data/a_R2.fastq | samtools sort -o output/a.sorted.bam

Then you can do: cat bwa.cmds | parallel -j 2 to run two jobs simultaneously.

ADD REPLYlink 3 months ago
SMK
♦ 1.3k
Entering edit mode
0

It worked perfectly, thank you :)

ADD REPLYlink 3 months ago
evelyn
• 30
0
Entering edit mode
3 months ago
ATpoint 17k
Germany

I recommend writing a simply function for taks like this and then use GNU parallel to parallelize it over multiple input files, see A: perl script for BWA-mem on multiple different files. If you want one file at a time, simply set -j 1 in GNU parallel. That saves you nasty loops.

ADD COMMENTlink 3 months ago ATpoint 17k
Entering edit mode
0

I will give it a try. thanks!

ADD REPLYlink 3 months ago
evelyn
• 30
This thread is not open. No new answers may be added
Powered by the version 1.5