Question

Nested for loop in Bash

0

Entering edit mode

5.9 years ago

owenz ▴ 20

I am trying to write a simple Bash script emulating what our current external software at my lab does: Hepatitis C genotyping.

I want that for each FASTQ file input to the script, the script performs the aligning, indexing and produces a VCF file. I want that it then creates a consensus sequence. I was able to create this using 1 reference genome.

However, I wanted to introduce a degree of complexity which I'm not sure is possible with Bash. SInce there are 7 main HepC genotypes, I want to perform the aligning, indexing, VCF and consensus sequences with 7 genotypes at the same time (running in sync) so that at the end, I end up with 7 consensus sequences for each FASTQ file.

I then want to be able to compare each of these 7 consensus sequences to their respective references and obtain a % similarity score so that one can determine which is the most likely viral genotype present in the patient. I was thinking of using BLAST for this and awk to extract the necessary data.

I separated the scripts performing the alignment, indexing ect for each reference into their own script. Then I ran all these scripts inside a main script using a for loop like this

for SAMPLE_ID in $@
do
  source hcv1script.sh
  source hcv3script.sh
done

This gave me the consensus sequences for Genotype 1 only. How can I fix this? Do I need a nested loop?

next-gen sequencing bash loop • 3.6k views

ADD COMMENT • link 5.9 years ago by owenz ▴ 20

0

Entering edit mode

Thanks a lot for your help

ADD REPLY • link 5.9 years ago by owenz ▴ 20

0

Entering edit mode

If you can identify the post that helped solve this question please identify that and we can move it to an answer). You can then accept it to provide closure to this thread.

Note: It may be @cpad0112's original comment since there is a lot of nested material under there that must have been ultimately needed.

ADD REPLY • link 5.9 years ago by GenoMax 141k

0

Entering edit mode

Yes that is the post that solved it. Thanks

ADD REPLY • link 5.9 years ago by owenz ▴ 20

score 2 · Accepted Answer · 2018-05-28

2

Entering edit mode

5.9 years ago

cpad0112 21k

IF you can furnish more details, i would be able to understand the issue. Following is a simple example of bash nested loop:

Let us say you have 7 genotypes (from genotype 1 to genotype 2) and you want to publish each genotype 4 times, nested loop would be:

$ for i in genotype*; do for j in $(seq 1 4); do echo $i,$j; done ;done

output:

genotype1,1
genotype1,2
genotype1,3
genotype1,4
.
.
.
genotype7,1
genotype7,2
genotype7,3
genotype7,4

ADD COMMENT • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

My problem is to understand what variables go where in the code you posted.

SO basically my reference genome FASTA files would be in the outer loop right? and in the inner loop I would place my FASTQ results.

ADD REPLY • link 5.9 years ago by owenz ▴ 20

0

Entering edit mode

Outer loop is for genotypeN which is assigned to variable i. In the inner loop (j) you are assigning a number (1 to 4) for each iteration of the loop (see order of printing $i,$j). Thus you get,

genotype1,1
genotype1,2
genotype1,3
genotype1,4

in first iteration. Here you complete iteration of inner loop. You are back at outer loop. $i is incremented by 1 to 2 and you restart the inner $j loop which gives you

genotype2,1
genotype2,2
genotype2,3
genotype2,4

in second iteration and so on.

ADD REPLY • link 5.9 years ago by GenoMax 141k

0

Entering edit mode

If you can furnish with examples what you would like to do with fastq and fasta files, context of issue would be clear.

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

All the fasta files are the reference genomes used to align the fastq files with them

for SAMPLE_ID in $@
do
    #Index the aligner with the reference genome
    bwa index -a is HCV1.fasta
    # Align reads 1 of samples to reference genome
    bwa aln HCV1.fasta $SAMPLE_ID.1.fastq >$SAMPLE_ID.1.sai
    # Align reads 2 of samples to reference genome
    bwa aln HCV1.fasta $SAMPLE_ID.2.fastq >$SAMPLE_ID.2.sai
    # Merge and change into SAM format for both samples
    bwa sampe HCV1.fasta $SAMPLE_ID.1.sai $SAMPLE_ID.2.sai $SAMPLE_ID.1.fastq $SAMPLE_ID.2.fastq > $SAMPLE_ID.sam
    # Change the SAM to BAM format for both samples
    samtools view -bt HCV1.fasta $SAMPLE_ID.sam>$SAMPLE_ID.bam
    # Sort the BAM files
    samtools sort $SAMPLE_ID.bam -o $SAMPLE_ID.sorted.bam
    # Create the index file for the BAM files
    samtools index -b  $SAMPLE_ID.sorted.bam

#Create mpileup file and VCF file containing SNPs
samtools mpileup -IEuDSf HCV1.fasta $SAMPLE_ID.sorted.bam | bcftools view -v snps - > $SAMPLE_ID.vcf

#Compress the VCF with tabix and its index
bgzip -c $SAMPLE_ID.vcf > $SAMPLE_ID.vcf.gz
tabix -p vcf $SAMPLE_ID.vcf.gz

# Create a consensus sequence in FASTA format from newly-created VCF and the reference sequence
cat HCV1.fasta | bcftools consensus $SAMPLE_ID.vcf.gz > $SAMPLE_ID-consensus.fa
done

I was thinking of cutting the code in between the for loop and repeating it 7 times in separate scripts, then refer to them in a main script like so

source hcv1script.sh
source hcv2script.sh

...

However, I think it seems inefficient and writing it in a nested loop is simpler.

Hope this helps give some more info

ADD REPLY • link 5.9 years ago by owenz ▴ 20

2

Entering edit mode

General comment, don't source scripts unless you need to set environment variables etc.

If its a shell script, bash myscript.sh will be sufficient (or whatever your shell is).

ADD REPLY • link 5.9 years ago by Joe 21k

0

Entering edit mode

Thank you for the reply. Let me understand few more points here:

HCV1.fasta is one single genotype fasta. Like wise you have 7 fasta one for each genotype
Are these fastq files are same for each genotype or different i.e each genotype will be aligned against same set of reads (for eg. HCV1 fasta will be aligned against a.r1.fastq and a.r2.fastq, HCV2 fasta will be algined against a.r1.fastq and a.r2.fastq) or each genotype will be aligned against different sets of reads (for eg. HCV1 fasta will be aligned against a.r1.fastq and a.r2.fastq, HCV2 fasta will be algined against b.r1.fastq and b.r2.fastq)?

ADD REPLY • link 5.9 years ago by cpad0112 21k

2

Entering edit mode

let me assume that there are 7 fasta files for each genotype and they will be aligned against same set of reads (in this case (test.R1.fq.gz, test1.R1.fq.gz. test2.R1.fq.gz). Let me take one simple command which is bwa align. Now with a nested loop all 7 genotypes will be used for all the reads (test,test1,test2) alignment ( I added echo to check the final commands).

$ for i in hcv*.fa; do for j in test*.R1.*; do echo "bwa aln $i $j > ${j%.fq.*}.sai" ; done; done

bwa aln hcv_1.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_1.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_1.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_2.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_2.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_2.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_3.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_3.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_3.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_4.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_4.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_4.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_5.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_5.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_5.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_6.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_6.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_6.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_7.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_7.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_7.fa test.R1.fq.gz > test.R1.sai

However, IMO, loops are serial. For parallel execution, either you use frameworks such as snakemake, make, nextflow and also use parallel tools such as gnu-parallel.

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

The fasta files (7 distinct genotypes) will be fixed.

The fastq files are different all the time

ADD REPLY • link 5.9 years ago by owenz ▴ 20

2

Entering edit mode

If you have 7 fully contained scripts and want to execute in parallel, you can try using xargs: (assuming that scripts are nameed script 1 to script 7.sh)

$ ls script{1..7}.sh | xargs -n 1 -P 0 bash

ADD REPLY • link 5.9 years ago by cpad0112 21k

0

Entering edit mode

I see. Can you explain what do the xargs flags do in this case?

ADD REPLY • link 5.9 years ago by owenz ▴ 20

1

Entering edit mode

Useful website for explaining shell flags: explainshell

For this particular question of xargs flags see here

ADD REPLY • link 5.9 years ago by WouterDeCoster 47k

1

Entering edit mode

xargs is system command. In this case, it is passing all the scripts to bash. -P is parallel execution. -n 1 means xargs sending one input (in this case one script) at a time. Since 0 is passed to -P argument, it uses maximum processes.

ADD REPLY • link 5.9 years ago by cpad0112 21k