Nested for loop in Bash
1
0
Entering edit mode
5.9 years ago
owenz ▴ 20

I am trying to write a simple Bash script emulating what our current external software at my lab does: Hepatitis C genotyping.

I want that for each FASTQ file input to the script, the script performs the aligning, indexing and produces a VCF file. I want that it then creates a consensus sequence. I was able to create this using 1 reference genome.

However, I wanted to introduce a degree of complexity which I'm not sure is possible with Bash. SInce there are 7 main HepC genotypes, I want to perform the aligning, indexing, VCF and consensus sequences with 7 genotypes at the same time (running in sync) so that at the end, I end up with 7 consensus sequences for each FASTQ file.

I then want to be able to compare each of these 7 consensus sequences to their respective references and obtain a % similarity score so that one can determine which is the most likely viral genotype present in the patient. I was thinking of using BLAST for this and awk to extract the necessary data.

I separated the scripts performing the alignment, indexing ect for each reference into their own script. Then I ran all these scripts inside a main script using a for loop like this

for SAMPLE_ID in $@
do
  source hcv1script.sh
  source hcv3script.sh
done

This gave me the consensus sequences for Genotype 1 only. How can I fix this? Do I need a nested loop?

next-gen sequencing bash loop • 3.6k views
ADD COMMENT
0
Entering edit mode

Thanks a lot for your help

ADD REPLY
0
Entering edit mode

If you can identify the post that helped solve this question please identify that and we can move it to an answer). You can then accept it to provide closure to this thread.

Note: It may be @cpad0112's original comment since there is a lot of nested material under there that must have been ultimately needed.

ADD REPLY
0
Entering edit mode

Yes that is the post that solved it. Thanks

ADD REPLY
2
Entering edit mode
5.9 years ago

IF you can furnish more details, i would be able to understand the issue. Following is a simple example of bash nested loop:

Let us say you have 7 genotypes (from genotype 1 to genotype 2) and you want to publish each genotype 4 times, nested loop would be:

$ for i in genotype*; do for j in $(seq 1 4); do echo $i,$j; done ;done

output:

genotype1,1
genotype1,2
genotype1,3
genotype1,4
.
.
.
genotype7,1
genotype7,2
genotype7,3
genotype7,4
ADD COMMENT
0
Entering edit mode

My problem is to understand what variables go where in the code you posted.

SO basically my reference genome FASTA files would be in the outer loop right? and in the inner loop I would place my FASTQ results.

ADD REPLY
0
Entering edit mode

Outer loop is for genotypeN which is assigned to variable i. In the inner loop (j) you are assigning a number (1 to 4) for each iteration of the loop (see order of printing $i,$j). Thus you get,

genotype1,1
genotype1,2
genotype1,3
genotype1,4

in first iteration. Here you complete iteration of inner loop. You are back at outer loop. $i is incremented by 1 to 2 and you restart the inner $j loop which gives you

genotype2,1
genotype2,2
genotype2,3
genotype2,4

in second iteration and so on.

ADD REPLY
0
Entering edit mode

If you can furnish with examples what you would like to do with fastq and fasta files, context of issue would be clear.

ADD REPLY
0
Entering edit mode

All the fasta files are the reference genomes used to align the fastq files with them

for SAMPLE_ID in $@
do
    #Index the aligner with the reference genome
    bwa index -a is HCV1.fasta
    # Align reads 1 of samples to reference genome
    bwa aln HCV1.fasta $SAMPLE_ID.1.fastq >$SAMPLE_ID.1.sai
    # Align reads 2 of samples to reference genome
    bwa aln HCV1.fasta $SAMPLE_ID.2.fastq >$SAMPLE_ID.2.sai
    # Merge and change into SAM format for both samples
    bwa sampe HCV1.fasta $SAMPLE_ID.1.sai $SAMPLE_ID.2.sai $SAMPLE_ID.1.fastq $SAMPLE_ID.2.fastq > $SAMPLE_ID.sam
    # Change the SAM to BAM format for both samples
    samtools view -bt HCV1.fasta $SAMPLE_ID.sam>$SAMPLE_ID.bam
    # Sort the BAM files
    samtools sort $SAMPLE_ID.bam -o $SAMPLE_ID.sorted.bam
    # Create the index file for the BAM files
    samtools index -b  $SAMPLE_ID.sorted.bam

#Create mpileup file and VCF file containing SNPs
samtools mpileup -IEuDSf HCV1.fasta $SAMPLE_ID.sorted.bam | bcftools view -v snps - > $SAMPLE_ID.vcf

#Compress the VCF with tabix and its index
bgzip -c $SAMPLE_ID.vcf > $SAMPLE_ID.vcf.gz
tabix -p vcf $SAMPLE_ID.vcf.gz

# Create a consensus sequence in FASTA format from newly-created VCF and the reference sequence
cat HCV1.fasta | bcftools consensus $SAMPLE_ID.vcf.gz > $SAMPLE_ID-consensus.fa
done

I was thinking of cutting the code in between the for loop and repeating it 7 times in separate scripts, then refer to them in a main script like so

source hcv1script.sh
source hcv2script.sh

...

However, I think it seems inefficient and writing it in a nested loop is simpler.

Hope this helps give some more info

ADD REPLY
2
Entering edit mode

General comment, don't source scripts unless you need to set environment variables etc.

If its a shell script, bash myscript.sh will be sufficient (or whatever your shell is).

ADD REPLY
0
Entering edit mode

Thank you for the reply. Let me understand few more points here:

  1. HCV1.fasta is one single genotype fasta. Like wise you have 7 fasta one for each genotype
  2. Are these fastq files are same for each genotype or different i.e each genotype will be aligned against same set of reads (for eg. HCV1 fasta will be aligned against a.r1.fastq and a.r2.fastq, HCV2 fasta will be algined against a.r1.fastq and a.r2.fastq) or each genotype will be aligned against different sets of reads (for eg. HCV1 fasta will be aligned against a.r1.fastq and a.r2.fastq, HCV2 fasta will be algined against b.r1.fastq and b.r2.fastq)?
ADD REPLY
2
Entering edit mode

let me assume that there are 7 fasta files for each genotype and they will be aligned against same set of reads (in this case (test.R1.fq.gz, test1.R1.fq.gz. test2.R1.fq.gz). Let me take one simple command which is bwa align. Now with a nested loop all 7 genotypes will be used for all the reads (test,test1,test2) alignment ( I added echo to check the final commands).

$ for i in hcv*.fa; do for j in test*.R1.*; do echo "bwa aln $i $j > ${j%.fq.*}.sai" ; done; done

bwa aln hcv_1.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_1.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_1.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_2.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_2.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_2.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_3.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_3.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_3.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_4.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_4.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_4.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_5.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_5.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_5.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_6.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_6.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_6.fa test.R1.fq.gz > test.R1.sai
bwa aln hcv_7.fa test1.R1.fq.gz > test1.R1.sai
bwa aln hcv_7.fa test2.R1.fq.gz > test2.R1.sai
bwa aln hcv_7.fa test.R1.fq.gz > test.R1.sai

However, IMO, loops are serial. For parallel execution, either you use frameworks such as snakemake, make, nextflow and also use parallel tools such as gnu-parallel.

ADD REPLY
0
Entering edit mode

The fasta files (7 distinct genotypes) will be fixed.

The fastq files are different all the time

ADD REPLY
2
Entering edit mode

If you have 7 fully contained scripts and want to execute in parallel, you can try using xargs: (assuming that scripts are nameed script 1 to script 7.sh)

$ ls script{1..7}.sh | xargs -n 1 -P 0 bash
ADD REPLY
0
Entering edit mode

I see. Can you explain what do the xargs flags do in this case?

ADD REPLY
1
Entering edit mode

Useful website for explaining shell flags: explainshell

For this particular question of xargs flags see here

ADD REPLY
1
Entering edit mode

xargs is system command. In this case, it is passing all the scripts to bash. -P is parallel execution. -n 1 means xargs sending one input (in this case one script) at a time. Since 0 is passed to -P argument, it uses maximum processes.

ADD REPLY

Login before adding your answer.

Traffic: 1959 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6