Question

Automating a SNP calling workflow

0

Entering edit mode

5.4 years ago

zhou_1228 • 0

I wrote a program to automate SNP calling process, but not sure if it works since I am new to shell scricpts. If I can get some review for the scripts, I would highly appreciate your help. The below is scripts,

cd ~/results

genome=~/home/administrator/GlycinMaxNew.fasta

mkdir -p bwa sam bam vcf

for fq in ~/Tilling-by-seq/*.fastq
do
echo "working with file $fq"

base=$(basename $fq .fastq)
echo "base name is $base"

fq=~/home/administrator/Tilling-by-seq/${base}.3L_F.fastq
fq=~/home/administrator/Tilling-by-seq/${base}.3L_R.fastq
bwa=~/home/administrator/results/bwa/${base}_F.bwa
bwa=~/home/administrator/results/bwa/${base}_R.bwa
sam=~/home/administratorresults/sam/${base}_sam
bam=~/home/administrator/results/bam/${base}_bam
sorted_bam=~/home/administrator/results/bam/${base}_sorted.bam
vcf=~/home/administrator/results/vcf/${base}_vcf 

bwa aln -t 4 GlycinMaxNewbwaidx $3L_F_fq > $F_bwa
bwa aln -t 4 GlycinMaxNewbwaidx $3L_R_fq > $R_bwa
bwa sampe GlycinMaxNewbwaidx $F_bwa $R_bwa $3L_F_fq $3L_R_fq > $sam
samtools view -S -b $sam > $bam
samtools sort $bam -o $sorted_bam
samtools index $sorted_bam
freebayes -f $genome -p 24 --use-best-n-alleles 4 --pooled-discrete $sorted_bam > $vcf
done

SNP next-gen sequencing • 2.0k views

ADD COMMENT • link updated 5.4 years ago by msimmer92 ▴ 300 • written 5.4 years ago by zhou_1228 • 0

3

Entering edit mode

I think you should run what you wrote and put the result/errors here besides the code, so people can help. But first; You are here overwrite the variable fq

fq=~/home/administrator/Tilling-by-seq/${base}.3L_F.fastq    
fq=~/home/administrator/Tilling-by-seq/${base}.3L_R.fastq

same here;

bwa=~/home/administrator/results/bwa/${base}_F.bwa
bwa=~/home/administrator/results/bwa/${base}_R.bwa

I would like also to introduce you to workflow management, like snakemake. Also, there is other workflow management tools/language that will help you to be more productive.

ADD REPLY • link 5.4 years ago by Medhat 9.7k

score 2 · Answer 1 · 2018-11-07

Your question for feedback about your code because you're new to shell scripts, instead of a specific question. I will, then, give some general tips based on this (that you may know or not at this point). Nevertheless, as Medhat pointed out, in this forum it's better to post more specific coding errors/questions.

As @Medhat commented on your post, beyond the thing about overwriting your variables, you could download workflow management programs that were design specifically to guide you throughout this process of pipeline making. Snakemake, Rufus, there are a bunch of them (see How To Organize A Pipeline Of Small Scripts Together? and the review paper: A review of bioinformatic pipeline frameworks. J Leipzig, Briefings in Bioinformatics, 2016, 1–7). In my lab, we haven't use this so far; we use text editors to visualize our pipelines nicely while we write, such as Visual Studio Code (my personal favorite), Emacs and Gedit depending on the person, and for now we're fine with this.
In my lab, at least, we consider it's good practice to always call variables in Bash with curly brackets (as you did with ${base}), because it prevents potential mistakes that can happen (see https://stackoverflow.com/questions/8748831/when-do-we-need-curly-braces-around-shell-variables). I recommend you do it, since I see you could benefit from that.
Taking automatization to the next level. To create a shell script that is as autonomous as possible, try to implement the concept of arguments. The idea is to make the code as general as possible to be able to be used plenty of times (avoid "hardcoding", that is, a lot of specific options and file paths or parameters that you typed in your code, but don't work for other files/cases). In the world of pipeline development, this is very important and will make your code mature significantly. To understand what I mean and how this might be useful for you, please read more about this here: http://linuxcommand.org/lc3_wss0120.php.
The choice of programming language. Certain ones, such as Bash and Python, are good as the "glue" that puts everything together (in other words, to make the principal code and then call programs and sub-scripts from there). This main code will be called, in programming jargon, a "wrapper". For bioinformatics analyses such as yours, a lot of great R packages are the gold standard for certain analyses. Because of this, it is likely that you end up creating a wrapper plus one or more R scripts that will be called by that wrapper (note: it is not good practice to use R for pipeline making - it is good that you're using Bash). It sounds trivial, but I point this out because some people try to use R because it's the first language they know (see How to write effective and stable bioinformatics pipeline in R ? for more opinions). In terms of which programming language is best, this is up to you, your field of study, your project and what it is used in your lab. For certain fields, like machine learning, you might need python because of the size of certain data or calculations. For gluing together programs in an average bioinformatics pipeline, such as SNP calling or differential expression analysis, whichever language you feel comfortable with and allows you to do a neat work is fine (you'll end up developing your own style, after a while :) that will be shaped by these decisions). It could happen that you face a new project and you decide to learn a new one and switch to be able to solve that particular situation, so keep an open mind and get minimally informed about the potentialities of other languages to be able to recognize when you might need another one.
Comment your code accordingly. The code is read by the computer, but also by humans. Even though your goal should be that the code is as self-explanatory as possible, comments are useful for both, yourself and others, specially in bioinformatics pipelines that people from different backgrounds will probably be using (not everyone has the ability to understand code). I recommend you visiting https://github.com/ , where you could see a lot of code and pipelines that are published by professionals, to get ideas regarding the amount (and type) of comments that is reasonable (too much or too little can be a problem).

Happy coding!