I am new to Illumina data analysis and just want to be sure I am processing it correctly. I am using the code below to pair each forward and reverse fastq file and align them with bwa. The output is an aligned SAM file for each sample, but I am not sure what the messages in stdout mean or whether they are OK. The fastq files were not quality trimmed or cleaned up (adapter removal) before alignment, and maybe that is part of it. Are there additional steps I should incorporate in the analysis? Thank you :).
code
for file in /home/cmccabe/Desktop/fastq/*_R1_*.fastq
do
    # Derive the matching reverse-read file name from the forward-read name
    file2=$(echo "$file" | sed 's/_R1_/_R2_/')
    # Sample name is the text before the first hyphen, e.g. NA12878
    sample=$(basename "$file" .fastq | cut -d- -f1)
    /home/cmccabe/Desktop/fastq/bwa-0.7.17/bwa mem -M -t 16 -R "@RG\tID:$sample\tSM:$sample" /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta "$file" "$file2" > "/home/cmccabe/Desktop/fastq/${sample}_aln.sam"
done
fastq input
NA12878-100ng-E08A-C06_S5_L001_R1_001.fastq
NA12878-100ng-E08A-C06_S5_L001_R2_001.fastq
NA19240-100ng-E08A-C06_S5_L001_R1_001.fastq
NA19240-100ng-E08A-C06_S5_L001_R2_001.fastq
output
NA12878_aln.sam
NA19240_aln.sam
stdout messages (there are many more but all similar)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (193, 211, 226)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (127, 292)
[M::mem_pestat] mean and std.dev: (210.23, 21.16)
[M::mem_pestat] low and high boundaries for proper pairs: (94, 325)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (49, 224, 971)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 2815)
Would you recommend any quality trimming, adapter removal, etc. before alignment? Thank you for your help.
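Trimming isn't strictly needed, since bwa mem does local alignment: read ends that don't match the reference (adapter sequence, low-quality bases) get soft-clipped rather than forced into the alignment. If you get decent alignment metrics, don't bother.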
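If it helps, here is a rough sketch of one way to get basic alignment metrics straight from the SAM files, using only awk. The function name sam_pair_stats is made up for illustration and is not part of bwa or any other tool. It relies on the FLAG bits defined in the SAM specification (0x2 properly paired, 0x4 unmapped, 0x100 secondary, 0x800 supplementary) and counts primary records only. If samtools is installed, samtools flagstat should give a fuller summary of the same kind.

```shell
# Hypothetical helper (not part of bwa or samtools): summarise SAM FLAG bits.
# Usage: sam_pair_stats /home/cmccabe/Desktop/fastq/NA12878_aln.sam
sam_pair_stats() {
    awk -F'\t' '
        /^@/ { next }                               # skip header lines
        {
            flag = $2
            # ignore secondary (0x100) and supplementary (0x800) records
            if (int(flag / 256) % 2 || int(flag / 2048) % 2) next
            total++
            if (int(flag / 4) % 2 == 0) mapped++    # 0x4 clear: read is mapped
            if (int(flag / 2) % 2 == 1) proper++    # 0x2 set: properly paired
        }
        END {
            printf "total=%d mapped=%d proper=%d\n", total, mapped, proper
        }
    ' "$1"
}
```

If mapped and proper are both close to total, that suggests the untrimmed reads are aligning fine. If a large fraction is unmapped or not properly paired, trimming is worth a try.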