Hi all, I am trying to realign a whole genome BAM file from one reference genome to another. Because the reference genome is updated. The process involves converting the name-sorted BAM file to fastq, then realigning the fastq to a new reference.
samtools sort -n input.bam -o input_n.sorted.bam
samtools fastq -1 output_r1.fq.gz -2 output_r2.fq.gz -0 output_0.fq input_n.sorted.bam
then map to ref genome:
bwa mem –t 12 -M -R "@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1" reference.fa raw_read1.fq.gz raw_read2.fq.gz > aligned_reads.sam
samtools view -S -b aligned_reads.sam > aligned_reads.bam
samtools sort aligned_reads.bam –o aligned_reads_sorted.bam
- why the number of total reads decrease?
- Should I do fixmate before sorting (5) or not?
- Part of my data are mark duplicated bam files, I use the following process for them:
primary bam file:
103447197 + 0 in total (QC-passed reads + QC-failed reads)
437391 + 0 secondary
0 + 0 supplementary
763558 + 0 duplicates
102387760 + 0 mapped (98.98%:-nan%)
103009806 + 0 paired in sequencing
51504903 + 0 read1
51504903 + 0 read2
99225776 + 0 properly paired (96.33%:-nan%)
101576826 + 0 with itself and mate mapped
373543 + 0 singletons (0.36%:-nan%)
1967686 + 0 with mate mapped to a different chr
1253576 + 0 with mate mapped to a different chr (mapQ>=5)
new bam file:
103395186 + 0 in total (QC-passed reads + QC-failed reads)
385380 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
102673297 + 0 mapped (99.30% : N/A)
103009806 + 0 paired in sequencing
51504903 + 0 read1
51504903 + 0 read2
100273090 + 0 properly paired (97.34% : N/A)
102010764 + 0 with itself and mate mapped
277153 + 0 singletons (0.27% : N/A)
1151380 + 0 with mate mapped to a different chr
647241 + 0 with mate mapped to a different chr (mapQ>=5)
52011 reads have been lost
but lots of reads have been lost.
samtools flagstat for primary bam:
189134632 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
182864553 + 0 mapped (96.68% : N/A)
189134632 + 0 paired in sequencing
94520985 + 0 read1
94613647 + 0 read2
177873826 + 0 properly paired (94.05% : N/A)
181579511 + 0 with itself and mate mapped
1285042 + 0 singletons (0.68% : N/A)
3234229 + 0 with mate mapped to a different chr
2597267 + 0 with mate mapped to a different chr (mapQ>=5)
samtools view -c BL_72656_unmap_sort.bam
2817937
samtools view -c BL_72656_map_sort.bam
72385932
189134632 >>> (2817937+72385932)
Thanks in advance to anyone how can help me to understand if this procedure was right, I miss something, or just this is the output....
Cheers.
Please use the formatting bar (especially the
code
option) to present your post better. I've done it for you this time.Not the source of your issue, but for your information you can simplify the steps below:
To just this:
And as such save time, disk space and avoid intermediate files.
Thanks a lot for you kind help.