I need to analyse some RNA-seq data with a special aligner for repetitive elements, but the "raw" data from the cohort I am analysing came as aligned BAM files (mapped.bam + unmapped.bam files). I can obtain the raw FASTQ files from a concatenated BAM file, following this tutorial.
However, this still leaves reads with secondary alignments in the output. I was wondering whether it would be OK to keep only the read pairs in the BAM file that have primary alignments, thus discarding pairs in which either mate did not align, as well as reads that have additional alignments (otherwise they would be duplicated in the final FASTQ files). I can't see this being an issue... yet... but please let me know if this sounds correct.
I know there are several posts like this here and in other communities, but I have not yet managed to find a concise way of doing this.
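To make the filter I have in mind concrete, here is a minimal sketch in Python that works on the SAM flag alone. It reflects one reading of the rule: secondary and supplementary records are dropped, while the primary record of the same read is kept, and a pair is kept only if both mates aligned. The bit values are the standard SAM flag bits; the function name `keep_record` and the example flags are purely illustrative.

```python
# Standard SAM flag bits.
PAIRED = 0x1           # read is paired in sequencing
UNMAPPED = 0x4         # read itself is unmapped
MATE_UNMAPPED = 0x8    # mate is unmapped
SECONDARY = 0x100      # secondary alignment record
SUPPLEMENTARY = 0x800  # supplementary alignment record


def keep_record(flag: int) -> bool:
    """Return True for a primary record of a pair in which both mates aligned.

    Hypothetical helper: it only inspects the flag, not the read name or mate.
    """
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False  # extra alignment line; the primary record carries the read
    if not flag & PAIRED:
        return False  # not part of a pair
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return False  # at least one mate did not align
    return True


# Illustrative flags, not taken from my data.
example_flags = [99, 147, 355, 73, 77]
kept = [f for f in example_flags if keep_record(f)]
print(kept)  # [99, 147]
```

If I read the `samtools view` options correctly, the same selection would be `-f 0x1 -F 0x90C` (require paired; exclude unmapped, mate unmapped, secondary and supplementary), but please correct me if that is wrong.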
Currently, my concatenated BAM file (mapped + unmapped BAM files) looks like the following:
$ samtools flagstat concatenated.bam
80893332 + 28760 in total (QC-passed reads + QC-failed reads)
5509466 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
74608978 + 0 mapped (92.23% : 0.00%)
75383866 + 28760 paired in sequencing
37950442 + 14107 read1
37433424 + 14653 read2
34757340 + 0 properly paired (46.11% : 0.00%)
65502368 + 0 with itself and mate mapped
3597144 + 0 singletons (4.77% : 0.00%)
723510 + 0 with mate mapped to a different chr
429114 + 0 with mate mapped to a different chr (mapQ>=5)
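As a sanity check on the numbers above (QC-passed column only), the short calculation below shows how the counts relate to each other. The variable names are my own, and the interpretation in the comments is my assumption about how flagstat counts records; only the arithmetic itself is certain.

```python
# Counts copied from the flagstat output above (QC-passed column).
total = 80893332
secondary = 5509466
mapped = 74608978
paired_in_sequencing = 75383866
read1 = 37950442
read2 = 37433424
both_mapped = 65502368
singletons = 3597144

# Removing the secondary records leaves exactly the "paired in sequencing" count.
primary = total - secondary
print(primary == paired_in_sequencing)  # True

# read1 and read2 together account for every primary record...
print(read1 + read2 == primary)  # True

# ...but they are not balanced, so some reads seem to lack their mate in the file.
read1_excess = read1 - read2
print(read1_excess)  # 517018

# Mapped primary records split into "itself and mate mapped" plus singletons.
primary_mapped = mapped - secondary
print(primary_mapped == both_mapped + singletons)  # True
```

So the secondary records appear to be the only thing inflating the total, and the read1/read2 imbalance suggests that some mates are missing regardless of how the secondary alignments are handled.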