Biostar Test Site

This is site is used for testing only. Visit: https://www.biostars.org to ask a question.

Converting mapped + unmapped BAM files to raw FASTQ (RNA-seq data)
1
0
Entering edit mode
2.6 years ago
rodd ▴ 130

Hi all,

I need to analyse some RNA-seq data with a special aligner for repetitive elements, but the "raw" data from the cohort I am analysing came as aligned BAM files (mapped.bam + unmapped.bam files). I can obtain the raw FASTQ files from a concatenated BAM file, following this tutorial.

However, this is still resulting in reads with a secondary alignment. I was wondering if it would be ok to keep only read pairs in the BAM file which have primary alignments, thus discarding reads which either one of the pair did not align, or reads that have additional alignments (otherwise they would be duplicated in the end FASTQ files). I can't see this being an issue... yet... but please let me know if this sounds correct.

I know there are several posts like this in this and other communities, but I did not manage to find a concise way of doing this yet.

Currently, my concatenated BAM file (mapped + unmapped BAM files) looks like the following:

$samtools flagstat concatenated.bam 80893332 + 28760 in total (QC-passed reads + QC-failed reads) 5509466 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 74608978 + 0 mapped (92.23% : 0.00%) 75383866 + 28760 paired in sequencing 37950442 + 14107 read1 37433424 + 14653 read2 34757340 + 0 properly paired (46.11% : 0.00%) 65502368 + 0 with itself and mate mapped 3597144 + 0 singletons (4.77% : 0.00%) 723510 + 0 with mate mapped to a different chr 429114 + 0 with mate mapped to a different chr (mapQ>=5)  BAM RNA-Seq fastq conversion bam2fastq • 1.4k views ADD COMMENT 1 Entering edit mode Hello rodd , I would just merge the mapped.bam and unmapped.bam, sort the resulting file by read name using samtools sort -n and extract the reads to fastq using samtools fastq merged_name_sorted.bam|bgzip -c > all.fastq.gz. See also my issue on samtools github. fin swimmer ADD REPLY 0 Entering edit mode Hi finswimmer, Thanks you for your prompt response, and for the link to your post on samtools github. I will be following your advice (and the advice from our colleague who also responded to the thread). But just out of curiosity, I am still finding some discrepancies in the number of reads after converting to fastq. See below number of reads in bam file, and reads in my fastq files: $ samtools view -c -F 0x100 merged_sorted.bam        # duplications not allowed
**75,412,626**
$samtools fastq -0 merged_sorted_0.fastq -1 merged_sorted_1.fastq -2 merged_sorted_2.fastq -F 0x100 merged_sorted.bam [M::bam2fq_mainloop] discarded 0 singletons [M::bam2fq_mainloop] processed **75,412,626 reads**$ wc -l merged_sorted_1.fq
148988112


  148988112 /4 = 37,247,028 reads per file
37247028 * 2 for paired reads = **74,494,056**


So I have 918,570 reads missing in the fastq files (and they are not in samtools fastq -0 output or -s singletons_output).

0
Entering edit mode

Your second command isn't filtering out secondary alignments. Don't you want to do that?

0
Entering edit mode

Sorry, that was a typo - I did include it in my command, and have updated my previous reply.

I am only outputting the reads I want to the FASTQ files, which is great. But I am still curious as to why it's removing ~1 mi reads after the BAM-FASTQ conversion, when comparing to the output of samtools view -c -F 0x100 merged_sorted.bam.

1
Entering edit mode

Count up how often each flag turns up in your bam. Finswimmer's link suggests that that -F 0x900 is turned on whether you want it or not, so maybe that's where your million reads are going.

1
Entering edit mode
2.6 years ago

I'd do

samtools fastq -F 256 mydata.bam > mydata.fastq


This will discard secondary alignments, so each read, including unmapped reads, will only turn up once in your fastq.

0
Entering edit mode

Thank you! (Just for the record, this gave the same number of reads as when I used -F 0x900 or -F 0x100, as suggested by finswimmer. These flags are so confusing...)

1
Entering edit mode

0x100 is hexadecimal for 256. 0x900 is also filtering for supplementary alignments, you probably don't have any of those.

0
Entering edit mode

Thank you for your answers, it helped me a lot!