Good morning,
I have multiple fastq.gz file generated using ONT and belonging to one sample. Each fastq.gz file is composed of many reads, and each read could belong to gene 1 or gene 2, or gene 3.
I have the reference sequence for each gene.
I want to classify each read and determining if it belongs to gene 1, gene 2, or gene 3.
The goal is to generate three fastq files: one fastq.gz file for all the reads that have gene 1, a second fastq file for all the reads that have gene 2, and a third fastq file for all the reads that have gene 3.
I am considering using:
- 1/ minimap2 to align the reads from each FASTQ file to the reference sequences for gene 1, gene 2, and gene 3 separately.
- 2/ Converting the resulting SAM to BAM files to FASTQ format using samtools, filtering the reads based on their alignment to each gene.
- 3/ Compress the resulting FASTQ files to FASTQ.gz format.
This is the script that I'm managing to use :
#!/bin/bash
#### Define input and output folders
input_dir="/path/to/input/fastq_files"
output_dir="/path/to/output"
#### path to reference sequences for gene 1, gene 2, and gene 3
gene1_ref="/path/to/reference_gene1.fasta"
gene2_ref="/path/to/reference_gene2.fasta"
gene3_ref="/path/to/reference_gene3.fasta"
#### Function to align reads to references and extract reads for each gene seperately
classify_reads() {
local input_fastq="$1"
local output_prefix="$2"
### align reads to reference sequences
minimap2 -ax map-ont "$gene1_ref" "$input_fastq" > "$output_prefix"_gene1.sam
minimap2 -ax map-ont "$gene2_ref" "$input_fastq" > "$output_prefix"_gene2.sam
minimap2 -ax map-ont "$gene3_ref" "$input_fastq" > "$output_prefix"_gene3.sam
### Convert SAM to BAM and sort
samtools view -Sb "$output_prefix"_gene1.sam | samtools sort -o "$output_prefix"_gene1.sorted.bam
samtools view -Sb "$output_prefix"_gene2.sam | samtools sort -o "$output_prefix"_gene2.sorted.bam
samtools view -Sb "$output_prefix"_gene3.sam | samtools sort -o "$output_prefix"_gene3.sorted.bam
### Converting BAM to FASTQ
samtools fastq "$output_prefix"_gene1.sorted.bam | gzip > "$output_prefix"_gene1.fastq.gz
samtools fastq "$output_prefix"_gene2.sorted.bam | gzip > "$output_prefix"_gene2.fastq.gz
samtools fastq "$output_prefix"_gene3.sorted.bam | gzip > "$output_prefix"_gene3.fastq.gz
### delete intermediate files
rm "$output_prefix"_gene*.sam
rm "$output_prefix"_gene*.sorted.bam
}
### Process each input FASTQ file
for fastq_file in "$input_dir"/*.fastq.gz; do
filename=$(basename "$fastq_file" .fastq.gz)
output_prefix="$output_dir"/"$filename"
classify_reads "$fastq_file" "$output_prefix"
done
what do you thing about my strategy ? does it correct ?
thank you very much for all these informations,
but I need to generate 3 fastq.gz files , one file per gene.
based on this constat : my first strategy is complete wrong or I can use it in order to generates these 3 files?
You can use any of the methods indicated, and then split the original fastq based on what sequences hit each of the references. An adaptation of your initial method would work, but will be relatively slow and clunky compared to the
mmseqs
idea. But if you're not processing a lot of data it wouldn't make much difference overall. If you are going to adapt your original idea, merge all reference sequences into a single file to avoid the only major problem I see.