I wrote the script below, which takes a SAM file as input and creates a SAM and a FASTA file for each contig:
#!/usr/bin/env python3
import click
import os

@click.command()
@click.option('--sam', help="SAM file", required=True)
@click.option('--output', help="Output directory", required=True)
def retrieve_contig_reads(sam, output):
    with open(sam) as align:
        for line in align:
            try:
                parts = line.rstrip().split('\t')
                illumina_id = parts[0]
                illumina_seq = parts[9]
                contig_name = parts[2]
                with open(os.path.join(output, contig_name + ".fasta"), 'a') as fasta:
                    fasta.write(">" + illumina_id + '\n')
                    fasta.write(illumina_seq + '\n')
                with open(os.path.join(output, contig_name + ".sam"), 'a') as sam_file:
                    sam_file.write(line)  # line already ends with a newline
            except IndexError:
                continue

if __name__ == '__main__':
    retrieve_contig_reads()
Unfortunately, it has been running for more than two weeks and still has not completed. Is there a faster way to retrieve the alignment and the reads for each contig?
Thank you in advance,
Your code seems relatively* efficient; I have no idea why it would run for two weeks. The only suspect is the file-open overhead of writing each line. You can certainly avoid that by sorting the BAM/SAM file beforehand (samtools sort). This sorts all reads by alignment position and is a de facto standard required by many downstream programs anyway. After that, you can essentially open a new pair of output files only when you see a new contig_name. Also look into the pysam module; it might be helpful, too.

*Edit: putting a bit more emphasis on "relatively", keeping Pierre's comment below in mind.
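A minimal sketch of that approach, for illustration (the function name split_by_contig is mine, not from the original script; it assumes the input SAM has been coordinate-sorted with samtools sort, so that all records for a contig are adjacent and each output pair is opened exactly once):

```python
#!/usr/bin/env python3
# Sketch: stream a coordinate-sorted SAM (e.g. from `samtools sort`) and
# open each contig's output files once, instead of once per record.
import os
import sys

def split_by_contig(sam_path, output_dir):
    current_contig = None
    fasta = sam_out = None
    with open(sam_path) as align:
        for line in align:
            if line.startswith('@'):      # skip SAM header lines
                continue
            parts = line.rstrip('\n').split('\t')
            if len(parts) < 11:           # skip malformed lines
                continue
            read_id, contig, seq = parts[0], parts[2], parts[9]
            if contig != current_contig:
                # New contig: close the previous pair, open a fresh one.
                if fasta:
                    fasta.close()
                    sam_out.close()
                fasta = open(os.path.join(output_dir, contig + ".fasta"), 'w')
                sam_out = open(os.path.join(output_dir, contig + ".sam"), 'w')
                current_contig = contig
            fasta.write(">" + read_id + "\n" + seq + "\n")
            sam_out.write(line)           # line already ends with '\n'
    if fasta:
        fasta.close()
        sam_out.close()

if __name__ == '__main__' and len(sys.argv) == 3:
    split_by_contig(sys.argv[1], sys.argv[2])
```

Because the input is sorted, each contig triggers exactly one pair of open() calls, so the per-record I/O drops to plain sequential writes.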
I'm not a Python guy, but unless I'm wrong, you're opening a new file (in append mode) for every SAM record? Opening a file costs a lot of I/O.