splitting a SAM file
2
1
Entering edit mode
5.0 years ago
Ric ▴ 430

I wrote the below script which takes a SAM file as input and creates for each contig a SAM and FASTA file:

#!/usr/bin/env python3

import click
import os

@click.command()
@click.option('--sam', help="SAM files", required=True)
@click.option('--output', help="Output directory", required=True)
def retrieve_contig_reads(sam, output):

    with open(sam) as align:
        for line in align:
            try:
                parts = line.rstrip().split('\t')
                illumina_id = parts[0]
                Illumina_seq= parts[9]
                contig_name = parts[2]

                with open(os.path.join(output, contig_name + ".fasta"), 'a') as fasta:
                    fasta.write(">" + illumina_id + '\n')
                    fasta.write(Illumina_seq + '\n')

                with open(os.path.join(output, contig_name + ".sam"), 'a') as sam_line:
                    sam_line.write(line + '\n')


            except IndexError:
                continue

if __name__ == '__main__':

    retrieve_contig_reads()

Unfortunately, it ran for more than 2 weeks and has still not completed. Is there a faster way to retrieve the alignment and the reads for each contig?

Thank you in advance,

alignment • 2.3k views
ADD COMMENT
1
Entering edit mode

your code seems relatively* efficient, no idea why it would run 2 weeks. The only thing might be the file open overhead for writing each line. You can certainly avoid that by sorting the bam/sam file before hand (samtools sort). This will sort all reads by alignment position and is a de facto standard required by many downstream programs anyway. After that, you can essentially open a new file once you see a new contig_name. Also look into the pysam module, this might be helpful, too.

*Edit: putting a bit more emphasis on relative, keeping Pierre's comment below in mind.

ADD REPLY
1
Entering edit mode

i'm not a python guy, but unless I'm wrong you're opening a new file (append mode) for every SAM record ? Opening a file costs a lot of I/O.

ADD REPLY
6
Entering edit mode
5.0 years ago

I wouldn't use python for this...

samtools index alignment.bam
samtools idxstats alignment.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.bam alignment.bam {} && samtools fasta {}.bam | gzip > {}.fasta.gz'
  1. Create an index of a your bam file
  2. using idxstats, cut and grep -v get all contigs
  3. use gnu parallel to parallelize processes, adjust -j accordingly to your system
  4. for each contig: use samtools view to get the contig in a bam file, and samtools fasta to get the reads in fasta format
ADD COMMENT
0
Entering edit mode

I ran this commands:

samtools sort output.gfa1.onlymapped.sam -o output.gfa1.onlymapped.sorted.bam
samtools index output.gfa1.onlymapped.sorted.bam
samtools idxstats output.gfa1.onlymapped.sorted.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.sam output.gfa1.onlymapped.sorted.sam {} && samtools fasta {}.sam > {}.fasta'

Unfortunately, all sam files are empty and no FASTA files have been generated. Please find below the log file:

[bam_sort_core] merging from 370 files and 1 in-memory blocks...
^M^M#   0 sec Backbone_8                                                            ^M0^M0% 0:4947=0s Backbone_8                                                                                                                 [main_samview] random alignment retrieval only works for indexed BAM or CRAM files.

...

99% 4946:1=0s Backbone_4947                                                     [main_samview] random alignment retrieval only works for indexed BAM or CRAM files.
100% 4947:0=0s Backbone_4947

What did I miss?

Thank you in advance,

ADD REPLY
0
Entering edit mode

First you have to run parallel with --dryrun and see if the comments which are printed are what you would expect.

Something that you could try (but is not what you originally asked for) is to write to bam in the parallel jobs, and also adjust the second part then which creates the fasta file. You may have to index the bam file first, so you could add that too:

samtools idxstats output.gfa1.onlymapped.sorted.bam | cut -f 1 | grep -v '*' | parallel -j 8 --bar 'samtools view -o {}.bam output.gfa1.onlymapped.sorted.sam {} && samtools index {}.bam && samtools fasta {}.bam > {}.fasta'
ADD REPLY
0
Entering edit mode

I used your command but the files are still empty. The dryrun function created e.g. this command samtools view -o Backbone_1.bam output.gfa1.onlymapped.sorted.sam Backbone_1 && samtools index Backbone_1.bam && samtools fasta Backbone_1.bam > Backbone_1.fasta. What did I miss?

ADD REPLY
0
Entering edit mode

for trouble shooting splite the above command in its elements, for example:

samtools view -o Backbone_1.bam output.gfa1.onlymapped.sorted.sam Backbone_1

I'm pretty sure the Backbone_1 at the end is unnecessary

ADD REPLY
0
Entering edit mode

Backbone_1 is a contig name in this case, so it is necessary to split the alignment per chromosome.

ADD REPLY
0
Entering edit mode

the input in that samtools view is a sam file output.gfa1.onlymapped.sorted.sam. The error you received is random alignment retrieval only works for indexed BAM or CRAM files.. That looks clear to me.

ADD REPLY
0
Entering edit mode
5.0 years ago
Carambakaracho ★ 3.2k

Exactly, as stated above with open(file) as f: reopens a file for each line, and removing this will increase performance, but I'd be surprised if this will increase the running time dramatically, like 2 hours instead of 2 weeks. There's something wrong in addition, but my first guess was some sort of I/O bottleneck, too

ADD COMMENT

Login before adding your answer.

Traffic: 2657 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6