Is my file created completely
1
0
Entering edit mode
6.9 years ago
micro32uvas ▴ 10

Hello,

I ran a command for the generation of SAM file from my fastq data. Some one just turned my system off. Its >70GB .sam file created. How can I know that the sam file is created completely.

The file isn't corrupt and opens up, but I am afraid it might not be completed, since the terminal is also closed.

Your help is desired here.

Regards

sam BWA-MEM BWA • 2.4k views
ADD COMMENT
3
Entering edit mode

I naively assume that your aligner processes reads sequentially as they are present in the fastq file. If so, you could check if the last read in the fastq file is present in the sam file.

But to be entirely sure safest would be to rerun it.

Also: if possible, avoid the creation of a .sam file and directly create a (sorted) .bam That will save in time/memory/intermediate files.

ADD REPLY
1
Entering edit mode

Another advantage of BAM Files is that they have got 28-byte EOF-signature at the end. So any truncation of bamfiles can be easily detected.

ADD REPLY
0
Entering edit mode

Its just sam created so far. I've pipelined bam into it

ADD REPLY
0
Entering edit mode

Its just sam created so far

In fact, that's the point. You should always create (sorted) bam directly by using pipes. Bam files are more easy to handle, easy to detect corruption and take way much lesser space.

ADD REPLY
0
Entering edit mode

Some one just turned my system off.

One cannot take things for granted in some parts of the world.

ADD REPLY
0
Entering edit mode

But, if it was written to bam, OP would easily find out if it's truncated or not, right?

ADD REPLY
0
Entering edit mode

With Bamfile, of course yes. But am not sure if this was your Q?!

ADD REPLY
0
Entering edit mode

Its driving me crazy, but yes worst case scenerios are always there at the door step just exactly when you set to go all smooth. Biostars is a great blessing though!

ADD REPLY
0
Entering edit mode

That is one good suggestion. I checked it and looks like sam is created sequenctially and tailing the both fastq and sam file reveals same headers. Apparently it seems as it it ran completely. But I would definitely rerun it rather assume about it completion.

Thanks a ton!

ADD REPLY
0
Entering edit mode

Hi Wouter,

Could you guide me to the command for Bam file creation directly, instead of sam file. I wantedto do it directly in a single command with pipelines but it wont work that way.

ADD REPLY
1
Entering edit mode
what_ever_is_producing_sam | samtools view -bS > my.bam
ADD REPLY
1
Entering edit mode

In addition to the general answer from Santosh, an example for bwa (also doing sorting of bam):

bwa mem genome.fa reads.fastq | samtools sort -O BAM -o output.bam -

Note the final - in the command. That's not a typo. That means that samtools has to read from stdin.

ADD REPLY
0
Entering edit mode

Does this only work in single thread mode?

ADD REPLY
0
Entering edit mode

Sorry wrong comment earlier. Deleted.. Why it should not work in single-threads?

ADD REPLY
0
Entering edit mode

You can run both bwa and samtools with multiple threads in this command.

bwa mem -t 48 genome.fa reads.fastq | samtools sort -@ 20 -O BAM -o output.bam -

Adjust accordingly to your available infrastructure.

ADD REPLY
0
Entering edit mode

With a construct like this I don't know if the sort does not start until the alignments are complete since sorting is not happening in memory and essentially random alignment data is streaming in from multiple threads. Perhaps there is some programming trick that makes this more efficient to use as a pipe that I do not know about.

I sort my BAM files after they are created.

ADD REPLY
1
Entering edit mode

I don't know if the sort does not start until the alignments are complete

Not the case. Sorting several partially sorted data is cheaper than completely random data.

Sorting start as soon as it has some data (may be there is a minimum threshold). This you can check by running the above command and monitoring the process by top. You will see that sorting threads top the memory/cpu consumption many times during the whole alignment process.

ADD REPLY
0
Entering edit mode

I'm not sure about how efficient this is, but it avoids intermediate files. Would be interesting to benchmark...

ADD REPLY
2
Entering edit mode

I mostly use bbmap which writes BAM file directly so there are no intermediate files in that sense.

I am going to test to see if there is actually a huge benefit to piping. The benefit, if any, will only be there when you have a corresponding high performance storage subsystem. Doing this on a single CPU machine with local disks is going to easily saturate PCI-E bus.

ADD REPLY
0
Entering edit mode

Would be very much interested to know the results of your test

ADD REPLY
1
Entering edit mode

A small test (2433680 reads, 60983363 bases) of single-end miRNA sequence data aligned to GRCh38 genome using BBMap (v. 37.22) and Samtools (v. 1.4.0).

        1. BBMap to generate a BAM (16 threads) 
        real    10m34.955s
        user    158m34.678s
        sys     0m39.217s

        2. Sort the BAM file (16 threads)
        real    0m3.928s
        user    0m13.162s
        sys     0m0.525s

        3. BBMap generated BAM piped into Samools sort (16 threads each)
        real    9m53.137s
        user    156m5.185s
        sys     0m32.050s

        4. BBMap generated SAM piped into Samools sort (16 threads each)
        real    11m15.153s
        user    170m3.938s
        sys     0m35.918s
ADD REPLY
0
Entering edit mode

Oh quite nice improvement.

ADD REPLY
0
Entering edit mode

Comment with an implied /s?

Just trying to make sure :)

ADD REPLY
0
Entering edit mode

Well it's 5% time profit, so probably not too bad on bigger datasets (assuming it's a linear gain). But for me the main reason is having no lingering intermediate files.

ADD REPLY
1
Entering edit mode

With BBMap there are no intermediate files (unless you are counting the intermediate files produced by samtools during sort).

Here is another test with a bigger dataset. Genomic sequencing (21802852 reads, 3179327811 bases) aligning to 20 bacterial genomes.

1. BBMap to generate a BAM (16 threads)
real    6m13.342s
user    106m57.934s
sys     0m21.866s

2. Sort the BAM file (16 threads)
real    1m9.774s
user    5m50.328s
sys     0m20.268s

3. BBMap generated BAM piped into Samools sort (16 threads each)
real    6m51.262s
user    102m12.606s
sys     0m32.568s

4. BBMap generated SAM piped into Samools sort (16 threads each)
real    7m15.386s
user    105m30.430s
sys     0m41.786s
ADD REPLY
0
Entering edit mode

Your unsorted bam is an intermediate file, no?

Slightly unrelated: is there a benchmark comparison of bbmap vs bwa by chance for speed, memory, accuracy?

ADD REPLY
0
Entering edit mode

I don't think of it that way but you are correct.

ADD REPLY
0
Entering edit mode

So, there is not much difference. However, it's difficult to make a direct comparison here because your BBmap is producing Bam, which is a compressed file-format. Sorting it needs decompressing, then sorting, then re-compressing. If a program produces sam, that is already uncompressed, you would be better off direct sorting it, and converting to sorted bam. In this way, you will save time which is for 1) compressing/converting the initial sam to bam 2) decompressing the bam for sorting.

ADD REPLY
1
Entering edit mode

We are on the verge of hijacking this thread so this is my last addition.

BBMap can produce SAM files (by just changing out=output.sam instead of out=output.bam) and to address your point I added the stat for that method (BBMap SAM in to samtools sort) in the post above. It takes an additional 25 seconds by that method.

ADD REPLY
0
Entering edit mode

Here's what I have been working on: But somehow this isnt working well on bam filling

#!/bin/bash
bwa mem -t 48 -M ref.fa data_1.fastq data_2.fastq >mini-data.sam
samtools view -b -S -F 2308 -t 48 data.sam >data_unsorted.bam
samtools sort data_unsorted.bam data| samtools index data.bam
ADD REPLY
0
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

No viewing? no samtools view command? In that case we are not indexing either. Additionally the bam created here, would be sorted right?

ADD REPLY
0
Entering edit mode

The output of bwa is sam format, which goes directly into samtools sort and gets outputted as bam. The resulting bam can be indexed (but not using pipes!).

ADD REPLY
0
Entering edit mode
**bwa mem -t 48 genome.fa reads.fastq | samtools sort -@ 20 -O BAM -o output.bam -**

It didnt work, following are the errors: sort: invalid option -- 'O' open: No such file or directory [bam_sort_core] fail to open file BAM [M::bwa_idx_load_from_disk] read 0 ALT contigs

ADD REPLY
0
Entering edit mode

I guess your samtools version is outdated.

ADD REPLY
0
Entering edit mode

Generally, there is no need for -O BAM for bam (default). From samtools manual:

By default, samtools tries to select a format based on the -o filename extension; if output is to standard output or no format can be deduced, bam is selected.

ADD REPLY
1
Entering edit mode

If a process did not complete cleanly you should consider re-running it again. Time saved now may come back to haunt you later.

ADD REPLY
0
Entering edit mode

there isn't anyway to check the process??? any command? anything?

ADD REPLY
0
Entering edit mode

You could check to see how far the alignments got (based on reads that are in the SAM file). Remove any corrupt records at the end of the file (you may have to use sed creatively since you can't open a 70G file in an editor). You could then start a new alignment job with the subset of reads that were not originally aligned. Then finally merge the two alignment files.

ADD REPLY
0
Entering edit mode
6.9 years ago
h.mon 35k

You may try BamHash. It may be faster and/or safer to just map again, though.

ADD COMMENT
0
Entering edit mode

There was no BAM file in this case and it is not clear if BamHash can be used with SAM files (though it can be used with FASTQ files).

ADD REPLY

Login before adding your answer.

Traffic: 2324 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6