FASTQ to SAM File converter
6.8 years ago
inkprs ▴ 70

Hi,

I have a FASTQ file and a reference genome file in FASTA format.

What are the fastest tools available to convert these into a SAM file?

I also use Hadoop and Spark; are there any tools available in the big-data world?

My final goal is to create a VCF file after creating a SAM file.

fastq sequencing big data fasta • 9.3k views

You should figure out exactly what you are trying to do and understand the whole process before trying to find "the fastest tools".

6.8 years ago

"Converting" the data to sam/bam requires alignment. Although you haven't specified anything about your data, I would first suggest having a look at bwa mem.

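If you go the alignment route, a minimal invocation looks something like the following (a sketch: the read and reference file names are placeholders, and -t just sets the thread count):

bwa index ref.fa
bwa mem -t 8 ref.fa reads_1.fq reads_2.fq > aln.sam

The resulting aln.sam can then be sorted and indexed with samtools before being passed to a variant caller.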

"Converting" the data to sam/bam requires alignment.

No it doesn't! Using the BBMap package:

reformat.sh in=file.fastq out=file.sam

Completely valid sam file, really fast! :) Also that will ensure the final VCF file is extremely small and easy to work with.

*Disclaimer: I do not recommend this approach.


Yes, I'm aware of unmapped sam/bam, but decided not to add that to my answer; the OP seems sufficiently confused already :-)


Definitely faster than some of those fancy aligners.


My final goal is to create a VCF file after creating a SAM file.

Won't work for the stated final goal :)
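
For the stated goal the reads have to be aligned to the reference first. Staying within BBTools, a sketch (file names are placeholders):

bbmap.sh in=reads.fq ref=ref.fa out=mapped.sam
callvariants.sh in=mapped.sam ref=ref.fa out=variants.vcf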


Just to make sure, I tried it, and it worked (meaning it created a SAM file and a VCF file):

reformat.sh in=ATTPA.fq.gz out=foo.sam reads=100
java -ea -Xmx200m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ jgi.ReformatReads in=ATTPA.fq.gz out=foo.sam reads=100
Executing jgi.ReformatReads [in=ATTPA.fq.gz, out=foo.sam, reads=100]

Input is being processed as paired
Input:                          200 reads               30200 bases
Output:                         200 reads (100.00%)     30200 bases (100.00%)

Time:                           0.095 seconds.
Reads Processed:         200    2.10k reads/sec
Bases Processed:       30200    0.32m bases/sec

callvariants.sh in=foo.sam ref=P.heparinus.fa out=foo.vcf
java -ea -Xmx206018m -Xms206018m -cp /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/current/ var2.CallVariants in=foo.sam ref=P.heparinus.fa out=foo.vcf
Executing var2.CallVariants [in=foo.sam, ref=P.heparinus.fa, out=foo.vcf]

Loading reference.
Time:   0.097 seconds.
Processing input files.
Time:   0.018 seconds.
Memory: max=207024m, free=194063m, used=12961m

Processing variants.
Time:   0.002 seconds.

Writing output.
Time:   0.019 seconds.

0 of 0 variants passed filters (NaN%).

Substitutions:  0       NaN%
Deletions:      0       NaN%
Insertions:     0       NaN%
Variation Rate: 0/5167383
Homozygous:     0       NaN%

Time:                           0.195 seconds.
Reads Processed:         200    1.03k reads/sec
Bases Processed:       30200    0.15m bases/sec

As predicted, the resulting VCF was really small and had zero false positives.


But, is it suitable for machine learning?!?


I think the machine might eventually learn that this is not a very good approach.

6.8 years ago

I also use Hadoop and Spark; are there any tools available in the big-data world?

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155461

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data

Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This fact has a huge impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. In this way, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state of the art aligners take advantage of parallelization strategies. However, the existent solutions show limited scalability and have a complex implementation. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology as Spark to boost the performance of one of the most widely adopted aligner, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which assures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios showing noticeable results in terms of performance and scalability. A comparison to other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, with a GPL3 license.

6.8 years ago
EagleEye 7.5k

Check my previous answer at the link below and let us know if this is what you want to do; otherwise, please explain in detail.

A: FASTQs to the VCF
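
The linked answer has the details; for quick reference, a commonly used FASTQ-to-VCF pipeline with BWA, samtools and bcftools looks roughly like this (a sketch, not necessarily identical to the linked answer; all file names are placeholders):

bwa index ref.fa
bwa mem ref.fa reads_1.fq reads_2.fq | samtools sort -o aln.sorted.bam -
samtools index aln.sorted.bam
bcftools mpileup -f ref.fa aln.sorted.bam | bcftools call -mv -Oz -o variants.vcf.gz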
