What can I do after my Pacbio genome assembly ?
1
2
Entering edit mode
7.1 years ago
Picasa ▴ 640

Hi

I have just assembled my genome with CANU. I am a beginner with Pacbio data and I am wondering now if there is something I can do more ?

1) I heard about Quiver for polishing the assembly: For this I think I need a sam file where my reads have been mapped to my genome. Can I use the mapper blasr for that case?

pacbio • 4.2k views
ADD COMMENT
6
Entering edit mode
7.1 years ago
Rox ★ 1.4k

Hi Picasa !

Indeed, you will need to perform a polishing step after assembly, this will lower your error rate from ~1% (after assembly with canu) to ~0.001-0.0000001% (depending of the coverage used for polishing, you can find information in this link : https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/FAQ.rst section "What is the expected quiver accuracy? ")

So, you indeed need a alignement file, but with PacBio, everything is a bit more complicated. Considering their bas/bax.h5 format, you will need a particular aligner (blasr you said it right) that will produce an alignement. However, this alignment will be not in bam format, but in a specific format called .cmp.h5 (cmp strand for comparison I guess).

Blasr align file by file. In order to align all your .bas/bax.h5, you are going to use an other tool called pbalign. This tool can take as an input a .fofn file (file of file ... forget about the n) which is a list of all the path that lead to your raw data, and will literally call blasr for all your raw data file and then compile their result.

Here a small example of how a .fofn file should look (sorry for horrible paths) :

/media/DATAPART1/pacbio_reads/run6/H01_1/Analysis_Results/m151220_023946_42237_c100942502550000001823214006101687_s1_p0.1.bax.h5
/media/DATAPART1/pacbio_reads/run6/H01_1/Analysis_Results/m151220_023946_42237_c100942502550000001823214006101687_s1_p0.2.bax.h5

As it was really hard for me to gather this informations, here how I used pbalign :

pbalign --forQuiver raw_reads.fofn assembly_to_polish.fasta your_output.cmp.h5

Then, you can use quiver using something like this :

quiver your_output.cmp.h5 -r assembly_to_polish.fasta -o polished_assembly.fasta

For installing pacbio tools, I redirect you to this github thread : https://github.com/PacificBiosciences/pbalign/issues/67

Don't try to install all PacBio tools by yourself, don't even try pitchfork, you will lose a lot of time just as I do.

Also, a related Biostar post here : A: Choosing de novo genome assembly

Good luck with your polishing ! This way worked fine for me after weeks or research on this subject.

ADD COMMENT
1
Entering edit mode

If he needs bam file I think he can use bwa mem -x pacbio ref.fa reads.fq | samtools view -F 4 -Sbh - | samtools sort - my_result_sorted.bam
instead of blasr

ADD REPLY
2
Entering edit mode

Yes but the problem after that is that quiver will not necesserly work well with bam files. Indeed, the particular format for PacBio files take account of various chemistery parameters and quality values. As I said in my previous post, I think that soon or later, PacBio is going to fully abandon theses format and fully use bam/sam. For now, I will not recommand to use anything else but PacBio format because I had a lot of struggle trying to use classic bam/sam format. You can see my struggle on github here : https://github.com/PacificBiosciences/pbalign/issues/48 . I also in the past tried to use quiver using a bam/sam file. It gaves me errors (I don't have wrote them to transcribe it there) because quiver wasn't taking care of sam/bam file (even if it is written on their wikipage...)

ADD REPLY
0
Entering edit mode

Hi Roxane and many thanks for your help !

However I don't have .bas/bax.h5 files but only fastq (my raw reads are fastq files) so can I still use pbalign with your command ? I mean, does it recognize fastq ?

ADD REPLY
0
Entering edit mode

It may be better overall if you can get the *.h5 files from your source.

ADD REPLY
0
Entering edit mode

Hi Picasa,

I agree with Genomax, I had a lot of trouble and could never achieve the final polishing step by using fastq files and pâcbio tools. You have to find your raw reads, they are essentials. Also, try to acknowledge what your sequencing coverage was, it is a very essential information !

ADD REPLY
0
Entering edit mode

I have the sequencing coverage information; but why is it important ?

ADD REPLY
0
Entering edit mode

Coverage is a major parameter with PacBio long read sequencing, because as their raw reads have bigger error rate (~15%), your strategy to lower down this error rate will be different depending on the number of PacBio reads you have.

If you are new to PacBio sequencing, I advise you to read this manual wrote by Mr Chakraborty : https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw654

It show the importance of sequencing coverage and how to obtain good assembly metrics depending on your coverage.

You have different strategy to polish your genome, one is called Self corrected, because you use raw PacBio reads to correct your assembly, and other method are called hybrid, because, in certain cases, you can use shorter Illumina reads to correct your assembly.

Self correction is used when you have enough coverage, Hybrid correction is used when you have low coverage.

I hope I was clear with that, if you have more question about that, don't hesitate to ask me ! But I'm sure you will find all you need in this really great manual !

ADD REPLY
0
Entering edit mode

Hi Roxane,

I was reading this for quiver

https://github.com/PacificBiosciences/GenomicConsensus

and I would like to know if you have some benchmark using the different algorithm proposed by variantCaller (quiver,arrow,plurality,poa,best).

I mean, does it change a lot ? The link above said that apparently arrow is better than quiver

ADD REPLY
0
Entering edit mode

Hi Picasa !

Sadly, I never managed to get arrow to work on my computer (installation issue), same thing with the others algorithm... I will try to test it again, I also really want to know how different the two algorithm work, and especially want to know if one of them is parallelizable (as quiver is not)

ADD REPLY

Login before adding your answer.

Traffic: 2408 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6