Biostar Beta. Not for public use.
Tutorial:Polish PacBio assembly with latest PacBio tools : an affordable solution for everyone
31
Entering edit mode
17 months ago
Roxane Boyer • 950
France / Toulouse / GeT-Plage

This tutorial addresses to anyone in possession of PacBio sequencing data. After struggling myself to find and test appropriated tools, learning PacBio specific format, asking for precisions to kind PacBio guys, I've managed to gather a lot of useful informations that everyone may need in order to process their data.

I will mainly focuses one of the most important step while doing PacBio genome assembly : the polishing step. Indeed, higher indel rate within PacBio raw reads make it necessary to correct the sequence obtained after assembly. This step consist in two things :

• correct the reference base by base to obtained a polished assembly, which will become your final assembly.

PacBio changed their tools lately, getting ride of some strange format (cmp.h5). But the latest tools are not directly available on pacBio's github, and the one proposed in github are obsolet or really hard to install (I'm refeering to pitchfork). Using advices from Mr Stucki working at PacBio, I managed to install easily and quickly all PacBio tools required for polishing. Today, I'm going to describe all the steps I've followed so anyone could access as well at PacBio tools.

Installation of PacBio tools

1. Download SMRT link : wget https://downloads.pacbcloud.com/public/software/installers/smrtlink_5.0.1.9585.zip ; unzip smrtlink_5.0.1.9585.zip -d destination_folder

Just to warn you, it seems that SMRT Link is not supported on Windows. The steps I'm describing were tested on Ubuntu 14.04 Trusty Tahr.

SMRT Link is a big tool that include all services PacBio can offer. If you have access to a cluster facility and if you match the requirements, you should probably follow classic installation instructions. However, if you just need one or 2 of PacBio tools, you probably don't want to install the whole suit, which is very heavy to install. Here what you should do :

1. Add a new non-root user "smrtanalysis" : sudo adduser smrtanalysis
2. Define in your bash what SMRT_USER is : SMRT_USER=smrtanalysis
3. Log on as the freshly created user : su $SMRT_USER -> Extra tip : type bash to grab your bash environment and have access to auto-completion while typing in terminal 4. Define SMRT_ROOT : SMRT_ROOT=opt/pacbio/smrtlink -> Extra tip : The$SMRT_ROOT directory must not exist when the installer is invoked, or installation will abort
5. Launch SMRT analysis with option "tools-only" : ./smrtlink_5.0.1.9585.run --rootdir SMRT_ROOT --smrttools-only 6. After installation completed, move into directory that contains PacBio binaries : cd /opt/pacbio/smrtlink/smrtcmds/bin -> Extra tip : add this directory in your PATH bash variable, so you will be able to launch PacBio tool from anywhere 7. Congratulations ! You can now use every PacBio tools ! Here what you are supposed to have within this directory : arrow DBdust ice_polish pbsmrtpipe bam2bam DBrm ice_quiver pbsv bam2fasta DBshow ipdSummary pbsvutil bam2fastq DBsplit ipython pbtranscript bamtools DBstats ipython2 pbvalidate bamtools-2.4.1 dexta juliet picking_up_ice bax2bam fasta2DB julietflow python blasr fasta-to-gmap-reference LA4Falcon python2 ccs fasta-to-reference LA4Ice python2.7 cleric fuse laa quiver daligner gmap LAmerge REPmask daligner_p gmap_build LAsort samtools datander gmapl motifMaker sawriter dataset HPC.daligner ngmlr separate_flnc dazcon HPC.REPmask pbalign summarizeModifications DB2Falcon HPC.TANmask pbdagcon TANmask DB2fasta ice_combine_cluster_bins pbindex undexta DBdump ice_partial pbservice variantCaller Assembly polishing with arrow Now that you successfully installed PacBio tools (I hope !), you can start to do use them. As i said, I'm going to focus on the polishing step. The two main tools for polishing are arrow and quiver, two different algorithms included in the variantCaller tool (the polishing tool). If you already wonder about the differences about the two tools, here the differences between the two of them : Quiver is supported for PacBio RS data. Arrow is supported for PacBio Sequel data and RS data with the P6-C4 chemistry. As I understand it on PacBio github (https://github.com/PacificBiosciences/GenomicConsensus), the latest and thus more efficient tool is arrow. As I was myself very disappointed by quiver result (my case is particular, high polymorphism degree), I'm only going to describe polishing steps with the latest arrow algorithm. Before to describe steps, I'm going to explain a few of my vocabulary about PacBio data. PacBio result, at least for my project, but I heard it was all the same everywhere, are organized by runs. Each run can contain several folder corresponding to one SMRT_cell. In an SMRT_cell directory, you normally have 3 .bax.h5 files, and 1 .bas.h5 files. What I call an SMRT_cell are the 3 bax.h5 files. Also, something really important I've made, you may probably want to know the polishing coverage you are going to use. If you have a small data set with reasonable coverage (40-50X), then you should probably use all your raw reads for polishing. However, if you dataset is bigger, then you should probably take a subset of your raw reads. I've heard that polishing at very high coverage (>100X), take longer without improving considerably the quality of your assembly. 80X polishing coverage is what I used for correcting my assembly. You can roughly estimate your polishing coverage by converting one SMRT_cell (3 bax files) into fasta format by using the tool pls2fasta (is present in blasr package). Then you can count the total numbers of bases for that SMRT_cell and calculate your coverage with the size of your genome species. An example of pls2fasta command I used : pls2fasta rawReadPB_s1_p0.2.bax.h5 rawReadPB_s1_p0.2.fasta -trimByRegion Now everything is explained, let's get thing started : 1. Create a .fofn file for each SMRT_cell. A fofn file (File Of Files Names) look like this : /media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb1_s1_p0.1.bax.h5 /media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb2_s1_p0.2.bax.h5 /media/loutre/GENOMIC/pacbio_reads/run2/A02_1/Analysis_Results/rawPb3_s1_p0.3.bax.h5 I can't save you with an extra tip this time.. I did this by hand (maybe someone will give a tip in comment ?) 1. For each .fofn file, launch bax2bam : a tool that will simply convert your raw bax file into a more convenient format. But be careful, theses BAM are not alignment yet. Here some bash scraps that may help you starting (sorry code insertion was not working for this): FOFN_DIR="/media/loutre/GENOMIC/pacbio_reads/fofn" COMPT=1 for FOFN in "FOFN_DIR"/* do
echo "Dealing with file "$FOFN ./bax2bam -f$FOFN -o $BAM_DIR"/SMRT-cell-"$COMPT --subread  --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV,PulseWidth,SubstitutionTag
COMPT=((COMPT+1)) done  This command will create several bam files : • .scraps.bam : contains all the sequences that were filtered out: low quality regions of the polymerase reads, adapter sequences (, barcodes, control sequences). These were stored (and labelled) in the bax files, and are basically “split” to a different BAM file when converting. • .subreads : your high-quality reads that you want to use for pbalign • .pbi : pacBio index 1. For each PacBio subreads.bam , launch pbalign to align your reads against your reference. This can produce either a bam or a sam file. This time, the output is an alignment. You can launch pbalign as following : ALIGN_DIR="/media/loutre/GENOMIC/pacbio_reads/pbalign" for PB_BAM in "BAM_DIR"/*"subreads.bam"
do
echo "Dealing with file "PB_BAM ./pbalign --nproc 32PB_BAM your/assembly.fasta ALIGN_DIR"/align_"COMPT".bam"
COMPT=\$((COMPT+1))
done


-> Extra tip : don't forget to adjust the number of cores to your own possibility

1. Merge all the produced bam files using samtools merge -> Extra tip : you will probably have to sort all your bam files before merging them.
2. Use arrow using your reference file and your merged bam. My advise is to personalize this step depending on your need. Several options are proposed, like a diploid option to allow more polymorphism, also a lot of read selection filtering options are available. So no copy/paste for this time ! -> Extra tip : You should have an eye on arrow help : arrow -h... Obvious Ostrich strikes again !

I really hope all theses informations could be useful to anyone. I really struggled to gather all theses informations. Took me almost 2 hours to write this tutorial. Of course, this is not the absolute way of using PacBio tools ! But if you are lost, starting with PacBio tools, you can use this as a start to your analysis...

I'm open to feed back, corrections, precisions, questions... I just wanted to help !

Cheers,

Roxane

1
Entering edit mode

Hi Roxane, thanks for this nice howto. That's what is missing in the PacBio docs and also in Canu docs.

You could use a dataset when aligning the raw reads back to the assembly.fasta, in this case you would only need one single "pbalign" call and there would be no need to merge BAM files.

dataset create --type SubreadSet --name myGenomeSet.xml *.subreads.bam


and then

pbalign --nproc 32 myGenomeSet.xml your/assembly.fasta myGenomeSet.bam

1
Entering edit mode

Dear sklages, the right syntax for dataset create is :

dataset create --type SubreadSet --name MyGenome  MyGenome_Set.xml *.subreads.bam

0
Entering edit mode

Thanks for that tip ! So when you create the dataset, you can specify the desired coverage for example ?

0
Entering edit mode

No, I don't think so. With command "create" it simply creates a XML file with all relevant infos about the BAM files.

0
Entering edit mode

For pacbio sequel data, I am running below command which fails with exit status

date && time pbalign --tmpDir tmp --nproc 55 sample.fa sample.contigs.fasta out.sam


Is that command fine? I am looking at this link

sample.fa: is the pacbio sequel fastq file converted to fasta format using samtools

sample.scaffolds.fa: is the scaffold fasta file. scaffolding of canu assembled contigs was done using SMIS

0
Entering edit mode

Thanks so much Roxane - that's really great and will be helpful for the future! I may be using PacBio.

0
Entering edit mode

You're welcome ! Glad it could help :)

0
Entering edit mode

Hello, Thank you for the post. Do you have any idea about how to improve available assembly using PacBio reads?

0
Entering edit mode

I never had to do such a thing, only tried de novo assembly. I would suggest you to read these post (if it's not already done), you can probably grab some useful answers : Gap-filling and scaffolding using PacBio reads Scaffolding with pacbio reads

0
Entering edit mode

Thanks Roxane!

0
Entering edit mode

Hi Kyungyong !

So I'm not really used to PacBio dataset with adaptaters, but I saw an option in the bam2bam tools from pacbio that allow to specify adaptaters and barcodes sequence. Maybe it allow a kind of filter to put apart adapaters form the output bam ?

Here is the SMRT Tools Guide that may be really useful : http://www.pacb.com/wp-content/uploads/SMRT-Tools-Reference-Guide-v4.0.0.pdf

I also found an other pacbio tools called lima : https://github.com/PacificBiosciences/barcoding

To me, it look like something you can do. But maybe I'm wrong... Tell me if theses were useful to you !

Roxane

0
Entering edit mode

Dear Roxane,

Thanks for a detailed post on Pacbio. I had been working on Illumina data but now I have been assigned to work on Pacbio data. Since I am new to this, I have few queries in mind. Firstly: I have 3 bax.h5 and 1 bas.h5 files other than .csx ans .xml file. If i am not mistaken, bas.h5 file only contains the quality information for the reads. So, for analysis we require .bax.h5 files. Also each .bax.h5 files correspond to different sample. After installation of SMRT, the first step is polishing using bax2bam that will remove all low quality and adapter sequences from the reads. After aligning the subreads.bam file to reference, why do we need to merge all bam files. And you you mean merge all bam files coming from each .bax.h5.

I need to perform quasispecie analysis for my data. once i have bam file using pbalign can i directly use the bam file to reconstruct quaasispecies or there are other necessary steps to take before i do that.

Much Thanks, zusman

1
Entering edit mode

Dear Zusman,

If i am not mistaken, bas.h5 file only contains the quality information for the reads. So, for analysis we require .bax.h5 files. Also each .bax.h5 files correspond to different sample.

Everything you said is correct so far.

Also each .bax.h5 files correspond to different sample. After installation of SMRT, the first step is polishing using bax2bam that will remove all low quality and adapter sequences from the reads.

But here, I'm not sure you clearly understood the process. bax2bam is not yet the tool that will perform polishing. bax2bam allow you to convert the "weird" raw PacBio format into a more convenient format : bam (note this is not yet an alignement). So bax2bam does not remove any read, it just convert raw reads into bam reads.

After aligning the subreads.bam file to reference, why do we need to merge all bam files. And you you mean merge all bam files coming from each .bax.h5.

Not really sure about your question here. If you have follow my logic above (but once again, this is a way of doing it, not necessary THE way), you will end with a alignement bam file for each SMRT cell (3 bax.h5 file). You are merging the bam file because arrow (this one is the polishing tool), takes only one alignement file as an input.

I need to perform quasispecie analysis for my data. once i have bam file using pbalign can i directly use the bam file to reconstruct quaasispecies or there are other necessary steps to take before i do that.

I have absolutely no idea (yet) of what quaasispecies is. I think you should explain this to me a little bit further.

Cheers,

Roxane

0
Entering edit mode

Can you guide me how i can process my data. After installation, i need to perform polishing and as you mentioned you used Arrow. Can you mention the command you used. So Arrow removes the low quality reads and adapters?? Once i have performed polishing using Arrow, i can use pbalign to align my each of filtered .bax.h5 files. Or is it that we dont need to perform filtering and directly align against reference. I am confuse how pacbio works. Quasispecies is set of different variants within a viral population. My aim to to check for every patient at different timepoints how viral population behaved. For that i have to evaluate each bax.h5 file and extract information of possible quasispecies and compare across different timepoints.

0
Entering edit mode

arrow works on aligned data. Maybe you should read [1] to get an idea what arrow is doing.

Best, Sven

0
Entering edit mode

This really help me a lot as I am new too. Do you have any idea to classify pacbio subreads into full length reads and non-full length?

0
Entering edit mode

Hello ! Well, I never did such a thing yet, I don't have any idea of how to do it, I'm sorry.

0
Entering edit mode

I'm trying to run PBalign as a step toward running Arrow. I have a large subreads.bam with ~1400x coverage. I have an alignment built with 200x coverage. I attempted to downselect my bam file with samtools to around the same coverage level but the new downselected bam file causes BLASR to freak out. I looked at the two files and they look different. any suggestions on how to make this work?

1
Entering edit mode

You should post this as a new stand-alone question.

0
Entering edit mode

These comment paths are hard to organiz, yikes.

0
Entering edit mode

On main Biostars page (or this page) find and click the New Post button at top right of the page. Paste the text of your question in there. Add relevant tags (no # sign needed before tag names).

0
Entering edit mode

I have the subreads.bam file and the assembly from canu. Can I just use the steps 1 (pbalign) and 2 (arrow) as mentioned above? Can you please point the arrow commands?

0
Entering edit mode

Hello Vijay ! As I wrote that post almost one year ago, I'm pretty sure that the PacBio command lines for arrow and quiver changed already... Also, sadly, I had to use quiver instead of arrow in the end because I had an error even using the safe install I described. That error was only when using arrow, so I never was able to try it in the end. I'm now working in a different workplace so I can't even retrieve what I tried at this time..

0
Entering edit mode

Hi Roxane

thanks for the response. I am happy that you looked at my comment even at this point since the original post is an year old. I understand that things can change pretty quickly specially for the evolving long read sequencing. Thanks again. I am still looking through other posts and struggling as you did before writing this post. If I find something, I will definitely post it here.

cheers

Vijay

0
Entering edit mode

Miss Boyer,

Thank you for your tutorial. I have a situation that I do not know how to solve. I am assembling a bacterial genome, but it seems to be the first one of its species that is sequenced. I do not have a reference genome for this species. What could I use as a reference genome? A bacteria from the same family? or E. coli? Regards Cecilio

1
Entering edit mode

Hi Cecilio !

You're welcome, glad this could help !

Concerning your issue, so you are doing do novo genome assembly. Do you have the estimated size of the bacteria's genome ? What sequencing technology did you used ? (pacbio ? illumina ?) In my opinion, you don't necessarily need a genome reference to perform de novo assembly. I didn't used one for my D. suzukii genome !

1
Entering edit mode

Miss Boyer,

Thank you for your answer. I am doing de novo assemblies of two bacteria. The sequences were obtained from pacbio. I am using Canu 1.7 & 1.8 in two different clusters. For one cluster (an Intel cluster) I am using Canu 1.8. For the other cluster (IBM) I am using Canu 1.7. The estimated size of the genomes are known.

I understand now that I do not need a reference sequence for assembling ( because I am ding a de novo assembly). However, for the next stage, "polishing", arrow needs a reference sequence (I have seen the notes on the usage of arrow). What reference sequence should I use for polishing?

Also, I have some issues with the assemblies. It seems that there may be some DNA from undesired bacteria in the pacbio output (perhaps contamination of the bacterial cultures).

I was advised to use BWA-mem (https://github.com/lh3/bwa) to map the reads to the known contaminant (I got the contaminant bacteria sequence from GenBank). After mapping, I should discard the reads that map to the contaminant and use the un-mapped reads for the assembly of my desired bacteria. This sounds good for one bacteria.

For the other one, I have got only unplaced-contings for the contaminant (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1/GCF_001020485.1_ASM102048v1_assembly_report.txt) from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1

How to map my reads to such a set of contigs (108 unplaced ones)?

Sorry to overwhelm you with so many questions.

Regards,

Cecilio1

1
Entering edit mode

Hello again Cecilio !

To make it clear here, the reference I"m talking about here ...

This post I made may not be that "old", years wise, but I don't think it is updated anymore... PacBio were changing all their tools, their formats and their technologies at that moment... I think this whole tutorial really only works with "old" data produced with the "old" PacBio machines (RSII I think ? correct me if I'm wrong), not the Sequel.

Since that day, I've switched on the Oxford Nanopore sequencing technologies and I have to admit I'm not updated anymore on the tools and methods used for PacBio :/.

So I'll say this tutorial still helps to understand what the polishing step is, but you may want indeed to make a new post and get the properly updated tools. I'll try to have a look on the nowadays PacBio methods on my side, and I'll give you my discoveries here (or on your new post), or perhaps when I'll have even more time, give a fresh update to this tutorial :o)

Have a nice day !

Cheers,

Roxane

1
Entering edit mode

Miss Boyer,

Thank you for the information. And thank you for your commitment to help others to understand the process of genome polishing. After I solve the contaminant issues on my reads I will move on to polishing.

Best regards,

Cecilio

0
Entering edit mode

cecilio11 : You may want to post part about separating the contaminants as a new question to get maximum visibility.

You can use a set of contigs to align against by creating a reference index from them.

0
Entering edit mode

Sir genomax, Is samtools the right tool to create a reference index from the unplaced contigs? Regards, Cecilio1

1
Entering edit mode

If you are using bwa mem for alignments you can use bwa index for creating the index.

That said since you have PacBio reads then you may want to look at minimap2 (from author of bwa but for long reads) instead as aligner of choice (https://github.com/lh3/minimap2 ).

0
Entering edit mode

Sir genomax, I will give minimap2 a try. Thank you for your suggestion. Cecilio1

0
Entering edit mode

I did run minimap2 by using the fastq "contaminated" reads against the "contaminant" bacteria sequence. I used a simple command line: minimap2 -a Ref.mmi XXXXX.subreadsFQ.fastq > subreads_alignment.sam Today, Dec 1tth, the cluster I used to run this command is under maintenance and I cannot check the results.

I assume that the next step will be to extract a list of the names of the sequences that mapped to the reference (i.e., seqeunces in the "subreads_alignment.sam" file) and remove those read from the original file (i.e., "XXXXX.subreadsFQ.fastq").

To remove the sequences in a list I will generate from "subreads_alignment.sam", I plan to use the tools suggested in a previous biostars post: One of them is: qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n We have quiime in the cluster, so I can run the above command.

One thing that worries me is that the line-command I used for mnimap2 allows for 15% divergence. I wonder if this will be a problem with closely related species, because it could potentially remove sequences that are "valid" for the target genome.

Have you used the option "asm5" (~5% divergence) or "map-pb" for pacbio reads?

Regards,

Cecilio

0
Entering edit mode

removed by Cecilio today Dec 11

0
Entering edit mode

Did you not see @Roxane's response above?

Please do not use SUBMIT ANSWER window when you are asking follow-up questions or responding to comments. That helps keep the thread logically organized.