This tutorial addresses to anyone in possession of PacBio sequencing data. After struggling myself to find and test appropriated tools, learning PacBio specific format, asking for precisions to kind PacBio guys, I've managed to gather a lot of useful informations that everyone may need in order to process their data.
I will mainly focuses one of the most important step while doing PacBio genome assembly : the polishing step. Indeed, higher indel rate within PacBio raw reads make it necessary to correct the sequence obtained after assembly. This step consist in two things :
- aligning your raw PacBio reads against the reference (your freshly baked PacBio assembly)
- correct the reference base by base to obtained a polished assembly, which will become your final assembly.
PacBio changed their tools lately, getting ride of some strange format (cmp.h5). But the latest tools are not directly available on pacBio's github, and the one proposed in github are obsolet or really hard to install (I'm refeering to pitchfork). Using advices from Mr Stucki working at PacBio, I managed to install easily and quickly all PacBio tools required for polishing. Today, I'm going to describe all the steps I've followed so anyone could access as well at PacBio tools.
Installation of PacBio tools
- Download SMRT link :
wget https://downloads.pacbcloud.com/public/software/installers/smrtlink_188.8.131.5285.zip ;
unzip smrtlink_184.108.40.20685.zip -d destination_folder
Just to warn you, it seems that SMRT Link is not supported on Windows. The steps I'm describing were tested on Ubuntu 14.04 Trusty Tahr.
SMRT Link is a big tool that include all services PacBio can offer. If you have access to a cluster facility and if you match the requirements, you should probably follow classic installation instructions. However, if you just need one or 2 of PacBio tools, you probably don't want to install the whole suit, which is very heavy to install. Here what you should do :
- Add a new non-root user "smrtanalysis" :
sudo adduser smrtanalysis
- Define in your bash what SMRT_USER is :
- Log on as the freshly created user :
su $SMRT_USER -> Extra tip : type
bash to grab your bash environment and have access to auto-completion while typing in terminal
- Define SMRT_ROOT :
SMRT_ROOT=opt/pacbio/smrtlink -> Extra tip : The $SMRT_ROOT directory must not exist when the installer is invoked, or installation will abort
- Launch SMRT analysis with option "tools-only" :
./smrtlink_220.127.116.1185.run --rootdir $SMRT_ROOT --smrttools-only
- After installation completed, move into directory that contains PacBio binaries :
cd /opt/pacbio/smrtlink/smrtcmds/bin -> Extra tip : add this directory in your PATH bash variable, so you will be able to launch PacBio tool from anywhere
- Congratulations ! You can now use every PacBio tools ! Here what you are supposed to have within this directory :
arrow DBdust ice_polish pbsmrtpipe
bam2bam DBrm ice_quiver pbsv bam2fasta
DBshow ipdSummary pbsvutil bam2fastq DBsplit
ipython pbtranscript bamtools DBstats
ipython2 pbvalidate bamtools-2.4.1 dexta
juliet picking_up_ice bax2bam fasta2DB
julietflow python blasr fasta-to-gmap-reference LA4Falcon
python2 ccs fasta-to-reference LA4Ice
python2.7 cleric fuse laa quiver
daligner gmap LAmerge REPmask
daligner_p gmap_build LAsort samtools
datander gmapl motifMaker sawriter dataset
HPC.daligner ngmlr separate_flnc dazcon
HPC.REPmask pbalign summarizeModifications DB2Falcon
HPC.TANmask pbdagcon TANmask DB2fasta
ice_combine_cluster_bins pbindex undexta DBdump
ice_partial pbservice variantCaller
Assembly polishing with arrow
Now that you successfully installed PacBio tools (I hope !), you can start to do use them. As i said, I'm going to focus on the polishing step. The two main tools for polishing are arrow and quiver, two different algorithms included in the variantCaller tool (the polishing tool). If you already wonder about the differences about the two tools, here the differences between the two of them :
Quiver is supported for PacBio RS data. Arrow is supported for PacBio
Sequel data and RS data with the P6-C4 chemistry.
As I understand it on PacBio github (https://github.com/PacificBiosciences/GenomicConsensus), the latest and thus more efficient tool is arrow. As I was myself very disappointed by quiver result (my case is particular, high polymorphism degree), I'm only going to describe polishing steps with the latest arrow algorithm.
Before to describe steps, I'm going to explain a few of my vocabulary about PacBio data. PacBio result, at least for my project, but I heard it was all the same everywhere, are organized by runs. Each run can contain several folder corresponding to one SMRT_cell. In an SMRT_cell directory, you normally have 3 .bax.h5 files, and 1 .bas.h5 files. What I call an SMRT_cell are the 3 bax.h5 files.
Also, something really important I've made, you may probably want to know the polishing coverage you are going to use. If you have a small data set with reasonable coverage (40-50X), then you should probably use all your raw reads for polishing. However, if you dataset is bigger, then you should probably take a subset of your raw reads. I've heard that polishing at very high coverage (>100X), take longer without improving considerably the quality of your assembly. 80X polishing coverage is what I used for correcting my assembly. You can roughly estimate your polishing coverage by converting one SMRT_cell (3 bax files) into fasta format by using the tool pls2fasta (is present in blasr package). Then you can count the total numbers of bases for that SMRT_cell and calculate your coverage with the size of your genome species.
An example of pls2fasta command I used :
pls2fasta rawReadPB_s1_p0.2.bax.h5 rawReadPB_s1_p0.2.fasta -trimByRegion
Now everything is explained, let's get thing started :
- Create a .fofn file for each SMRT_cell. A fofn file (File Of Files Names) look like this :
I can't save you with an extra tip this time.. I did this by hand (maybe someone will give a tip in comment ?)
- For each .fofn file, launch
bax2bam : a tool that will simply convert your raw bax file into a more convenient format. But be careful, theses BAM are not alignment yet. Here some bash scraps that may help you starting (sorry code insertion was not working for this):
for FOFN in "$FOFN_DIR"/* do
echo "Dealing with file "$FOFN
./bax2bam -f $FOFN -o $BAM_DIR"/SMRT-cell-"$COMPT --subread --pulsefeatures=DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV,PulseWidth,SubstitutionTag
This command will create several bam files :
.scraps.bam : contains all the sequences that were filtered out: low quality regions of the polymerase reads, adapter sequences (, barcodes, control sequences). These were stored (and labelled) in the bax files, and are basically “split” to a different BAM file when converting.
.subreads : your high-quality reads that you want to use for pbalign
.pbi : pacBio index
- For each PacBio subreads.bam , launch pbalign to align your reads against your reference. This can produce either a bam or a sam file. This time, the output is an alignment. You can launch pbalign as following :
for PB_BAM in "$BAM_DIR"/*"subreads.bam"
echo "Dealing with file "$PB_BAM
./pbalign --nproc 32 $PB_BAM your/assembly.fasta $ALIGN_DIR"/align_"$COMPT".bam"
-> Extra tip : don't forget to adjust the number of cores to your own possibility
- Merge all the produced bam files using
samtools merge -> Extra tip : you will probably have to sort all your bam files before merging them.
- Use arrow using your reference file and your merged bam. My advise is to personalize this step depending on your need. Several options are proposed, like a diploid option to allow more polymorphism, also a lot of read selection filtering options are available. So no copy/paste for this time ! -> Extra tip : You should have an eye on arrow help :
arrow -h... Obvious Ostrich strikes again !
I really hope all theses informations could be useful to anyone. I really struggled to gather all theses informations. Took me almost 2 hours to write this tutorial. Of course, this is not the absolute way of using PacBio tools ! But if you are lost, starting with PacBio tools, you can use this as a start to your analysis...
I'm open to feed back, corrections, precisions, questions... I just wanted to help !