Tutorial: A compilation of conversion tools for BED, SAM/BAM, psl, pslx, blast tabular and blast xml
6.4 years ago
Joseph Hughes ★ 3.0k

A wide range of formats exist for representing the comparisons of different sequences to each other: blast tabular, blast xml, psl, pslx, SAM/BAM, BED

Most of these formats can be converted from one to another. Sometimes the conversion is lossless, so the original data can be converted without any loss of information. Other times the conversion is lossy: only part of the original data can be carried over, and some information is lost.

Here, I have compiled the tools or UNIX commands necessary for converting from one file format to another. As you can see, I still need to complete some of the gaps, so please let me know of any other tools which are missing.

The commands are shown in full below the table.

| From (row) / To (column) | blast-xml | blast-tab | psl | pslx | SAM/BAM | BED |
|---|---|---|---|---|---|---|
| blast-xml | N/A | blast2tsv.xsl | blastXmlToPsl | blastXmlToPsl -pslx | blast2bam | BLAST_to_BED |
| blast-tab | perl blast2xml.pl | N/A | ?? | ?? | ?? | blast2bed |
| psl | ?? | ?? | N/A | pslToPslx | psl2sam.pl | pslToBed, psl2bed |
| pslx | ?? | ?? | ?? | N/A | ?? | pslToBed |
| SAM/BAM | ?? | ?? | sam2psl.py | sam2psl.py -s | samtools view | bedtools bamtobed |
| BED | ?? | ?? | bedToPsl | ?? | bedtools bedtobam | N/A |

Convert from SAM to psl using sam2psl.py

Available from: https://github.com/ndaniel/fusioncatcher/blob/master/bin/sam2psl.py

Example command:

python sam2psl.py -i test.sam -s -o test.psl

With the -s option this is a lossless conversion, because the read sequence is added as column 22; however, the resulting file is no longer valid psl. Without -s the output is standard psl, but the read sequence is lost:

python sam2psl.py -i test.sam -o test_no_seq.psl

The help for sam2psl:

python sam2psl.py
Usage: sam2psl.py [options]

It takes as input a file in SAM format and it converts into a PSL format file.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i INPUT_FILENAME, --input=INPUT_FILENAME
                        The input file in SAM format.
  -4, --skip-conversion-cigar-1.3
                        By default if the CIGAR strings in the input SAM file
                        are in the format defined in SAM version 1.4 (i.e.
                        there are 'X' and '=') then the CIGAR string will be
                        first converted into CIGAR string, which is described
                        in SAM version 1.3, (i.e. there are no 'X' and '='
                        which are replaced with 'M') and afterwards into PSL
                        format. Default is 'False'.
  -s, --read-seq        It adds to the PSL output as column 22, the sequence
                        of the read. This is not anymore a valid PSL format.
  -r REPLACE_READS_IDS, --replace-read-ids=REPLACE_READS_IDS
                        In the reads ids (also known as query name in PSL) the
                        string specified here will be replaced with '/' (which
                        is used in Solexa for /1 and /2).
  -o OUTPUT_FILENAME, --output=OUTPUT_FILENAME
                        The output file in PSL format.
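
Because -s adds an extra column, a quick way to see which flavour of file you have is to count the tab-separated columns: standard psl has 21 and pslx has 23, so the 22-column output of sam2psl.py -s is neither. The following is only a minimal sketch; the function name psl_flavour is made up for illustration, and it assumes data lines start with an integer (the matches column) so that any psLayout header lines are skipped.

```shell
# Hypothetical helper: report the column count(s) found on stdin and a label.
psl_flavour() {
  awk -F'\t' '
    $1 !~ /^[0-9]+$/ { next }          # skip optional psLayout header lines
    { count[NF]++ }                    # tally data lines by number of columns
    END {
      for (nf in count) {
        label = "non-standard"
        if (nf == 21) label = "psl"
        if (nf == 23) label = "pslx"
        print nf, label, count[nf]     # columns, label, number of lines
      }
    }'
}
```

For example, `psl_flavour < test.psl` prints one line per distinct column count found in the file.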

Convert from psl to SAM

Available from the samtools legacy scripts: https://github.com/lh3/samtools-legacy/blob/master/misc/psl2sam.pl

Example command:

psl2sam.pl test.psl

This ends up being a lossy conversion as the read sequence is not in the output.

Usage

psl2sam.pl 
Usage: psl2sam.pl [-a 1] [-b 3] [-q 5] [-r 2] <in.psl>

The options are used to calculate a blast-like score; see the post "How To Use Psl2Sam.Pl From Samtools?".

Convert psl to pslx

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

pslToPslx test_no_seq.psl test.fa ref.fa test.pslx

This is a lossless conversion. For usage:

pslToPslx - Convert from psl to pslx format, which includes sequences
usage:
   pslToPslx [options] in.psl qSeqSpec tSeqSpec out.pslx

qSeqSpec and tSeqSpec can be nib directory, a 2bit file, or a FASTA file.
FASTA files should end in .fa, .fa.gz, .fa.Z, or .fa.bz2 and are read into
memory.

Options:
  -masked - if specified, repeats are in lower case cases, otherwise entire
            sequence is loader case.

Convert SAM to fasta

awk '$1 !~ /^@/ {print ">"$1"\n"$10}' test.sam > test.fa
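
If the SAM file contains secondary or supplementary alignments, or records without a stored sequence, a one-liner like the one above will emit repeated or empty entries. The sketch below is one possible way to filter those out; the function name sam_to_fasta is made up for illustration. Note that SAM stores the sequence of reverse-strand alignments reverse-complemented, and this sketch does not restore the original read orientation.

```shell
# Hypothetical helper: read SAM on stdin, write FASTA on stdout.
sam_to_fasta() {
  awk -F'\t' '
    /^@/                    { next }   # header lines
    int($2 / 256) % 2 == 1  { next }   # secondary alignment (flag 0x100)
    int($2 / 2048) % 2 == 1 { next }   # supplementary alignment (flag 0x800)
    $10 == "*"              { next }   # no sequence stored in this record
    { print ">" $1; print $10 }
  '
}
```

Usage would be `sam_to_fasta < test.sam > test.fa`.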

Convert psl to BED

Option 1:

Using pslToBed from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion as the standard psl doesn't have the sequence and so the bed file doesn't either.

pslToBed test_no_seq.psl test.bed

Option 2 as suggested by Alex Reynolds:

Using psl2bed from http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html

This is also lossless when used with --keep-header:

Example:

psl2bed < in.psl > out.bed

As a bonus, it uses sort-bed to make a sorted BED file, so that it is ready to use with bedops, bedmap, etc.

Convert pslx to BED

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion as the sequence is lost.

pslToBed test.pslx test.bed

Convert BAM to BED using bedtools

http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html

This is a lossy conversion as the sequence data is lost.

bedtools bamtobed -i test.bam > test_bamtobed.bed
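
To make it concrete which fields survive, here is a rough awk sketch of a SAM-to-BED6 mapping of the kind bamtobed produces by default (chrom, 0-based start, end, read name, mapping quality, strand). It is an illustration under my own assumptions, not the bedtools implementation: the function name sam_to_bed6 is made up, and details such as the /1 and /2 suffixes bedtools adds to paired read names are not reproduced.

```shell
# Hypothetical helper: read SAM records on stdin, write BED6 on stdout.
sam_to_bed6() {
  awk -F'\t' -v OFS='\t' '
    /^@/                 { next }      # header lines
    int($2 / 4) % 2 == 1 { next }      # unmapped read (flag 0x4)
    {
      cigar = $6; ref_len = 0
      # add up the operations that consume reference bases (M, D, N, =, X)
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        n  = substr(cigar, 1, RLENGTH - 1) + 0
        op = substr(cigar, RLENGTH, 1)
        if (op ~ /[MDN=X]/) ref_len += n
        cigar = substr(cigar, RLENGTH + 1)
      }
      strand = (int($2 / 16) % 2 == 1) ? "-" : "+"   # flag 0x10 = reverse strand
      # SAM POS is 1-based; BED start is 0-based and the end is exclusive
      print $3, $4 - 1, $4 - 1 + ref_len, $1, $5, strand
    }'
}
```

It could be fed with `samtools view test.bam | sam_to_bed6`. The sequence, qualities, CIGAR and optional tags have no place in the six BED columns, which is why the conversion is lossy.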

Convert BED to BAM

Create the genome file for bedtools from the FASTA index (samtools faidx writes the index to ref.fa.fai):

samtools faidx ref.fa
awk -v OFS='\t' '{print $1,$2}' ref.fa.fai > ref.txt

Use the genome file and the BED file to produce the BAM file:

bedtools bedtobam -i test_bamtobed.bed -g ref.txt > test_bedtobam.bam

The sequence is not present in the BED file, so it is absent from the BAM file as well. This is a lossy format conversion. Additionally, the number of reads differs from the original file.

Convert BED to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion as neither the BED nor psl contain sequence information

bedToPsl Longest_revcomp.txt test.bed test_bedtopsl.psl

Usage:

bedToPsl - convert bed format files to psl format
usage:
   bedToPsl chromSizes bedFile pslFile

Convert a BED file to a PSL file. This the result is an alignment.
 It is intended to allow processing by tools that operate on PSL.
If the BED has at least 12 columns, then a PSL with blocks is created.
Otherwise single-exon PSLs are created.

Options:
-keepQuery  -  instead of creating a fake query, create PSL with identical query and
                target specs. Useful if bed features are to be lifted with pslMap and one 
                wants to keep the source location in the lift result.

Preparing blast-xml format

makeblastdb -dbtype nucl -in Longest_revcomp.fa
blastn -query test.fa -db Longest_revcomp.fa -out test.blastxml -outfmt 5

Preparing blast-tab format

blastn -query test.fa -db Longest_revcomp.fa -out test.blasttab -outfmt 6

Blast-xml to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion, as the standard psl output does not contain the sequences.

blastXmlToPsl test.blastxml test_blastxmltopsl.psl

However, if you use the -pslx option, you can get lossless conversion

blastXmlToPsl -pslx test.blastxml test_blastxmltopsl.pslx

Converting from blast-xml to SAM/BAM

Using https://github.com/guyduche/Blast2Bam

This is a lossless conversion: the read sequences and qualities are introduced from the FastQ files.

blast2bam -o test_blastxmltosam.sam test.blastxml ref.fa reads_1.fq reads_2.fq 

Usage

Blast2Bam. Last compilation: Jun 27 2017 at 15:21:50.

Usage: blast2bam [options] <Blast XML output> <reference sequence dictionary> <FastQ_1> [FastQ_2]

Options:
 --output         | -o FILE       Output file (default: stdout)
 --interleaved    | -p            Interleaved data
 --readGroup      | -R STR        Read group header line '@RG\tID:foo'
 --minAlignLength | -W INT        Discard alignments shorter than [INT]
 --shortCigar     | -c            Short version of the CIGAR string ('M' instead of '=' and 'X')
 --posOnChr       | -z            Adjust the alignment position to the first position of the reference
 --help           | -h            Get help (this screen)

The SAM output can subsequently be converted to BAM using samtools:

samtools view -b test_blastxmltosam.sam > test_blastxmltosam.bam

Blast-xml to BED

Using https://github.com/mojaveazure/BLAST_to_BED

Command

BLAST_to_BED.py -x test.blastxml -o test_blastxmltobed.bed

This is a lossy conversion as the sequence information is lost.

Converting Blast tabular to blast-xml

This is not completely possible because information such as the alignment is missing from the tabular output, but see the post "Convert Blast Output Into Blast-Xml", or use the script from the blast2go google group: https://11302289526311073549.googlegroups.com/attach/ed2c446e1b1852a9/blast2xml.pl?part=0.1&view=1&vt=ANaJVrEJYYa7SZC-uvOtoKb6932qlMJWltc2p_5GrTK5Wi7jo-hw14zFroKEcLhdNcJUcQweoUJOuXk2H7wQB5q6mzDTTn211hC2OvwiWw0b5PZev-HQ7Qg

Command

perl blast2xml.pl -i test.blasttab -o test_blasttoblastxml

Usage

perl blast2xml.pl 

- i|input     : path of the input file (must be text blast file output)
- o|output    : path of the output file (by default, the same as the input file)
- s|sequences    : number of sequences by xml file (default inf)
- hit        : number of hit to print for each sequences (default inf)
- hsp        : number of hsp to print for each hit (default inf)
- help|h|?    : print this help and exit

Convert blast-xml to blast tabular

Several approaches have been suggested in the post "Tools Parsing Ncbi Blast -M 7 Xml Output Format?", but the most straightforward I have found is to use the style sheet blast2tsv.xsl from here: https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2tsv.xsl. This is a lossless conversion, but the output is not standard blast-tabular format, as it contains the sequence and the aligned site information in the last two columns.

Command:

xsltproc --novalid blast2tsv.xsl test.blastxml

Blast tabular to BED

Using https://github.com/nterhoeven/blast2bed

blast2bed test.blasttab

The output will be in test.blasttab.bed. This is a lossless conversion, as neither blast-tabular nor BED contains the sequence.

Usage

Usage: ./blast2bed <blastoutput.bls>
The blast file should be in blast outfmt 6 or 7.
See Readme.org for more details.
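
For readers who want to see roughly what such a conversion involves, below is a minimal awk sketch that maps the 12 default outfmt 6 columns onto BED6, using the subject as the BED chromosome. This is my own illustration and not the blast2bed script: the function name blasttab_to_bed6 is made up, and putting the bit score in the BED score column is an assumption (strict BED validators expect an integer from 0 to 1000 there).

```shell
# Hypothetical helper: read blast outfmt 6/7 on stdin, write BED6 on stdout.
# Default columns: qseqid sseqid pident length mismatch gapopen
#                  qstart qend sstart send evalue bitscore
blasttab_to_bed6() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                      # outfmt 7 comment lines
    {
      s = $9 + 0; e = $10 + 0; strand = "+"
      # minus-strand hits are reported with sstart > send
      if (s > e) { t = s; s = e; e = t; strand = "-" }
      # blast coordinates are 1-based inclusive; BED start is 0-based
      print $2, s - 1, e, $1, $12, strand
    }'
}
```

Usage would be `blasttab_to_bed6 < test.blasttab > test_sketch.bed`.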

Converting SAM to BAM

Using samtools http://quinlanlab.org/tutorials/samtools/samtools.html This is a lossless format conversion.

Command

samtools view -bS test.sam > test.bam

Converting BAM to SAM using samtools

This is a lossless format conversion.

samtools view -h -o test.sam test.bam

PS: I dedicate this tutorial to Sej, a great bioinformatician and friend ;)

PPS: Due to the limited number of characters, I had to remove the usage information for each of the tools. A more complete version is cross-posted here.


