Tool: Annoyed by contig name inconsistencies (e.g. chr22 vs 22)? A solution
2.7 years ago
Divon ▴ 230

Genozip is a software package (developed by yours truly) for compressing genomic files (BAM, FASTQ, VCF and others). It typically compresses 2x-5x better than .gz.

In my research projects, I constantly spend too much time searching for a reference file with the right contig names (e.g. chr22 vs 22, MT vs chrM) for the BAM and VCF files at hand; otherwise, various bioinformatics tools I use tend to break. So I decided to solve this issue once and for all, using Genozip.

Today, I released a simple new feature to handle this: with the command-line option --match-chrom-to-reference, the contig names in your file are updated to match those of the provided reference.

Example (notice contig 1 is converted to chr1 both in the header and in the data line):

> cat example.sam

@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701       99      1       9997    34      28M1I6M1I39M4D68M7S     =       10159   324     CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA  FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:,  AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33     NM:i:12 RG:Z:1

> genozip example.sam --reference hg19.p13.plusMT.full_analysis_set.ref.genozip --match-chrom-to-reference
genozip example.sam : Done (1 second, SAM compression ratio: 14.4)

> genocat example.sam.genozip

@HD VN:1.4  SO:coordinate
@SQ SN:chr1 LN:249250621
A00910:85:HYGWJDSXX:2:2502:31647:7701       99      chr1    9997    34      28M1I6M1I39M4D68M7S     =       10159   324     CCCTTAACCCTAACCCTAACCCTAACCCTTAACCCTTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAAACCCTAACCCTAACCCTAACCCTAACCCAAACCAAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCAAA  FFFFFFFFFFF:FFFFFFF:FFFFFFFFF:F:FFFF:FFFFFFFFF:FFFFFFFFF:FF,:FFFFFFFFFFF,FFFFFFFF:FFF:::FFFF,F::FF:FFFFF::,FF,::FFF,:,FFF,,,,FF,::FFF:F,FF,,:FF:FFF,:,  AS:i:99 XS:i:96 MD:Z:0N0N0N0N69^CCCT29T4C33     NM:i:12 RG:Z:1
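
For readers who want to see what this kind of renaming involves, here is a minimal Python sketch of the idea. It is not Genozip's implementation: the function names `match_contig` and `rename_sam_line` are hypothetical, the alias rules (adding or stripping a `chr` prefix, and treating MT, chrM, M and chrMT as the same contig) are my own assumptions, and it expects tab-separated SAM text. See the documentation link below for how Genozip actually matches names.

```python
# Illustrative sketch only; NOT Genozip's algorithm.
# Assumed alias rules: "chr" prefix added/stripped, and a small set of
# mitochondrial spellings treated as equivalent.

MITO_ALIASES = ("chrM", "MT", "chrMT", "M")


def match_contig(name, reference_contigs):
    """Return the reference's spelling of `name`, or `name` unchanged if no match."""
    ref = set(reference_contigs)
    if name in ref:
        return name
    if name in MITO_ALIASES:
        # Pick whichever mitochondrial spelling the reference uses, if any.
        for alias in MITO_ALIASES:
            if alias in ref:
                return alias
        return name
    # Try the other "chr" convention: chr22 -> 22, or 22 -> chr22.
    alt = name[3:] if name.startswith("chr") else "chr" + name
    return alt if alt in ref else name


def rename_sam_line(line, reference_contigs):
    """Rename contigs in one tab-separated SAM line (header or alignment)."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@"):
        if fields[0] == "@SQ":
            # The contig name lives in the SN: tag of @SQ header lines.
            fields = [
                "SN:" + match_contig(f[3:], reference_contigs)
                if f.startswith("SN:") else f
                for f in fields
            ]
        return "\t".join(fields)
    if len(fields) >= 7:
        # Column 3 is RNAME and column 7 is RNEXT; "*" and "=" are left as-is.
        for col in (2, 6):
            if fields[col] not in ("*", "="):
                fields[col] = match_contig(fields[col], reference_contigs)
    return "\t".join(fields)


if __name__ == "__main__":
    reference = ["chr1", "chr22", "chrM"]
    print(rename_sam_line("@SQ\tSN:1\tLN:249250621", reference))
```

Names with no counterpart in the reference are passed through unchanged in this sketch, so nothing is silently dropped.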

Documentation: https://genozip.com/match-chrom.html

Installing: https://genozip.com/installing.html

Publication: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor
