Question

How to run SnpEff for Tomato 2.5 (S_lycopersicum_chromosomes.2.50) reference, gff3 and VCF?

1

Entering edit mode

7.3 years ago

William ★ 5.3k

I would like to run SnpEff for variant effect prediction in Tomato.

The master data (reference + gff3) that I am using is from the SolGenomics consortium.

wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/S_lycopersicum_chromosomes.2.50.fa.gz
wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_gene_models.gff3.gz

Against this reference I aligned reads with BWA and called variants with Freebayes, thus creating a VCF file.

Information from the readme in the FTP directory indicates that the Fasta and GFF3 are compatible.

Please note that ITAG2.4 corresponds to build 2.5 of the tomato genome

ITAG2.4_gene_models.gff3 = GFF version 3 file containing gene models in this release.

I used the following commands to create a custom SnpEff database:

gunzip ITAG2.4_gene_models.gff3.gz
gunzip S_lycopersicum_chromosomes.2.50.fa.gz

mv ITAG2.4_gene_models.gff3 genes.gff
mv S_lycopersicum_chromosomes.2.50.fa sequences.fa

java -jar snpEff.jar build -gff3 -v S_lycopersicum_2_5

I added this line to the snpEff.config file

S_lycopersicum_2_5.genome : S_lycopersicum_2_5

The version of SnpEff that I am using is:

00:00:00        SnpEff version SnpEff 4.3 (build 2016-07-03 08:26), by Pablo Cingolani

According to the SnpEff summary output (see below) 25% of the transcripts have "STOP codons in CDS errors". For other species I normally only get some UTR errors, with 0% errors for the other categories.

Does this indicate that there is something wrong with the SolGenomics GFF3 file?

Or could this be a bug / incompatibility in SnpEff?

#-----------------------------------------------
# Genome name                : 'S_lycopersicum_2_5'
# Genome version             : 'S_lycopersicum_2_5'
# Genome ID                  : 'S_lycopersicum_2_5[0]'
# Has protein coding info    : true
# Has Tr. Support Level info : true
# Genes                      : 34725
# Protein coding genes       : 34725
#-----------------------------------------------
# Transcripts                : 34725
# Avg. transcripts per gene  : 1.00
# TSL transcripts            : 0
#-----------------------------------------------
# Checked transcripts        :
#               AA sequences :      0 ( 0.00% )
#              DNA sequences :      0 ( 0.00% )
#-----------------------------------------------
# Protein coding transcripts : 34725
#              Length errors :   5622 ( 16.19% )
#  STOP codons in CDS errors :   8812 ( 25.38% )
#         START codon errors :   3380 ( 9.73% )
#        STOP codon warnings :    400 ( 1.15% )
#              UTR sequences :  15666 ( 45.11% )
#               Total Errors :   9177 ( 26.43% )
#-----------------------------------------------
# Cds                        : 157233
# Exons                      : 160001
# Exons with sequence        : 160001
# Exons without sequence     : 0
# Avg. exons per transcript  : 4.61
# WARNING                    : No mitochondrion chromosome found
#-----------------------------------------------
# Number of chromosomes      : 13
# Chromosomes                : Format 'chromo_name size codon_table'
#               'SL2.50ch01'    98543444        Standard
#               'SL2.50ch09'    72482091        Standard
#               'SL2.50ch03'    70787664        Standard
#               'SL2.50ch07'    68045021        Standard
#               'SL2.50ch12'    67145203        Standard
#               'SL2.50ch04'    66470942        Standard
#               'SL2.50ch05'    65875088        Standard
#               'SL2.50ch08'    65866657        Standard
#               'SL2.50ch10'    65527505        Standard
#               'SL2.50ch11'    56302525        Standard
#               'SL2.50ch02'    55340444        Standard
#               'SL2.50ch06'    49751636        Standard
#               'SL2.50ch00'    21805821        Standard
#-----------------------------------------------

I am not using the pre-built SnpEff Solanum_lycopersicum BIN file because I need to be sure that the master data (fasta+gff3) equals the data from the SolGenomics.

Is there a way to find out what the cause of the errors is?

And how can I get the combination of SnpEff and the SolGenomics GFF3 to work?

snpeff • 2.9k views

ADD COMMENT • link updated 7.3 years ago by Vitis ★ 2.5k • written 7.3 years ago by William ★ 5.3k

score 2 · Accepted Answer · 2017-01-17

2

Entering edit mode

7.3 years ago

Vitis ★ 2.5k

The easy way is to use something else to predict the effects of your SNPs and see whether the same thing happens (much more stop codons than expected). I'd suggest to give Ensembl VEP a try.

ADD COMMENT • link 7.3 years ago by Vitis ★ 2.5k