I would like to run SnpEff for variant effect prediction in Tomato.
The master data (reference + gff3) that I am using is from the SolGenomics consortium.
wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/S_lycopersicum_chromosomes.2.50.fa.gz
wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_gene_models.gff3.gz
Against this reference I aligned reads with BWA and called variants with Freebayes, thus creating a VCF file.
Information from the readme in the FTP directory indicates that the Fasta and GFF3 are compatible.
- Please note that ITAG2.4 corresponds to build 2.5 of the tomato genome
- ITAG2.4_gene_models.gff3 = GFF version 3 file containing gene models in this release.
I used the following commands to create a custom SnpEff database:
gunzip ITAG2.4_gene_models.gff3.gz
gunzip S_lycopersicum_chromosomes.2.50.fa.gz
mv ITAG2.4_gene_models.gff3 genes.gff
mv S_lycopersicum_chromosomes.2.50.fa sequences.fa
java -jar snpEff.jar build -gff3 -v S_lycopersicum_2_5
I added this line to the snpEff.config file
S_lycopersicum_2_5.genome : S_lycopersicum_2_5
The version of SnpEff that I am using is:
00:00:00 SnpEff version SnpEff 4.3 (build 2016-07-03 08:26), by Pablo Cingolani
According to the SnpEff summary output (see below) 25% of the transcripts have "STOP codons in CDS errors". For other species I normally only get some UTR errors, with 0% errors for the other categories.
Does this indicate that there is something wrong with the SolGenomics GFF3 file?
Or could this be a bug / incompatibility in SnpEff?
#-----------------------------------------------
# Genome name : 'S_lycopersicum_2_5'
# Genome version : 'S_lycopersicum_2_5'
# Genome ID : 'S_lycopersicum_2_5[0]'
# Has protein coding info : true
# Has Tr. Support Level info : true
# Genes : 34725
# Protein coding genes : 34725
#-----------------------------------------------
# Transcripts : 34725
# Avg. transcripts per gene : 1.00
# TSL transcripts : 0
#-----------------------------------------------
# Checked transcripts :
# AA sequences : 0 ( 0.00% )
# DNA sequences : 0 ( 0.00% )
#-----------------------------------------------
# Protein coding transcripts : 34725
# Length errors : 5622 ( 16.19% )
# STOP codons in CDS errors : 8812 ( 25.38% )
# START codon errors : 3380 ( 9.73% )
# STOP codon warnings : 400 ( 1.15% )
# UTR sequences : 15666 ( 45.11% )
# Total Errors : 9177 ( 26.43% )
#-----------------------------------------------
# Cds : 157233
# Exons : 160001
# Exons with sequence : 160001
# Exons without sequence : 0
# Avg. exons per transcript : 4.61
# WARNING : No mitochondrion chromosome found
#-----------------------------------------------
# Number of chromosomes : 13
# Chromosomes : Format 'chromo_name size codon_table'
# 'SL2.50ch01' 98543444 Standard
# 'SL2.50ch09' 72482091 Standard
# 'SL2.50ch03' 70787664 Standard
# 'SL2.50ch07' 68045021 Standard
# 'SL2.50ch12' 67145203 Standard
# 'SL2.50ch04' 66470942 Standard
# 'SL2.50ch05' 65875088 Standard
# 'SL2.50ch08' 65866657 Standard
# 'SL2.50ch10' 65527505 Standard
# 'SL2.50ch11' 56302525 Standard
# 'SL2.50ch02' 55340444 Standard
# 'SL2.50ch06' 49751636 Standard
# 'SL2.50ch00' 21805821 Standard
#-----------------------------------------------
I am not using the pre-built SnpEff Solanum_lycopersicum BIN file because I need to be sure that the master data (fasta+gff3) equals the data from the SolGenomics.
Is there a way to find out what the cause of the errors is?
And how can I get the combination of SnpEff and the SolGenomics GFF3 to work?