StringTie Error: no valid ID found for GFF record
3.3 years ago
1234gingko ▴ 50

Hi, I successfully aligned and analyzed my RNA-Seq data using HISAT2, StringTie and DESeq2, with the La_Amiga3_1 genome (white lupin) from NCBI to map transcripts. Beginner's luck. Now I am trying to do the exact same thing using the CNRS_Lalb genome (also white lupin, on NCBI), and when I get to the first StringTie step, I get "Error: no valid ID found for GFF record". I have looked at both genome GTF files, and the first field (chromosome ID) looks fine in each (cut -f 1 *.gtf | sort | uniq); the two assemblies use different chromosome names, but nothing looks wrong. I don't think that is the problem, and I am looking for more hints as to what this error means. I did read the StringTie manual but need more help. Thanks very much, K

RNA-Seq • 7.9k views

omg, thanks so much. This enabled me to find a prior post, "Ensembl GTF format: isn't the tag "transcript_id" mandatory?", in which Ensembl explains the evolution of the GTF format and suggests exactly what you suggest: "I would recommend removing the gene lines from the gtf file". This gets me back on track so fast, I appreciate it! - Karen


Can you please post a couple of lines of the GTF file?


sure, thanks:

head -50 CN*/*.gtf
#gtf-version 2.2
#!genome-build CNRS_Lalb_1.0
#!genome-build-accession NCBI_Assembly:GCA_009771035.1
WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; 
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; 
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; 
WOCE01000065.1  Genbank exon    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id "Lalb_Chr00c40g0409291"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id ""; gbkey "Gene"; gene_biotype "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; note "5s_rRNA"; 
WOCE01000065.1  Genbank exon    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id "Lalb_Chr00c40g0409301"; gbkey "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; product "5S ribosomal RNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; 
WOCE01000065.1  Genbank exon    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id "Lalb_Chr00c40g0409311"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; 
WOCE01000065.1  Genbank exon    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id "Lalb_Chr00c40g0409321"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2659    2810    .   -   .   gene_id "Lalb_Chr00c40g0409331"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409331";
3.3 years ago

My guess is it's those lines with transcript_id "": they don't contain a valid ID, so StringTie is complaining. It's always a bit of a worry working out what to do with the transcript_id field on gene lines in a GTF file. The original GTF format didn't contain gene lines, but they appear to have crept in at some point. The Ensembl files just don't have a transcript_id field on their gene lines, but I bet that trips StringTie up as well.
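One way to check this guess before changing anything is to count the records that carry an empty transcript_id, grouped by feature type. The following is only a sketch: sample.gtf is a hypothetical two-record file built from abridged versions of the lines posted above, so point the awk command at your own annotation instead.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.gtf"   # hypothetical file name, for illustration only

# Two abridged records from the posted GTF; printf gives real tab separators.
printf 'WOCE01000065.1\tGenbank\tgene\t90\t241\t.\t-\t.\tgene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene";\n' > "$sample"
printf 'WOCE01000065.1\tGenbank\texon\t90\t241\t.\t-\t.\tgene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA";\n' >> "$sample"

# Count records whose attribute column (9) has an empty transcript_id,
# grouped by feature type (column 3).
empty_counts=$(awk -F'\t' '$9 ~ /transcript_id "";/ { n[$3]++ } END { for (f in n) print f, n[f] }' "$sample")
echo "$empty_counts"
```

If only "gene" shows up in the output, the empty IDs are confined to the gene lines, which is consistent with the explanation above.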

As for what to do: I recommend just removing the gene lines. They are not necessary anyway. Something like:

awk '$3 != "gene" ' my_annotation.gtf > my_annotation_no_genes.gtf
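A slightly more defensive variant of the same idea, as a sketch: split on tabs explicitly so only the feature column is tested, keep the header comments, and then confirm that no empty transcript_id is left. The file names and the three-line sample are hypothetical and abridged from the lines posted above; substitute your own paths.

```shell
# Work in a throwaway directory; file names here are placeholders.
tmpdir=$(mktemp -d)
in_gtf="$tmpdir/my_annotation.gtf"
out_gtf="$tmpdir/my_annotation_no_genes.gtf"

# Small sample input: one header comment, one gene record, one exon record.
printf '#gtf-version 2.2\n' > "$in_gtf"
printf 'WOCE01000065.1\tGenbank\tgene\t90\t241\t.\t-\t.\tgene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene";\n' >> "$in_gtf"
printf 'WOCE01000065.1\tGenbank\texon\t90\t241\t.\t-\t.\tgene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA";\n' >> "$in_gtf"

# Keep comment lines and every record whose feature type (column 3) is not "gene".
awk -F'\t' '/^#/ || $3 != "gene"' "$in_gtf" > "$out_gtf"

# Sanity checks: records still carrying an empty transcript_id, and records kept.
# grep -c exits non-zero when the count is 0, hence the "|| true".
remaining=$(grep -c 'transcript_id "";' "$out_gtf" || true)
kept=$(grep -vc '^#' "$out_gtf" || true)
echo "records kept: $kept, empty transcript_id remaining: $remaining"
```

If the remaining count is not zero on a real annotation, some non-gene features also have an empty transcript_id and would need separate handling.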
2.2 years ago
bio • 0

Hi! I have the same problem, and I don't know how to fix it.

2.2 years ago
Juke34 8.6k

You can try AGAT. Starting from this input (test.gtf):


WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; 
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; 
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA"; exon_number "1"

Remove the transcript_id attribute from the gene features: --gff test.gtf -p gene --att transcript_id -o test.gff


##gff-version 3
WOCE01000065.1  Genbank gene    90  241 .   -   .   ID=nbis-gene-1;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank RNA 90  241 .   -   .   ID=Lalb_Chr00c40g0409271;Parent=nbis-gene-1;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank exon    90  241 .   -   .   ID=exon-1;Parent=Lalb_Chr00c40g0409271;exon_number=1;gbkey=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271;product=hypothetical ncRNA;transcript_id=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank gene    417 575 .   -   .   ID=nbis-gene-2;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281
WOCE01000065.1  Genbank RNA 417 575 .   -   .   ID=Lalb_Chr00c40g0409281;Parent=nbis-gene-2;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281
WOCE01000065.1  Genbank exon    417 575 .   -   .   ID=exon-2;Parent=Lalb_Chr00c40g0409281;exon_number=1;gbkey=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281;product=hypothetical ncRNA;transcript_id=Lalb_Chr00c40g0409281

Convert back into GTF: --gff test.gff -o test_clean.gtf


##gtf-version 3
WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; ID "nbis-gene-1"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271";
WOCE01000065.1  Genbank transcript  90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; ID "Lalb_Chr00c40g0409271"; Parent "nbis-gene-1"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; original_biotype "rna";
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; ID "exon-1"; Parent "Lalb_Chr00c40g0409271"; exon_number "1"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA";
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; ID "nbis-gene-2"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281";
WOCE01000065.1  Genbank transcript  417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; ID "Lalb_Chr00c40g0409281"; Parent "nbis-gene-2"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; original_biotype "rna";
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; ID "exon-2"; Parent "Lalb_Chr00c40g0409281"; exon_number "1"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA";
