Creating a transcript database (TxDb) from a GFF3 file?
16 months ago
lokraj2003 • 70

I am trying to create a transcript database (TxDb) using the GenomicFeatures package. I downloaded the GFF3 file from NCBI.

Code:

orf <- GenomicFeatures::makeTxDbFromGFF("orf.gff3",format="auto")

I get the following output:

orf
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: mouse.gff3
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 0
# exon_nrow: 0
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-05-29 22:32:09 -0500 (Wed, 29 May 2019)
# GenomicFeatures version at creation time: 1.32.2
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2

Link to the genome: https://www.ncbi.nlm.nih.gov/nuccore/AY386263.1

As you can see, the database is empty (no transcripts, exons, or CDS). Can anyone help with this, please?

16 months ago
SMK ♦ 1.3k
Ghent, Belgium

Hi lokraj2003,

You can add gene features to the GFF3 file that you downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AY386263.1, then reload it using the same function.
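Before editing, it may be worth confirming which feature types the file actually contains. The sketch below tallies column 3 of a GFF3 file. `count_feature_types` is a made-up helper name, and the guess that the download holds only region and CDS lines (so the CDS features have no gene or transcript to attach to) is an assumption based on the excerpt shown in this answer, not something checked against every record.

```shell
# Tally the feature types (column 3) of a GFF3 file, skipping comment and
# directive lines. count_feature_types is a hypothetical helper name.
count_feature_types() {
  awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ }
              END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Usage (only runs if the file from the question is present):
if [ -f orf.gff3 ]; then
  count_feature_types orf.gff3
fi
```

If the tally shows no gene, mRNA, or exon rows, that would be consistent with the empty TxDb in the question.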

For example, with awk (any method of reformatting the original GFF3 works). The command below creates a gene line for each CDS and adds a Parent attribute to each CDS; I leave any further parsing to you:

$ cat orf.gff3 \
  | awk 'BEGIN{FS=OFS="\t"} $3!="CDS"{print $0} $3=="CDS"{GENE=$0; gsub("\t0\t", "\t.\t", GENE); gsub("CDS", "gene", GENE); gsub("cds", "gene", GENE); gsub(";product=.*", "", GENE); print GENE; ID=$9; gsub(".*;protein_id=", "", ID); print $0 ";Parent=gene-" ID}' \
  > orf_re.gff3

$ head -n 5 orf_re.gff3
##sequence-region AY386263.1 1 137241
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10258
AY386263.1  Genbank region  1   137241  .   +   .   ID=AY386263.1:1..137241;Dbxref=taxon:10258;country=USA: Iowa;gbkey=Src;genome=genomic;isolate=ORFA;isolation-source=nasal secretions of a lamb at the Iowa Ram Test Station during an outbreak in 1982%2C then passaged in ovine fetal turbinate cells;mol_type=genomic DNA;strain=OV-IA82
AY386263.1  Genbank gene    2409    2858    .   -   .   ID=gene-AAR98099.1;Dbxref=NCBI_GP:AAR98099.1;Name=AAR98099.1;gbkey=gene
AY386263.1  Genbank CDS 2409    2858    .   -   0   ID=cds-AAR98099.1;Dbxref=NCBI_GP:AAR98099.1;Name=AAR98099.1;gbkey=CDS;product=ORF001 hypothetical protein;protein_id=AAR98099.1;Parent=gene-AAR98099.1
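
As a sanity check on the reformatted file, one could confirm that every CDS points at a gene that exists. This is a minimal sketch, not part of GenomicFeatures: `check_cds_parents` is a hypothetical helper name, and it assumes each CDS has a single Parent value (no comma-separated lists) and that the file is tab-separated.

```shell
# Report CDS lines whose Parent does not match any gene ID in the same file.
# check_cds_parents is a hypothetical helper name; the file is read twice.
check_cds_parents() {
  awk -F'\t' '
    # Return the value of attribute key from a GFF3 column-9 string, or "".
    function attr(s, key,    n, i, parts) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], key "=") == 1)
          return substr(parts[i], length(key) + 2)
      return ""
    }
    /^#/ { next }                                  # skip directives and comments
    NR == FNR {                                    # first pass: collect gene IDs
      if ($3 == "gene") genes[attr($9, "ID")] = 1
      next
    }
    $3 == "CDS" && !(attr($9, "Parent") in genes) {  # second pass: check each CDS
      print "orphan CDS: " attr($9, "ID")
      bad = 1
    }
    END { exit bad + 0 }                           # non-zero exit if any orphan was found
  ' "$1" "$1"
}

# Usage (only runs if the reformatted file is present):
if [ -f orf_re.gff3 ]; then
  check_cds_parents orf_re.gff3
fi
```

No output and a zero exit status mean every CDS resolved to a gene line.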

And using that gff3 you'll get:

> orf <- GenomicFeatures::makeTxDbFromGFF("orf_re.gff3", format = "auto")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> orf
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: orf_re.gff3
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 130
# exon_nrow: 130
# cds_nrow: 130
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-05-30 16:57:03 +0200 (Thu, 30 May 2019)
# GenomicFeatures version at creation time: 1.34.8
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2

Hope it helps. :-)

Thank you! It worked.

You're welcome! If an answer was helpful, you can upvote it; if it resolved your question, you can mark it as accepted.
