Biostar Beta. Not for public use.
Getting Annotation Data From A Gff File
0
Entering edit mode
2.7 years ago
robjohn7000 • 70
United Kingdom

Hi,

I' m trying to get annotation data from a gff file using rtracklayer, and got some errors that I can't figure out:

library(rtracklayer)
library(GenomicRanges)
library(Rsamtools)

egff <- import.gff("./data/NC_016810.gff", asRangedData=FALSE)
seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="region"),]))

errors:

 Error in .normargSeqlengths(value, seqnames(x)) : 
 length of supplied 'seqlengths' must equal the number of sequences

Can anyone please help me understand the reason for the errors?

ADD COMMENTlink
0
Entering edit mode

what's the output of length(which(elementMetadata(gff)$type=="region")) and length(seqlengths(gff))? They're apparently not the same.

ADD REPLYlink
0
Entering edit mode

typo? s/egff/gff/

ADD REPLYlink
0
Entering edit mode
17 months ago
Bergen, Norway

You are trying to set the seqlengths attribute of a GRanges object, however from the documentation:

seqlengths

an integer vector named with the sequence names and containing the lengths (or NA) for each level(seqnames).

This attributes refers to the "landmark" sequences (e.g. chromosomes, scaffolds, contigs), not all sequence features in the GFF file ("region"). There is only one chromosome for NC_01681 according to the GenBank file that is:

LOCUS       NC_016810            4878012 bp    DNA     linear   CON 20-AUG-2013
DEFINITION  Salmonella enterica subsp. enterica serovar Typhimurium str.
            SL1344, complete genome.

So, the correct call should be seqlengths(gff) <- c(4878012).

To access or set the length of each feature, use width(gff) or width(gff)<-.

ADD COMMENTlink
0
Entering edit mode

problem solved! Thanks vvery much Michael for the code and the exaplanation. I'm surprised how c(4878012) did the trick; though I know 'c' combines values.

ADD REPLYlink
0
Entering edit mode

Hi,

I am having the same problem with very similar code (that I learnt in a BioC RNAseq workshop). My code is:

gff <- import.gff("dmel-all-r6.05.gff.gz", asRangedData=F)


seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),]))

I then get the same error message as the previous questioner:

Error in .normargSeqlengths(value, seqnames(x)) :
  length of supplied 'seqlengths' must equal the number of sequences

The problem I'm now having, after seeing the solution above, is that I don't understand where the '4878012' length is coming from, or how to work out what the length should be in my case. Would anyone be willing to give a little more explanation?

Thanks very much in advance.

ADD REPLYlink
0
Entering edit mode

I guess the error message is justified. Did you check what both sides of the assignment yield, just compare their length? Possible reasons: you need to look into the GFF file, there is not requirement that all or any chromosomes are defined in the GFF, possibly not all of the landmarks are of type chromosome (imagine mitochondrial genome, unplaced scaffolds, etc.) then your code would fail.

ADD REPLYlink
0
Entering edit mode

Hi,

I compared the lengths of each side of the assignment:

length(seqlengths(gff))

[1] 1871

length(end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),])))

[1] 2

Obviously this makes the error message in itself very clear, but I'm not sure what information I need to figure out the correct code.

I also tried to get a handle on what the properties of the .gff file were:

seqnames(gff)

factor-Rle of length 12818678 with 1871 runs Lengths: 10 ... Values : 211000022278031 ... Levels(1871): 211000022278031 211000022278032 ... Y_mapped_Scaffold_9_D1573

ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),])

IRanges of length 2 start end width [1] 1 19517 19517 [2] 1 19524 19524

I'm sorry to say that I am still stuck. I have tried setting 'seqlengths(gff)' to some of these lengths (2, 1871, 12818678) but none of these work. I just don't understand which length I should be using, and I can't see where to look to find the correct figure.

I am slightly concerned that there is something odd about the .gff file I was recommended to use, because:

seqinfo(gff)

Seqinfo object with 1871 sequences from an unspecified genome; no seqlengths: seqnames seqlengths isCircular genome 211000022278031 <NA> <NA> <NA> 211000022278032 <NA> <NA> <NA> 211000022278033 <NA> <NA> <NA> 211000022278038 <NA> <NA> <NA> 211000022278047 <NA> <NA> <NA> ... ... ... ... Y_mapped_Scaffold_30_D1720 <NA> <NA> <NA> Y_mapped_Scaffold_34_D1584 <NA> <NA> <NA> Y_mapped_Scaffold_53_D1765 <NA> <NA> <NA> Y_mapped_Scaffold_5_D1748_D1610 <NA> <NA> <NA> Y_mapped_Scaffold_9_D1573 <NA> <NA> <NA>

So there doesn't appear to be any seqlengths? Can this be correct?

Thanks again!

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3