Question

Format of sequence not recognized

0

6.8 years ago

l.souza ▴ 80

I've been trying to run the Epitope Conservancy Analysis tool of IEDB, but when I submit the request it returns:

*Format of sequence not recognized*

My sequence is like this:

>KY449055.1_Foot_and_mouth_disease_virus_type_O
TTSTGESADPVTTTVEXYGGETQVQRRQHTDVSFILDRFVKVTPKDQIXVLDLMQTPAHT
LVGALLRTATYYFADLEVAVKHEGXLTWVPXGAPEAALDXTTXPTAYHKAPLTRLALPYT
APHRVLATVYXGXCKYGEGAVTXVRGDLQVLAQKAARTLPTSFXYGAIKATRVTELLYRM
KRAETYCPRPLLAIHPEQARHKQKIVAPVKQLL

What can I do to fix it?

fasta IEDB format • 2.6k views

0

6.8 years ago

GenoMax 141k

It looks like the tool is expecting a single-line (linear) sequence, and your FASTA appears to be multi-line. Most of the examples shown are short, whereas your sequence is pretty long. See if it works after linearizing it: use @Pierre's solution here to linearize the FASTA.
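For readers without access to the linked solution, here is a minimal Python sketch of what linearizing means: each record's wrapped sequence lines are joined into a single line under its header. This is not @Pierre's solution, and the function names `read_fasta` and `linearize` are hypothetical, chosen only for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # one wrapped piece of the sequence
    if header is not None:
        yield header, "".join(chunks)


def linearize(lines):
    """Return FASTA text with every sequence written on one line."""
    return "".join(f">{h}\n{s}\n" for h, s in read_fasta(lines))


# Toy input (not real data): two records, the first wrapped over two lines.
example = """>seq1
ACDE
FGHI
>seq2
KLMN
""".splitlines()

print(linearize(example), end="")
```

To apply it to a file, pass an open file handle instead of `example`, since a file handle also iterates line by line.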

0

I did it, and it returns this now:

/builddir/build/BUILD/Python-2.7.5/Objects/listobject.c:135: bad argument to internal function

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

0

Can you try a short stretch from your sequence to see if the tool works as expected? Perhaps your sequences are too long?

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

It still returns the same message:

/builddir/build/BUILD/Python-2.7.5/Objects/listobject.c:135: bad argument to internal function

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

1

I tried the example dataset available on the site and it returned proper results, so the tool is definitely working. It looks like you need to provide epitope sequences in Step 1 and then your actual sequence in Step 2. I assume that is what you are doing. See the example data here.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

This is exactly what I've done. Take a look:

The input:

[screenshot of the input; image not preserved]

The result:

[screenshot of the result; image not preserved]

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

1

Looks like a bug to me. You did select ">=100%" identity, so I would not expect the query to yield anything useful, but the error message is about an unsupported format. The tool clearly indicates that it supports FASTA, and this appears to be correctly formatted FASTA. From looking at the website, the tool appears to be protein-oriented, so the fact that your FASTA file contains AAs rather than NTs should not be a problem. Thus, I suggest you report this to the authors.

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

I'm gonna report it, but I'm in a hurry with these analyses! Do you know of any alternative tool?

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

0

You could download the standalone tool and run it locally.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

There is currently no downloadable version of the Epitope Conservancy Analysis tool.

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

0

There is a "Contact us" link at the top, so perhaps you could report the problem and ask about the standalone download at the same time.

ADD REPLY • link 6.8 years ago by GenoMax 141k

score 2 · Accepted Answer · 2017-07-01

2

6.8 years ago

l.souza ▴ 80

In the IUPAC code, unknown amino acids must be represented with X, while unknown nucleotides are represented with N, but this IEDB tool treats N as an unknown amino acid. So, I solved the problem by replacing all Xs with Ns.

Edited:

As mentioned in the replies, we can't replace X with N because N means asparagine, but the problem is clearly the X in the sequences. So I handled the problem by removing the sequences that contain X.
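A minimal Python sketch of that filtering step, assuming plain protein FASTA as input. The helper names `parse_fasta` and `split_by_unknown` are hypothetical, and the records below are toy data, not my real sequences. It also returns the dropped headers, so it is easy to see how many sequences were lost.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line  # append wrapped sequence lines
    return [(header, seq) for header, seq in records]


def split_by_unknown(records, unknown="X"):
    """Separate records without the unknown-residue symbol from those with it."""
    kept = [(h, s) for h, s in records if unknown not in s.upper()]
    dropped = [h for h, s in records if unknown in s.upper()]
    return kept, dropped


# Toy example: the second record contains an X and is dropped.
toy = ">a\nACDE\nFGHI\n>b\nACXE\n>c\nKLMN\n"
kept, dropped = split_by_unknown(parse_fasta(toy))

for header, seq in kept:
    print(f">{header}\n{seq}")
print("dropped:", dropped)
```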

0

Are you sure of what you are doing?

asparagine  asn N

ADD REPLY • link 6.8 years ago by h.mon 35k

0

OMG! Thank you for this note. I almost ruined my analyses... I forgot about asparagine! Anyway, if I replaced X and it worked, the problem must still be the character X. I'm gonna try replacing it with something else!

Thanks again!!

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

0

If I check the accession number you included in the original post, it leads to a nucleotide sequence record for a partial CDS of a gene. The translation included in the record does NOT contain any Xs. In fact, there are Ns where you have Xs in your sequence.

/translation="TTSTGESADPVTTTVENYGGETQVQRRQHTDVSFILDRFVKVTP
                     KDQINVLDLMQTPAHTLVGALLRTATYYFADLEVAVKHEGNLTWVPNGAPEAALDNTT
                     NPTAYHKAPLTRLALPYTAPHRVLATVYNGNCKYGEGAVTNVRGDLQVLAQKAARTLP
                     TSFNYGAIKATRVTELLYRMKRAETYCPRPLLAIHPEQARHKQKIVAPVKQLL"

So you were doing the right thing, but without checking the original sequence. The mystery is how you ended up with the incorrect sequence.
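A comparison like this can be scripted. The sketch below is only an illustration, with a hypothetical helper named `substitutions`: it lists every position where two equal-length sequences differ. The two strings are the first 20 residues of the sequence posted in the question and of the record's translation.

```python
def substitutions(query, reference):
    """Return (1-based position, query residue, reference residue) for each mismatch."""
    if len(query) != len(reference):
        raise ValueError("sequences must have the same length")
    return [
        (pos, q, r)
        for pos, (q, r) in enumerate(zip(query, reference), start=1)
        if q != r
    ]


posted = "TTSTGESADPVTTTVEXYGG"  # start of the sequence in the question
record = "TTSTGESADPVTTTVENYGG"  # start of the /translation in the record

print(substitutions(posted, record))
```

If every reported mismatch pairs an X in the query with an N in the reference, that points to a search-and-replace accident rather than truly unknown residues.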

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

It's because I have a set of about 2000 sequences. Some of them contain unknown amino acids. When I edited the sequences, replacing X with N, the server accepted all the sequences. So, it leads me to believe the problem is the X.

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

0

> So, it leads me to believe the problem is the X

We have established that without a doubt. X is supposed to represent "Unknown" per the IUPAC code. The question is how you ended up with Xs in this particular sequence, when there should have been Ns according to the original NCBI record. Perhaps your other sequences have the same issue?

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

I checked the sequence I wrote as an example and there isn't an X in it. Some mistake may have happened when I was wrongly replacing the letters.

I looked at another sequence that has an X in its translation, and the codon responsible for the X is GCS. In the IUPAC code, S means 'G or C'. Both GCG and GCC result in alanine, but the translation tool doesn't take that into account and writes X. In another sequence, there's a codon SST, which also leads to X.
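To illustrate the point, here is a minimal Python sketch of an ambiguity-aware translation, assuming the standard genetic code. It is not the translation tool I used, and the names `translate_codon` and `translate` are hypothetical. Each ambiguous codon is expanded into all the concrete codons it can stand for, and X is written only when those codons disagree on the amino acid.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes and the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Translate one codon; return X only if the ambiguity changes the residue."""
    codon = codon.upper().replace("U", "T")
    if len(codon) != 3 or any(base not in IUPAC for base in codon):
        return "X"
    residues = {
        CODON_TABLE["".join(concrete)]
        for concrete in product(*(IUPAC[base] for base in codon))
    }
    return residues.pop() if len(residues) == 1 else "X"


def translate(seq):
    """Translate a nucleotide sequence in frame 1, ignoring trailing bases."""
    usable = len(seq) - len(seq) % 3
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, usable, 3))


print(translate_codon("GCS"))  # GCC and GCG both encode alanine
print(translate_codon("SST"))  # expands to codons for different residues
```

With this approach GCS resolves to A, while SST stays X because its four possible codons (CCT, CGT, GCT, GGT) encode different amino acids.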

ADD REPLY • link 6.8 years ago by l.souza ▴ 80

0

> I checked the sequence I wrote as an example and there isn't X in it.

I see that you are from Brazil, so I assume the difference in your keyboard layout must be causing this. I see Xs in the sequence posted in the original question.

ADD REPLY • link 6.8 years ago by GenoMax 141k

0

I mean I checked the original file. The sequence posted in the original question is wrong, with improper Xs in it. I didn't change the original post to keep the logic of the replies.

ADD REPLY • link 6.8 years ago by l.souza ▴ 80