Question

Manually curate gene annotation by visualizing on IGV or tablet or any better visualization tool

5.0 years ago

DanielC ▴ 170

Dear Friends,

I have the data listed below and want to curate the annotation manually in a viewer. Could you please help me understand how to visualize these data and manually curate the annotation in IGV, Tablet, or any other tool? I haven't used IGV before. I would really appreciate your input.

Data:
 --> annotated gff file from prokka of the assembled contig

--> gff file of the 20 best blast hits (of the assembled contig)
The GFF file looks like this:

KC139526.1      BLASTN  hsp     13946   15668   0.0     -       0       Match CBphage_assembly-spades-25000-readsae:NODE_1_length_88156_cov_48.969578
KC139526.1      BLASTN  hsp     34229   36058   0.0     -       0       Match CBphage_assembly-spades-25000-readsae:NODE_1_length_88156_cov_48.969578
KC139526.1      BLASTN  hsp     36062   37169   0.0     -       0       Match CBphage_assembly-spades-25000-readsae:NODE_1_length_88156_cov_48.969578
KC139526.1      BLASTN  hsp     13190   13696   0.0     -       0       Match CBphage_assembly-spades-25000-readsae:NODE_1_length_88156_cov_48.969578
KC139526.1      BLASTN  hsp     37315   37446   7e-50   -       0       Match CBphage_assembly-spades-25000-readsae:NODE_1_length_88156_cov_48.969578


--> assembled contig fasta file

--> SAM file of the reads mapped onto the assembled contig

--> protein domain hits from interproscan for the annotated genes (for functional annotation) in GFF format

What I am looking for is to visualize the data above as stacked tracks in a viewer, in the following order if possible:

assembled contig
prokka annotated genes
interproscan annotation of genes
20 blast hits
reads

The order can change; this is just a picture of what I am looking to visualize. Thanks!

annotation visualize gene curate • 3.4k views



Thank you. I will try IGV too, but for now I am using GenomeView, as Lieven suggested, for annotation editing. However, I see an issue when loading this file in GenomeView:

"--> protein domain hits from interproscan for the annotated genes (for functional annotation) in GFF format"

The file was produced by the program interproscan (https://github.com/ebi-pf-team/interproscan/wiki/HowToRun) and its contents look like this:

C600S1Bphage_readsbu    getorf  ORF     6115    7533    .       -       .       Target=pep_C600S1Bphage_readsbu_6115_7533_r 1 473;ID=orf_C600S1Bphage_readsbu_6115_7533_r;Name=C600S1Bphage_readsbu_1696;md5=3e232381f00375467a42a29af36c7a82
C600S1Bphage_readsbu    getorf  polypeptide     1       473     .       +       .       ID=pep_C600S1Bphage_readsbu_6115_7533_r;md5=d5a16d4fa73bd6f3aff60da7cce93b82
C600S1Bphage_readsbu    Pfam    protein_match   64      158     4.2E-13 +       .       date=25-04-2019;Target=null 64 158;ID=match$20_64_158;signature_desc=Tail fibre protein gp37 C terminal;Name=PF12604;status=T;Dbxref="InterPro:IPR022246"
C600S1Bphage_readsbu    Pfam    protein_match   170     334     2.2E-20 +       .       date=25-04-2019;Target=null 170 334;ID=match$20_170_334;signature_desc=Tail fibre protein gp37 C terminal;Name=PF12604;status=T;Dbxref="InterPro:IPR022246"
C600S1Bphage_readsbu    Pfam    protein_match   2       59      1.2E-11 +       .       date=25-04-2019;Target=null 2 59;ID=match$20_2_59;signature_desc=Tail fibre protein gp37 C terminal;Name=PF12604;status=T;Dbxref="InterPro:IPR022246"
C600S1Bphage_readsbu    Gene3D  protein_match   75      133     1.5E-6  +       .       date=25-04-2019;Target=null 75 133;ID=match$21_75_133;Name=G3DSA:2.20.220.30;status=T
C600S1Bphage_readsbu    Gene3D  protein_match   181     243     2.9E-6  +       .       date=25-04-2019;Target=null 181 243;ID=match$21_181_243;Name=G3DSA:2.20.220.30;status=T

The input for interproscan is my assembled contig of size 88315.

When I load this interproscan GFF file in GenomeView, I see an issue. In the example above, the ORF annotated by interproscan (6115 to 7533) lines up correctly with the reference contig in the viewer because its coordinates are contig coordinates. The protein matches, however, do not: they are numbered by protein residue (for example 64 to 158), so the viewer places them at positions 64 to 158 of the reference contig instead of inside the region 6115 to 7533. In the attachment (https://ibb.co/XZJKZNx) you can see that all the protein matches annotated by interproscan pile up at the beginning of the reference contig.

Could you please let me know how I can align each protein match to its proper position on the reference contig? I would be very thankful for your response and solution. Please let me know if something is not clear.

Thanks

ADD REPLY • link 5.0 years ago by DanielC ▴ 170


Can you remove this post and add it again as a comment (on my answer, for instance), as this is not really an answer to the original question?

This way we try to keep the threads logically organised. Thanks.

And yes, your observation is correct. The viewer (all of them, by the way) will always use the genomic sequence as the reference, not the protein on which the interproscan domain coordinates are defined.

In general I do not see any advantage in loading interproscan results into a genome browser when the goal is to structurally annotate genes. They are only useful for assigning potential functions to genes, which is not something you should do via a genome browser/editor.

ADD REPLY • link 5.0 years ago by lieven.sterck 15k
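
For readers who still want the domain hits drawn at their genomic positions, the protein coordinates can be converted to contig coordinates before loading. The following is only a minimal sketch, not part of interproscan or GenomeView: the function names are hypothetical, and it assumes each ORF is a single uninterrupted interval (no introns, as is typical for phage genes), that residue 1 corresponds to the first codon of the ORF, and that you have already paired each protein_match line with its parent ORF.

```python
def protein_to_genome(orf_start, orf_end, strand, aa_start, aa_end):
    """Map 1-based inclusive protein residues to 1-based inclusive contig positions.

    Assumption: the ORF is one ungapped interval and residue 1 is its first codon.
    """
    if strand == "+":
        g_start = orf_start + (aa_start - 1) * 3
        g_end = orf_start + aa_end * 3 - 1
    elif strand == "-":
        # On the reverse strand, residue 1 sits at the high-coordinate end of the ORF.
        g_start = orf_end - aa_end * 3 + 1
        g_end = orf_end - (aa_start - 1) * 3
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    if g_start < orf_start or g_end > orf_end:
        raise ValueError("protein match extends beyond the ORF")
    return g_start, g_end


def remap_match_line(match_line, orf_start, orf_end, strand):
    """Rewrite start, end and strand (columns 4, 5, 7) of one tab-separated GFF line."""
    cols = match_line.rstrip("\n").split("\t")
    g_start, g_end = protein_to_genome(
        orf_start, orf_end, strand, int(cols[3]), int(cols[4])
    )
    cols[3], cols[4], cols[6] = str(g_start), str(g_end), strand
    return "\t".join(cols)


# First Pfam hit from the example: residues 64-158 of the ORF at 6115-7533 (- strand).
print(protein_to_genome(6115, 7533, "-", 64, 158))  # -> (7060, 7344)
```

With the ORF from the example (6115 to 7533 on the minus strand, 473 residues), the Pfam hit at residues 64 to 158 lands at contig positions 7060 to 7344, inside the gene rather than at the start of the contig.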


Thanks, I understand the point. With the data I have, could you please tell me how I should go about manually correcting the annotation? I mean, what logic should I use to correct the annotations obtained from prokka and to assign functional annotation? Thanks.

ADD REPLY • link 5.0 years ago by DanielC ▴ 170

score 0 · Answer 1 · 2019-04-24


5.0 years ago

lieven.sterck 15k

GenomeView is well suited for this goal.

It can visualise all of these data types, allows on-the-fly editing of gene models, supports multiple output formats, and much more.



Thanks. When I try to load the SAM file, it does not load; the format is not recognized. Could you please let me know how to load a SAM or BAM file (reads mapped onto the assembled contig)?

Thanks!

ADD REPLY • link 5.0 years ago by DanielC ▴ 170


You will need to provide sorted files, I think. I checked, and it seems the reads need to be in BAM format, so convert your SAM to BAM and then load it.

There is ample documentation on the website on how (and what) to load: http://genomeview.org/manual/Main_Page

ADD REPLY • link 5.0 years ago by lieven.sterck 15k
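
The SAM-to-BAM conversion suggested above could be scripted roughly as follows. This is a sketch that assumes samtools (version 1.x) is installed and on your PATH. The helper names and the ".sorted.bam" naming are illustrative choices, not something GenomeView requires.

```python
import subprocess
from pathlib import Path


def samtools_commands(sam_path):
    """Build the samtools calls that turn a SAM file into a sorted, indexed BAM."""
    sam = Path(sam_path)
    # Hypothetical naming convention: reads.sam -> reads.sorted.bam
    bam = sam.with_name(sam.stem + ".sorted.bam")
    return [
        # Coordinate-sort the alignments; the .bam extension selects BAM output.
        ["samtools", "sort", "-o", str(bam), str(sam)],
        # Write the .bai index next to the BAM so a viewer can jump to any region.
        ["samtools", "index", str(bam)],
    ]


def sam_to_sorted_bam(sam_path):
    """Run the conversion; raises CalledProcessError if samtools reports a failure."""
    commands = samtools_commands(sam_path)
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return commands[0][3]  # path of the sorted BAM that was written
```

Calling sam_to_sorted_bam("reads.sam") would leave reads.sorted.bam and its index next to the input. The sorted BAM is the file to load in the viewer.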

score 0 · Answer 2 · 2019-04-25

This answer covers only the visualization part, in IGV. Whether you need additional tools such as Artemis or WebApollo to annotate the genome is up to you. I think they are overkill for a genome as small as a phage's; you are probably better off editing the GFF file directly. IGV does not allow editing.

Visualizing the data in IGV is straightforward. First download and install it; then you can simply load all files using File -> Open. It supports the FASTA, GFF, SAM and BAM formats. I recommend you generate a .genome file first:

Menu: Genomes -> Create .genome file
Enter a unique ID, your FASTA file and the genome annotation GFF, then click OK
Open the .genome file: Genomes -> Load Genome from File
Open the rest of your files (BLAST GFF, BAM/SAM) using File -> Open
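
As an alternative to clicking through the menus, the same loading order can be written as an IGV batch script. The sketch below only generates the script text. The file names are placeholders, it loads the contig FASTA directly as the genome instead of building a .genome file, and it assumes the reads have already been converted to a sorted, indexed BAM.

```python
def igv_batch_script(genome_fasta, tracks):
    """Return IGV batch commands that load a genome and then each track in order."""
    lines = ["new", f"genome {genome_fasta}"]    # start a fresh session, set the reference
    lines += [f"load {track}" for track in tracks]  # one "load" command per track file
    return "\n".join(lines) + "\n"


# Placeholder file names; substitute your own paths.
script = igv_batch_script(
    "contig.fasta",
    ["prokka.gff", "blast_hits.gff", "reads.sorted.bam"],
)
with open("load_tracks.igv", "w") as handle:
    handle.write(script)
```

The resulting load_tracks.igv can then be run from within IGV via Tools -> Run Batch Script. Paths containing spaces may need quoting, which this sketch does not handle.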
