Maker annotation failure
8.8 years ago
steven ▴ 70

Hello, I am using the latest version of MPI maker: http://www.yandell-lab.org/software/maker.html

My MAKER jobs have been running since the 16th, and some time during the 19th the jobs stopped making progress. They all stopped after a series of four RETRY messages followed by four FAILED messages; they never reach DIED_SKIPPED_PERMANENT.

I ran MAKER on an assembled genome of ~200,000,000 base pairs, with ~30,000 EST and protein sequences from the same order.

Errors: http://pastebin.com/GX72Cimf

Assembly • 5.0k views

You need to provide more information: what commands did you run, and what error message did you get? Which subprogram within MAKER fails? With the amount of information you've given, I don't think anyone can help you.


200,000,000 scaffolds?! Is each read its own scaffold? This sounds a bit fishy to me.


Sorry...I meant that to read "base pairs", nice catch


From the errors, you might want to check: (1) that all the BLAST executables are in your PATH (maker_exe.ctl is configured automatically during installation); (2) that the EST sequences are DNA and the proteins are amino acids; (3) that all your input sequences have unique IDs (preferably short); if not, recode them to plain numbers; (4) that repeat masking is enabled (you'll never be able to complete predictions without masking all the repeats).
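Check (3) can be done with a few lines of Python. This is a minimal sketch, not part of MAKER: it assumes the sequence ID is the first whitespace-delimited token of each header, and the `seq` prefix and the helper names are made up for illustration. It keeps an old-to-new mapping so the original headers can be recovered later.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def duplicate_ids(records):
    """Return IDs (assumed: first token of the header) that occur more than once."""
    counts = Counter(h.split()[0] if h.split() else "" for h, _ in records)
    return sorted(i for i, n in counts.items() if n > 1)


def recode(records, prefix="seq"):
    """Rename records to prefix1, prefix2, ... and keep a (new_id, old_header) map."""
    renamed, mapping = [], []
    for n, (header, seq) in enumerate(records, start=1):
        new_id = f"{prefix}{n}"
        renamed.append((new_id, seq))
        mapping.append((new_id, header))
    return renamed, mapping


# Toy input; in practice, pass an open file handle instead of this list.
example = """>gi|1|gb|AAA.1| first
ACGT
>gi|1|gb|AAA.1| duplicate
GGCC
>gi|2|gb|BBB.1| other
TTAA
""".splitlines()

records = list(read_fasta(example))
print("duplicated IDs:", duplicate_ids(records))

renamed, mapping = recode(records)
for new_id, seq in renamed:
    print(f">{new_id}\n{seq}")
```

Writing `mapping` to a tab-separated file alongside the recoded FASTA makes it possible to translate the short IDs back after annotation.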


What exactly is repeat masking and which options would I want to change?

I'm going to try MAKER with the newest version of BLAST tomorrow. Also, I downloaded all of the ESTs/proteins directly from the respective NCBI databases using Biopython, so I don't think there are any issues there; I confirmed they were all in FASTA format with a quick script. I will also try recoding the IDs to numbers, since there may be duplicates (species sequences also appearing among the order sequences).
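A format check does not tell you whether a file that should hold nucleotides actually holds protein (or the reverse). The sketch below is a rough heuristic, not something MAKER or BLAST provides: it calls a sequence nucleotide if at least 90% of its letters are in `ACGTUN`. Both the threshold and the alphabet are assumptions chosen for illustration.

```python
DNA_LETTERS = set("ACGTUN")


def dna_fraction(seq):
    """Fraction of characters that are plausible nucleotide codes."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(ch in DNA_LETTERS for ch in seq) / len(seq)


def guess_type(seq, threshold=0.9):
    """Heuristic only: 'nucleotide' if most letters are A/C/G/T/U/N, else 'protein'."""
    return "nucleotide" if dna_fraction(seq) >= threshold else "protein"


est_like = "ACGTACGTNN"
protein_like = "MKTAYIAKQR"

print(guess_type(est_like))
print(guess_type(protein_like))
```

Running this over every record of the EST and protein files and counting the mismatches is usually enough to spot a file that was swapped or mislabeled.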


Hi, a couple of questions to make it easier to help you.

In which format did you provide the ESTs and proteins (FASTA, FASTQ, something else)? Can you maybe give an example of each?

head -n 20 est_file > example_est.txt
head -n 20 protein_file > example_prot.txt

In my experience, MAKER is really picky about the headers it can handle.

For example, I see a line in your errors that says "Title is very long: 1038 characters (max is 1000)", so maybe your headers are really huge.

What does your maker_opts.ctl file look like? Maybe there is a wrong path somewhere or you forgot to turn on some setting.

Also, to answer your question about RepeatMasker: it finds low-complexity repeats, transposons, etc. and masks them with N's before annotation. This way you get an annotation of the repeats, and it reduces computation time when annotating the rest of the genome.
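To make "masking" concrete, here is a toy sketch. It does not detect repeats (that is RepeatMasker's job); the repeat interval is supplied by hand, and the function names are made up. Hard masking replaces the repeat with N's as described above, while soft masking lowercases it instead.

```python
def hard_mask(seq, intervals):
    """Replace each 0-based, half-open [start, end) interval with N's."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


def soft_mask(seq, intervals):
    """Lowercase each 0-based, half-open [start, end) interval."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


genome = "ACGTACGTATATATATGGCC"
repeats = [(8, 16)]  # hand-picked position of the ATATATAT run

hard = hard_mask(genome, repeats)
soft = soft_mask(genome, repeats)
print(hard)
print(soft)
```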

Have you looked at the GMOD training? It explains all of that stuff in detail.

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014


The ESTs and proteins were in FASTA format. I used usearch to create centroid FASTA files after downloading all available sequences from NCBI with Biopython. Additionally, I grepped out all of the headers, and none immediately looked 1000 characters long, but I did not confirm this programmatically. Here is an example of each:

ests: http://pastebin.com/NAJ2fHY4

proteins: http://pastebin.com/9T6m0UGi

opts file: http://pastebin.com/803dEXRL

I replaced the paths for privacy, but they are all valid paths. For altEST, I separated two paths with a comma. I am currently running Maker with JUST the species nucleotide sequence file to see if it completes without errors (protein2genome and est2genome turned off). Thanks for the reply

8.8 years ago
Lesley Sitter ▴ 600

Hi,

And your scaffolds also don't have very long headers?

Maybe try:

grep '>' fasta_file.fa | wc -L

this should give the length of your largest header.
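If you also want to know which headers are too long, not just the maximum length, a short Python sketch like the one below can list them. The limit of 1000 is taken from the "max is 1000" error quoted earlier in the thread; measuring the header without the leading `>` is an assumption about how that limit is counted.

```python
def long_headers(lines, limit=1000):
    """Return (line_number, length) for FASTA headers longer than `limit`.

    Length is measured without the leading '>' and trailing newline
    (an assumption about how the title limit is counted).
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            length = len(line.rstrip("\n")[1:])
            if length > limit:
                hits.append((number, length))
    return hits


# Toy input; in practice, pass an open file handle instead of this list.
example = [">" + "x" * 1038 + "\n", "ACGT\n", ">short_id\n", "GGCC\n"]
offenders = long_headers(example)
print(offenders)
```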

One problem I had (I don't remember whether it was in the pre-processing step or in MAKER/BLAST itself) was that gi headers couldn't be processed properly. The | character, blank spaces and * gave errors.

Maybe make a small subset of your genome assembly (for example, only one chromosome/scaffold/contig) and test whether it works with EST and protein data that do not have these characters in the headers:

sed 's/[^=>]*|*|//' file_in.fa > file_out.fa      # Strip everything after '>' up to and including the last |
sed '/^$/d' file_in.fa > file_out.fa              # Remove blank lines
sed 's/\*$//' file_in.fa > file_out.fa            # Remove a trailing * from the end of a line (keeps the sequence)
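For the small test subset, a sketch along these lines keeps only the first record(s) of a FASTA file. The function name is made up for illustration, and it simply takes records in file order rather than picking the longest scaffold.

```python
def take_records(lines, n=1):
    """Return the lines belonging to the first `n` FASTA records."""
    kept, seen = [], 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        if seen >= 1:
            kept.append(line)
    return kept


# Toy input; in practice, pass an open file handle instead of this list.
example = [">scaffold_1\n", "ACGTACGT\n", "GGCC\n", ">scaffold_2\n", "TTAA\n"]
subset = take_records(example, n=1)
print("".join(subset), end="")
```

With real files, something like `open("subset.fa", "w").writelines(take_records(open("genome.fa")))` would write the subset; both file names here are placeholders.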

One last remark: the two paths you set for the altEST files might be a problem.

Have you tried concatenating both FASTA files into one and giving just one path? I never read anywhere that MAKER can handle multiple paths in its variables, but that might just be something I missed because I never needed to do it.
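Concatenating is simple, but a file that lacks a trailing newline will glue its last sequence line to the next header. A minimal sketch that guards against that is below; the file names are placeholders written to a temporary directory, and it does not deduplicate IDs across files.

```python
import os
import tempfile


def concat_fasta(paths, out_path):
    """Concatenate FASTA files, making sure each one ends with a newline."""
    with open(out_path, "w") as out:
        for path in paths:
            last = ""
            with open(path) as handle:
                for line in handle:
                    out.write(line)
                    last = line
            if last and not last.endswith("\n"):
                out.write("\n")


# Demonstration with two tiny placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "alt_est_a.fa")
    second = os.path.join(tmp, "alt_est_b.fa")
    merged = os.path.join(tmp, "alt_est_all.fa")
    with open(first, "w") as f:
        f.write(">a1\nACGT")  # deliberately no trailing newline
    with open(second, "w") as f:
        f.write(">b1\nGGCC\n")
    concat_fasta([first, second], merged)
    with open(merged) as f:
        merged_text = f.read()

print(merged_text, end="")
```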

Let me know if any of this worked. If not, I can't see anything else wrong here, sorry.


So, I believe I was getting all of the errors because of an older BLAST version - I updated BLAST and now there are no errors in the output file. My only concern is this warning, do you think this will cause any problems with tblastx?

/common/opt/bioinformatics/ncbi-blast-2.2.31+/bin/tblastx: /lib64/libz.so.1: no version information available (required by /common/opt/bioinformatics/ncbi-blast-2.2.31+/bin/tblastx)

This looks like an outdated zlib problem. If I'm reading it correctly, you probably just have to update zlib and the warning will go away.

http://sourceforge.net/p/samtools/mailman/message/25005099/

(Ignore the fact that the thread is about samtools; just read the replies below the first message.)


Thanks. Well, I got the "Maker has completed !" message, so it looks like everything worked out.
