Hello,
I have a few questions regarding nhmmscan. I am very new to using HMMs, hmmscan, etc.
I am trying to replicate the identification of MEI insertions in the PacBio-sequenced NA12878 genome from Pendleton et al., 2015. They state in the second-to-last paragraph of the supplementary data that they used nhmmer with Dfam, using the script 'dfamscan.pl' with default parameters.
I am only interested in identifying the L1Hs in this genome, and I know from the paper there are 118 of them.
I tried running the script with default parameters, but after a week of running (with no standard output to indicate it was actually doing anything), I killed it and decided to run with just the L1HS 5' HMM instead. That run has now been going for over 24 hours.
I guess my first question is, am I running it right? My command is as below:
perl /media/RAID/rdunbar/hmmer/dfamscan.pl -fastafile /media/RAID/rdunbar/hmmer/corrected_reads_gt4kb.fasta -hmmfile /media/RAID/rdunbar/hmmer/L1HS_L1/DF0000225.hmm -dfam_outfile /media/RAID/rdunbar/hmmer/Results/PacBio_Dfam_hits_DF0000225.out
The hmm file was obtained from here: http://dfam.org/entry/DF0000225
The FASTA file contains the cleaned reads from the PacBio NA12878 run, and is 60G: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/corrected_reads_gt4kb.fasta
My second question is: with a 60G FASTA file, and searching with only one element HMM, roughly how long should this take?
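One way I could estimate this myself would be to time the search on a small subset of the reads and extrapolate. Below is a minimal Python sketch for pulling out the first N records of a FASTA file. The function name and the output file name are hypothetical, and the extrapolation assumes runtime grows roughly linearly with the number of reads, which I have not verified.

```python
def write_fasta_subset(src_path, dst_path, max_records):
    """Copy the first max_records FASTA records from src_path to dst_path.

    Returns the number of records actually copied. The input is streamed
    line by line, so a very large file is never loaded into memory.
    """
    count = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # A header line starts a new record; stop once we have enough.
                if count == max_records:
                    break
                count += 1
            if count > 0:
                dst.write(line)
    return count


if __name__ == "__main__":
    # Hypothetical file names and subset size, for illustration only.
    n = write_fasta_subset("corrected_reads_gt4kb.fasta", "subset.fasta", 10000)
    print(f"wrote {n} records")
```

Timing the same search on the subset, and scaling by the ratio of total reads to subset reads, would give a rough figure for the full file.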
Running top shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6231 rdunbar 20 0 960284 93000 3188 S 94.4 0.2 1348:37 nhmmscan
Also, top seems to show nhmmscan in the S (sleeping) state almost constantly. In the terminal, this is all that has been displayed since yesterday:
rdunbar@plymouthcruncher:/media/RAID/rdunbar/hmmer/hmmer-3.1b2/src$ ./nhmmscan /media/RAID/rdunbar/hmmer/L1HS_L1/DF0000225.hmm /media/RAID/rdunbar/hmmer/corrected_reads_gt4kb.fasta > /media/RAID/rdunbar/hmmer/Results/PacBio_nhmmer_hits_DF0000225.out
Any help would be most appreciated.
Kindest regards, Roxane