Tips to speed up read.fasta() in R for large files?
4
1
Entering edit mode
8.9 years ago

I found some tips for speeding up read.table() here, I wondered if anyone can suggest something for read.fasta(), which is part of the seqinr library. Not that they're really comparable.

I have a ~6GB file.

fasta sequence R • 9.5k views
ADD COMMENT
0
Entering edit mode

Why do you need R? Can't you use another program for your project? linux pipeline, C, etc...

ADD REPLY
0
Entering edit mode

No particular reason, I'm slightly more familiar with R and I'm not sure how to translate my script into another language

ADD REPLY
2
Entering edit mode
8.9 years ago
SES 8.6k

Since that function is written in R I don't think optimizing your code will result in much of a speed up compared to using a library written in another language. If you want to work in R, it would probably be easier to use the system() command to transform your sequences with another program or just use command line as Pierre suggested. If you have a specific task I'm sure people can provide a solution.

ADD COMMENT
0
Entering edit mode

Thanks @SES.

My post on the script itself is here, which now works fine (but slowly for .fasta files) thanks to @Martombo's comment.

Any suggestions and examples on moving it over to another program or another language would be a great learning experience.

I have no particular allegiance to R, other than its been useful for me so far and I've found a lot of relevant posts for what I'm doing to be using R. I wasn't aware of its speed limitations until now.

ADD REPLY
2
Entering edit mode
8.9 years ago
cyril-cros ▴ 950

If you have a huge file (just reread your post, 6Gb), the best way is to use an index (this is how databases work...). Take a look at this link: http://stackoverflow.com/questions/23173215/how-to-subset-sequences-in-fasta-file-based-on-sequence-id-or-name

It depends on what you want to do; do you only need a subset of these sequences at any given time or do you require every one of them? You would need a way to load and store them in your RAM, rather than reading a 6Gb file each time you tinker your file. NB: you can increase R's memory to avoid errors.

ADD COMMENT
0
Entering edit mode

definitely reading an indexed or maybe a compressed file is the solution here.

ADD REPLY
0
Entering edit mode

Sounds ideal!

However when I follow the instructions in that link, when I index my file in R, when I come back out the size of my original fasta is about a 10th of what it was originally, and its only indexing about 90 out of 2000 sequences (big sequences). What's happening there so that my original file gets overwritten!

library(Rsamtools)
indexFa("foo.fasta")

My fasta file is of the form:

>123456
AGTGATGATGATGATGATGCAGCTAGACTGCTAGCTACGTCAGACT(...)
>98765432
AGTCGATGCATCGCATACGACTAGCTCAGCATCGACTACGTACGATC(...)
ADD REPLY
1
Entering edit mode

Can you confirm this? You should get an index file in .fai format, and no changes to your old file. Please check if the lines with dna sequences have a consistent length.

EDIT

What would make fasta file indexing stop prematurely? same problem as you. Fixed by running the code below, rather than indexFa("yourFileName")

 library(Rsamtools)
 file_path <- "myfile.fa"
 indexFa(file_path)
ADD REPLY
0
Entering edit mode

Thanks cyril-cros. Indexing looks like the sensible way to go, however how would I now go about incorporating it into my script (below), which other than reading in this large fasta file works fine, where now instead of read.fasta() I'm additionally calling on a .fai file I assume to only read in one entry at a time? I'm unfamiliar with how index files work basically.

#!/usr/bin/Rscript

library(seqinr)

infas=read.fasta("test.fa")
posi=read.delim("positiontest.txt", header=FALSE, sep="\t")
outfas=file("orf2seqs.fa", 'w')

# construct function to extract subsequences from infas based on start=stop positions defined in posi. e.g. in posi there may be 3 subsequences defined per fasta file entry (gi) which are to be extracted, each line read into posi is one such subsequence: gi, start, stop, strand.

doitall=function(x) {
    seq_id <- x[[1]];
    start <- x[[2]];
    stop <- x[[3]];
    seq <- infas[[as.character(seq_id)]];
    strand <- x[[4]]
    seqv <- seq[start:stop]
    header <- paste(sep="", ">", attr(seq, "name"), ":", start, "-", stop, ."_(", strand, ")", "\n")
    cat(file=outfas, header)
    cat(file=outfas, toupper(paste(sep="", collapse="", seqv)), "\n")
}`

# execute function. invisible isn't necessary, just hides the NULL
invisible(apply(posi, 1, doitall))
close(outfas)
ADD REPLY
1
Entering edit mode
8.9 years ago
John 13k

There are only two ways to speed up loading data - buy a faster hard drive, or read less data from it :P

I can't help with the former, but for the latter you could try converting your FASTA files to 2bit and load that. A 6Gb FASTA file will only be 1.5Gb in 2bit format - and that's before compression. It might even fit into RAM as 2bit, and a quick Google suggests that there are packages in R which understand the 2bit format. I don't know if it then converts it to something else after reading it... who knows.

But 6Gb isn't much data really... hard drives typically read over 100MB/s, so unless 60 seconds is too long for you, I doubt getting more bang for your RPM is going to make a difference. Most likely the logic part of your program is the slow part, not the reading part. R is a famously slow language, particularly in its loops, which is why packages like plyr exist to try and speed things up.

But all of this is just speculation - since we don't know what your program does, we cannot really say how to speed it up. If you really want to speed up your code (and become a better programmer in the process) definitely check out this: http://adv-r.had.co.nz/Profiling.html

Good luck! :)

ADD COMMENT
0
Entering edit mode
8.9 years ago
cyril-cros ▴ 950

I advise you to read the GRanges documentation on biocLite, as well as Rsamtools and its getSeq method from Rsamtools.

  • To read a fasta file efficiently using an index, use the method FaFile(filepath) - I don't know if read.fasta does the same
  • To store and manipulate annotation data or position records in the form <gi, start, stop, strand>, use a GRange object.
  • You can then access all the sequences corresponding to records in a GRange using the method getSeq
ADD COMMENT

Login before adding your answer.

Traffic: 2492 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6