What would make fasta file indexing stop prematurely?
5
0
Entering edit mode
8.9 years ago

I'm following the pointers in this post, regarding the use of Rsamtools and the indexFa() function to index a large fasta file of around 7GB, about 2000 nt sequences. However only the first ~87 are indexed, I see no reason why it would stop at this point unless it reaches a limitation.

What's more, the file the function is operating on, indexFa("seqs.fa"), becomes truncated to these first 87 entries also (I don't know how the function works, but that was quite unexpected). Any solution to this or other methods of indexing a large fasta file to call in sequences by name to avoid trying to read in the whole fasta file with read.fasta() in one go (which gets Killed prematurely too).

Thanks again,

library(Rsamtools)
indexFa("seqs.fa")   # I think this is stumbling point
fa = FaFile("seqs.fa")
gr = as(seqinfo(fa), "GRanges")
getSeq(fa, gr[2:1])
idx = pmatch("296783888", names(gr)) 
seq = getSeq(fa, gr[idx])
fasta sequence gene R • 3.0k views
ADD COMMENT
1
Entering edit mode
8.9 years ago

several reasons can explain this

1. Fasta files has to have a first line starting with ">", then the name, then some optional commentaries. The next line must be compulsorily DNA data. If some commentaries go to the second line, the parsing of the fasta file can fail

2. The presence of letters which are not nucleic acid

If the fasta file fail in loading you have several choices to discover where. One is a Fasta Validator

ADD COMMENT
0
Entering edit mode

Thanks Antonio, on 2. would "N"'s qualify as non-nucleic acid? What's the standard approach to deal with N: remove or mutate?

ADD REPLY
1
Entering edit mode

"N" (ambiguous base call) is perfectly acceptable in FASTA DNA sequences.

ADD REPLY
0
Entering edit mode

Okay that's not the reason then. Nor is variable length.

ADD REPLY
1
Entering edit mode
8.9 years ago

To index the file using this method, all sequence lines (except the last) in each entry need to be of a consistent length. There should also be no blank lines between entries.

ADD COMMENT
0
Entering edit mode

Thanks Matt, my sequences fall short on the consistent length criterion. However I don't find this to be the case with some test files.

ADD REPLY
1
Entering edit mode
8.9 years ago
cyril-cros ▴ 950

In R try using memory.limit(size=7000)to increase your RAM max allowance - it should be around 2Go by default. You need a 64bit system AND enough RAM obviously.

Size is in Mb, and just using memory.limit() will display your current setting.

ADD COMMENT
0
Entering edit mode

Thanks, I expect memory has held me back at one point here, though I've managed to use Rsamtools indexing by changing my commands slightly (above). +1

ADD REPLY
0
Entering edit mode
8.9 years ago

It might help if you could explain what you are trying to accomplish, and the characteristics of your computer, both physical and software.

ADD COMMENT
0
Entering edit mode

I am trying to read in a large fasta file to R, to extract subsequences from. It's too big for read.fasta() however so it was suggested I index the file and only call the sequence when its required for extraction; but I'm having problems with this Rsamtools idx as described above. I'm on a Mac OS if it makes a difference.

ADD REPLY
0
Entering edit mode
8.9 years ago

I'm coming to the conclusion its a memory thing, which I'd hoped indexing would circumvent. When I open the fasta file, its only loads up to the 87th entry, which is the same number that it is able to index in R. However if I grep the fasta headers "^>" I can see the rest (all 2000 odd) are still in the file, and you can tell it based on the file size too. Any way to expand memory or am I going to have to do it in 20 or 30 chunks?

ADD COMMENT
0
Entering edit mode

How big is the file?

ADD REPLY
0
Entering edit mode

7Gb

ADD REPLY
0
Entering edit mode

Can you index it using samtools faidx?

ADD REPLY
0
Entering edit mode

Is that different to the one I'm using above? I don't think it will index which is my current problem

ADD REPLY
0
Entering edit mode

It's a different implementation of the same method so might inform you of what the error is. Also could use "pyfaidx" which includes a faidx program that will tell you what causes the indexing to fail.

ADD REPLY
0
Entering edit mode

So turns out if I change the commands slightly the .fai file is produced in full, and it doesn't simultaneously cut down my orginal file size to a fraction of what it was:

 library(Rsamtools)
 file_path <- "myfile.fa"
 indexFa(file_path)

Previously I was running:

 library(Rsamtools)
 indexFa("myfile.fa")
ADD REPLY
0
Entering edit mode

That doesn't really make sense, but glad you got it to work!

ADD REPLY
0
Entering edit mode

I know. Don't look a gift horse in the mouth eh.

ADD REPLY

Login before adding your answer.

Traffic: 2269 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6