BLAST based on 100k sequences rnaseq?
1
0
Entering edit mode
5.7 years ago

Hello I think I have a quick question:

I received a demo dataset of fastq rna-seq data: My assignment is to find out from which organism they are. I tought of using BLAST to perform this task. I followed the biostar handbook, first converting the fastq sequences to fasta, and then using the BLAST web tool.

When entering my file with 100k sequences, it will return an error. When feeding it my first 5 sequences, I get the results for the first one, and I can then choose the results for the next files from the drop down menu.

So I understand it will present me the results for just a copule of sequences. But is there a functionality that will evaluate the results of all 100k sequences?

I don't really know if I should make an assembly, or whether I should invest in running the blast locally, and then looking at the organisms of the #1 result from each blast...?

RNA-Seq blast • 2.2k views
ADD COMMENT
1
Entering edit mode

I'd agree with using a tool like Kraken/Centrifuge. Blasting that many sequences will take a long time (I've done it), and the data that comes back is so enormous is almost impossible to do anything with it. I once BLASTed a set of sequencing reads to figure out what was in it, even with some very stringent BLAST parameters, it took a week to complete, and the output file was 19 GiB. Too unweildy to do anything with, but I was curious...

ADD REPLY
0
Entering edit mode

I don't know all that much about blasting and the resources it takes. You think I can figure something out that returns me the organism of the first match of each sequence? Then I'd be able to just make a report of those...

ADD REPLY
0
Entering edit mode

It would be good to clarify if you have any reason to believe (after the initial blasts) that there is more than one organism represented in your data.

ADD REPLY
0
Entering edit mode

Is it assumed that they're all from the same organism? If yes, maybe you should just subsample e.g. 10 reads and blast those?

ADD REPLY
2
Entering edit mode
5.7 years ago
GenoMax 141k

Are you only looking to see if there any contamination or are you interested in "everything that is in there and I expect lots of different things" scenario?

For 100K sequences you need to run blast locally. Instead of blast using DIAMOND against NCBI nr collection would be one option. This does require pretty beefy hardware. Plast is also another option.

You can also look into Kraken or kaiju, if these are metagenomic type sequences.

ADD COMMENT
1
Entering edit mode

Or centrifuge, which comes from the same lab as Kraken. If you use this fork, you can even run it with shared memory in an hpc environment.

ADD REPLY
0
Entering edit mode

Thanks for those suggestions. Maybe some more info: I have received rna-seq data and I want to know what organism it is from, that is the assignment. Since the only thing I received are the reads, I can't just choose a certain gene/protein to blast.. So - should I blast every sequence, and then look at the common highest ranked organism for each sequence? They assured me that the set was small enough. It contains 100k sequences of 49bp each.

I need to perform some other tasks, and rna-seq analysis is new to me, that's why I wanted to use the web based tools available instead of figuring out how to make it work locally.

ADD REPLY
1
Entering edit mode

Web-based with RNA-seq is not going to work at all, at least not in a feasible manner ...

This definitely sounds like a task for centrifuge. However, you need to be aware that given the composition of current sequence databases, your results are not necessarily what is actually in the sample. You might be safe at the genus level, but unless your organism is actually present in the database, you might not get the exact species. Is this an assignment from class? In that case the latter concern is probably void, unless they want to see how you tackle this kind of problem.

ADD REPLY
1
Entering edit mode

I have received rna-seq data and I want to know what organism it is from, that is the assignment.

That specific task is a simple blast search with 10 reads at most (unless you have been given a deliberately contaminated sample).

ADD REPLY
1
Entering edit mode

If this is a homework/assignment question you should be transparent about that in the opening post.

If that is the case, I doubt they’ve have given you mixed data, in which case follow genomax’s suggestion of just blasting a handful of reads and any one of them will likely reveal the original organism.

ADD REPLY
1
Entering edit mode

But only if it is an artificial assignment or something on a well-represented organism. I have RNA-seq data, which definitely does not tell you the species (I know what it is). As it is not represented in, say, nt or nr, I would even go so far that not one single read in my dataset will hit the species or even genus. The closest I get is to the subfamily level with a blastx/blastn search, but there are a lot of reads floating around that match within the same class or even kingdom.

ADD REPLY
1
Entering edit mode

An artifical assignment was what I meant.

What you are talking about is a different issue though.. If it doesn't exist in nr/nt, then why would you ever suppose BLAST would work in the first place?

If its a novel organism, the best you can do is try to place it phylogenetically. You could in principle do this with RNA reads - it might not be the ideal data for the purpose, but it would be doable to some degree.

ADD REPLY
1
Entering edit mode

I was responding to genomax's "10 reads at most" and giving an example where that doesn't work. But of course, you're right, if you know beforehand it's not in nt/nr, then you'd not blast against those dbs for species-level identification purposes.

ADD REPLY
0
Entering edit mode

Thanks everyone, I'll make sure to be clear about whether it is an assignment in the future. I doubt they'll give me some weird data, but I already thought that blasting a handful of sequences has its caveats. From your discussion I believe that it's not clearly defined what to do in case it IS data from a novel organism/mixed data.

ADD REPLY
1
Entering edit mode

In all normal RNAseq datasets using a few reads is generally sufficient to identify organism in question, at genus level for certain. If you know upfront that you are working with an uncharacterized sample then it is a special case.

If this is indeed an assignment (not confirmed as such) then this part mentioned in the original question is perhaps most important

But is there a functionality that will evaluate the results of all 100k sequences?

What species the sequences belong to is perhaps not as important.

ADD REPLY
0
Entering edit mode

Yes I only need to identify the organism in order to proceed with mapping the reads on it's reference genome. I already looked at a handful of blast results, but I just wondered whether there was another way where I could report this in more detail. I will update my original question.

ADD REPLY
1
Entering edit mode

If you now know the organism you are working with then find their genome page from NCBI's genome section. Then from that page you should be able to download the protein sequences (here is an example of human genome, protein download links should be up at top of the genome page). You can then create a blast index and then blast your reads against this set.

ADD REPLY

Login before adding your answer.

Traffic: 2802 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6