Question

blast against metagenome shotgun data - not getting the hits expected

1

Entering edit mode

5.3 years ago

christina ▴ 10

Hi all,

I am currently performing blast searches for a species of interest by blasting a genome assembly against custom databases of metagenome shotgun data.

Everything seems to be working fine and I am getting hits, if I create a database from a single sample (i.e. paired reads, converted fastq to fasta). But, as soon as I create a custom database from multiple samples, I am only getting hits for a single sample, although I know that there must be hits in the other samples as well, since I tried making databases from every single sample.

So for example:

Lactobacillus crispatus gets - 578 hits from sample A - 1606 hits in sample B - 614 hits from sample C if I blast against individual databases for each sample

Combing the three samples in one custom database, I get 614 hits (and the subject IDs in the output are all from sample C). Not a single hit from sample A or B.

I have also had a look at sequence identity between samples (since its 150bp reads) and although there are some identical reads, I still should get plenty unique hits from each of the other samples.

Does anyone have an idea as to why it behaves that way? I have the feeling that its the sample with the largest amount of sequences, compared to the others. Or alternatively, alphabetically the last one in the database.

Any thoughts are much appreciated! Thanks, Christina

metagenomics blast • 2.3k views

ADD COMMENT • link updated 5.3 years ago by 5heikki 11k • written 5.3 years ago by christina ▴ 10

0

Entering edit mode

It is not clear what is your blast database. Are you creating the database from the raw shotgun metagenomic reads?

Could you post the commands you used?

ADD REPLY • link 5.3 years ago by h.mon 35k

0

Entering edit mode

I downloaded paired raw read fastq files from the SRA archive. Converted to fasta using:

   cat sampleA_R1.fastq | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > sampleA_R1.fasta

combine both runs using

cat sampleA_R1.fasta sampleA_R2.fasta > sampleA.fasta

then create custom database

makeblastdb -in sampleA.fasta -input_type fasta -dbtype nucl -title sampleA -parse_seqids -hash_index -out sampleA

I have done an individual database for sample A, B and C. But I have also made a database for all three samples, by combining fasta files using the cat command.

I want to do many more samples, so creating individual databases for each sample is too tedious.

ADD REPLY • link 5.3 years ago by christina ▴ 10

score 1 · Answer 1 · 2019-01-14

1

Entering edit mode

5.3 years ago

5heikki 11k

You're doing it wrong. You shouldn't create blast databases from reads. As to your results, you're probably hitting some default cap, like e.g. -num_alignments (250) or -max_target_seqs (500) depending on your output format type

ADD COMMENT • link 5.3 years ago by 5heikki 11k

0

Entering edit mode

Thanks for your reply. I had decided to do it this way around, simply because it is less computationally intensive. But I have since tried the reverse and that is working. I am still puzzled by the behaviour though. I tried some other samples and for those I got results that were concordant between individual databases and the concatenated database. That doesn't speak for the default caps, but I will have a look!

ADD REPLY • link 5.3 years ago by christina ▴ 10

score 0 · Answer 2 · 2019-01-13

I have no idea why your blast is behaving this way, but I think you should be performing the searches the opposite you are currently doing: create the database with the genomes of interest, and search the reads against this database. You could use blast (but this would be slow), or some kmer based search software, such as Kraken2.