Question

BLAT against nt

7.3 years ago

Assa Yeroslaviz ★ 1.8k

Hi,

I am looking for a way to download the complete nt database in 2bit format. I need it to run a BLAT search against the unmapped reads from an RNA-Seq experiment. I have a lot of unmapped reads, and one way to identify the source of this large number of reads is to run a BLAT search against the complete nt. But BLAT takes only the 2bit format as input. I know I can convert the FASTA file into 2bit, but I was wondering whether there is a better way than splitting the FASTA file into subsets (the faToTwoBit tool from UCSC can't handle files bigger than 4GB).

Thanks in advance,

Assa

BLAT nt contamination 2bit fasta • 2.2k views

You'll need to split the fasta file regardless of what you do, since the 2bit format itself can't handle more than 4GB (i.e., it doesn't matter what program you use). Maybe just use kraken or something like that instead.
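The splitting step could be sketched roughly as follows. This is only an illustration: the function name `split_fasta`, the output naming scheme, and the size threshold are assumptions, not part of any UCSC tool. It starts a new chunk only at a header line, so no sequence record is cut in half, and it treats the size as a soft limit (a chunk can overshoot by one record), so the threshold should sit comfortably below 4GB.

```shell
# Hypothetical helper: split a FASTA file into chunks at record boundaries.
# usage: split_fasta INPUT.fa MAX_BYTES OUT_PREFIX
split_fasta() {
    awk -v max="$2" -v prefix="$3" '
        /^>/ {
            # Open a new chunk at a header once the current chunk is full
            if (n == 0 || size >= max) {
                if (out) close(out)
                n++
                out = sprintf("%s.%03d.fa", prefix, n)
                size = 0
            }
        }
        {
            print > out
            size += length($0) + 1   # +1 for the newline
        }
    ' "$1"
}
```

For example, `split_fasta nt.fa 3000000000 nt_part` would write `nt_part.001.fa`, `nt_part.002.fa`, and so on (the file names here are placeholders). Each chunk can then be converted on its own, e.g. `faToTwoBit nt_part.001.fa nt_part.001.2bit`.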

ADD REPLY • link 7.3 years ago by Devon Ryan 104k

Oh! I didn't know that. Thanks.

ADD REPLY • link 7.3 years ago by Assa Yeroslaviz ★ 1.8k

Suppose I do want to run BLAT against the complete nt. I have downloaded the complete nt and decompressed it (134GB), then split it into 46 parts (each 3000MB). I am trying to run BLAT via gfServer/gfClient.

I was thinking of doing it like this:

# Start one server over all 2bit files, in the background
gfServer start localhost 12345 ./2Bit.Files/*.2bit &

# Query each FASTA file of unmapped reads against the running server
for file in *.fa
do
   NEW_FILE="${file%.fa}"   # strip the .fa extension
   gfClient -t=dna -q=dna -out=blast8 localhost 12345 ./2Bit.Files/ "$file" "$NEW_FILE.txt"
   gfClient -t=dna -q=dna -out=pslx localhost 12345 ./2Bit.Files/ "$file" "$NEW_FILE.psl"
done

But do I need to run the gfServer command for each of the 2bit files separately?

ADD REPLY • link 7.2 years ago by Assa Yeroslaviz ★ 1.8k

Blat of DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments.

Is that the aim of this search? How many reads are in the "unmapped" pool?

If you are doing this on the command line, why not run blat directly, without the client/server layer?
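A minimal sketch of that direct approach, assuming the nt chunks have already been converted to `.2bit` files in one directory. The function name `blat_all_chunks`, the `BLAT` override variable, and the output naming are illustrative assumptions; the `-t`, `-q`, and `-out` options are the same ones used in the gfClient call above.

```shell
# Hypothetical helper: run standalone blat for one query file against
# every 2bit chunk in a directory, writing one result file per chunk.
# usage: blat_all_chunks DB_DIR QUERY.fa OUT_DIR
blat_all_chunks() {
    db_dir=$1
    query=$2
    out_dir=$3
    base=$(basename "$query" .fa)
    mkdir -p "$out_dir"
    for db in "$db_dir"/*.2bit; do
        [ -e "$db" ] || continue          # skip if the glob matched nothing
        chunk=$(basename "$db" .2bit)
        # BLAT can be set to another executable name; defaults to "blat"
        "${BLAT:-blat}" -t=dna -q=dna -out=blast8 \
            "$db" "$query" "$out_dir/$base.$chunk.blast8"
    done
}
```

Calling `blat_all_chunks ./2Bit.Files sample_mate1.fa results` (placeholder names) would leave one tabular file per chunk in `results/`, which can be concatenated afterwards if a single table per sample is wanted.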

ADD REPLY • link 7.2 years ago by GenoMax 141k

I have over 20 files of unmapped reads (from mate1 and mate2 of paired-end RNA-Seq data). Each FASTA file has several million sequences. I thought it would be more efficient to run it as a server. I'll try running it separately, as that seems to be the last possible option. Thanks.

ADD REPLY • link 7.2 years ago by Assa Yeroslaviz ★ 1.8k

Generally taking a random sample of 20-30 reads and blasting should be sufficient to identify major genomes present (unless you are expecting metagenomic contamination or need to identify every read that is unmapped).
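One way to draw such a sample from a FASTA file is reservoir sampling in awk, sketched below. The function name `sample_fasta` and the default seed are assumptions made for illustration; a dedicated tool would work just as well. Multi-line records are kept intact.

```shell
# Hypothetical helper: print N randomly chosen records from a FASTA file.
# usage: sample_fasta INPUT.fa N [SEED]
sample_fasta() {
    awk -v n="$2" -v seed="${3:-42}" '
        BEGIN { srand(seed) }
        /^>/ {
            count++
            if (count <= n) {
                slot = count                     # fill the reservoir first
            } else {
                r = int(rand() * count) + 1      # uniform in 1..count
                slot = (r <= n) ? r : 0          # replace a slot or skip
            }
            if (slot) rec[slot] = ""
        }
        slot { rec[slot] = rec[slot] $0 "\n" }   # collect header and sequence
        END { for (i = 1; i <= n && i <= count; i++) printf "%s", rec[i] }
    ' "$1"
}
```

For instance, `sample_fasta unmapped_mate1.fa 30 > sample.fa` (placeholder file names) yields a small file that can be pasted into a web BLAST search.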

ADD REPLY • link 7.3 years ago by GenoMax 141k