Bowtie indexing of a fasta file that consists of a large amount of sequences
0
0
Entering edit mode
7.5 years ago
valerie ▴ 100

Hi

I have a fasta file with 44396570 sequences and I want to find them in my genome. I am using bowtie2-build to build index files for my fasta file, but I get an error "killed". I guess it's because I need more memory. I have 32gb on my machine. I tried to split my file in 2 and 4 parts but it didn't solve my problem. I also made sure that bowtie2-build works on one single sequence from my fasta file and everything was ok, but bowtie2 alignment of my whole genome on this single sequence took a huge amount of time and I am worried that splitting my fasta file into more parts will take years to perform the alignment.

Maybe someone already solved this problem, when you want to find a lot of sequences in your genome?

I would be grateful if you could help me.

bowtie alignment genome • 4.5k views
ADD COMMENT
1
Entering edit mode

Normal procedure is to use your sequences and search against the genome, is there a reason you want to do this the other way round? Are your sequences short reads or long(er)? How large is the "genome" in this case?

Many NGS aligners are not designed to search with very long sequences (say > 1kb, with a few exceptions, e.g. blasr from PacBio). BBMap contains mapPacBio.sh which can map long reads (in 6kb chunks) and bwa also works with PacBio/Nanopore reads.

ADD REPLY
0
Entering edit mode

I have experimental WGS data and I am interested in presence or absence of my sequences in genomes of these exact patients, not in reference genome. My sequences are longer then reads. They are different length, but something like gene length, not shorter.

ADD REPLY
1
Entering edit mode

Have you thought about using blat/blast/lastz? Depending on how similar you expect your sequences to be when compared the "genome" you could go from "near identical to more diverse" in rough order from left to right for the programs above.

ADD REPLY
0
Entering edit mode

Did not think about blat, thank you! I expect the sequences to be identical so I think it should work. May I also ask you one more question? My genomes are in fastq format, should I align them on human genome or convert somehow into fasta to use blat? Thank you in advance!

ADD REPLY
1
Entering edit mode

I now understand what you are trying to do.

So your "genomes" are actually short reads which you want to align against some other longer sequences. You could convert them to fasta and then use one of the programs above but that may still be a long endeavor since you have many (milions?) of those reads.

What is the total size of your long reads (in terms of file size/mega-gigabases of sequence)? You could use bbmap.sh from BBMap suite to align your genome reads (in original fastq format) to the other dataset. This may be the best option in terms of getting this done rapidly.

ADD REPLY
0
Entering edit mode

The size of the "genomes" is ~200gb for each of the 2 paired-end files. The size of the fasta file with my sequences I want to find is 24gb. Is it ok for BBMap?

ADD REPLY
1
Entering edit mode

Now things are coming into focus :)

24G of "reference" would not be good for most aligners with the amount of RAM you have available. If you had plenty of RAM then this would be no problem.

How about using "divide and conquer"? How many sequences are there in the "fasta" file? You could divide them in 8 x 3Gb chunks (which should be about the size of human genome) and then do 8 independent searches.

ADD REPLY
0
Entering edit mode

Haha :) There are 44396570 sequences in it. I tried to split into 4 parts, failed, got upset and came here :) I can definitely try to split the fasta file into 8 parts. Do you think I should use BBMap or Bowtie?

Thank you very much for your help! You are my hero :)

ADD REPLY
1
Entering edit mode

Before we celebrate try one 3 Gb chunk out first. I am partial to bbmap but either aligner should work at that size :) Keep an eye on free disk space and add more if you can, just to be proactive. With bbmap you can directly create bam files as long as you have samtools available in $PATH.

At 200G each (R1/R2) you must have a boatload of reads in there (billions?).

ADD REPLY
0
Entering edit mode

Ok, I'll do that now! Anyway, thank you very much!

ADD REPLY
0
Entering edit mode

Yes. Originally I had .bam files provided by the guys who did the experiment. Each file was ~150G and I converted each .bam file into two .fastq files with bedtools bam2fastq. And now I have two 200G for each genome.

ADD REPLY
0
Entering edit mode

Tada! I started with bowtie and bowtie-build part is completed without any issues, now I'm waiting for the alignment! Thank you very much!! You definitely are my hero! :)

ADD REPLY
0
Entering edit mode

Glad to hear that appears to be working.

BTW: You don't need to up vote each one of my posts. Only one (or some) that helped provide a solution/direction would be sufficient.

ADD REPLY
0
Entering edit mode

Wow. Didn't know this feature, thank you for the information. How do people solve these issues then? I thought that standard procedure is to align the WGS data on whole chromosomes. Maybe I could concat all my sequences into one sequence, like genes on the chromosome and align everything on it and then search whether reads mapped each of the parts of the sequence? Or is this crazy? :) Sorry for stupid questions, I am doing this for the first time

ADD REPLY
1
Entering edit mode

Please see the edited post above since I made some changes there.

What is the size distribution of your reads and are the "genomes" you are searching against in form of chromosomes or contigs or something else?

ADD REPLY
0
Entering edit mode

My genomes are two fastq files with paired-ended reads for each of the genomes. The size of reads is 100. I am searching against sequences that are actually specific combinations of exons of human genes. This is why their length is similar to length of genes. Each combination is a different sequence and there are thousands of these combinations.

ADD REPLY
1
Entering edit mode

I am surprised that you were not able to build the index with bowtie2-build if your sequences are a "transcriptome". Did the job get killed right away or it ran for sometime before it died? I wonder if you ran out of disk storage space rather then memory (32G should be enough, if all of it is available for the OS).

ADD REPLY
0
Entering edit mode

I have 200 gb on my disk and can add more. It died while doing something with the "buckets", unfortunately I didn't save the output and do not remember exactly when it died. It worked for something like half an hour before died.

ADD REPLY

Login before adding your answer.

Traffic: 2539 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6