Collapse Repeated Reads
12.7 years ago by Cdiez ▴ 150

Hello,

I'm analyzing an siRNA library. After receiving the data from the sequencer, I ran Cutadapt to trim the adapter sequences and filter the reads by quality. The next step before mapping is to collapse the repeated reads. I tried fastx_collapser from the FASTX-Toolkit, but it doesn't work because the file I want to collapse is apparently too large (6.9 GB). Does anyone know of another program that can collapse files this big?

Thanks a lot!

illumina fastq short

Comment:
I would run your analysis both ways: collapsed and uncollapsed. The chances of seeing two identical short RNAs that are biologically real are much higher than the chances of seeing two identical reads in a coding RNA-seq run.

Answer • 12.7 years ago by lh3 33k

The best way is Unix sort.

awk 'NR%4==2' in.fastq | sort -S1900G | uniq -c > output.txt

I have done similar things with files larger than 100 GB; 6.9 GB should be done in minutes. Actually, for a file as small as 6.9 GB, the fastest way is to load all the data into memory and do a quicksort there. That should not take more than 4 GB of memory, and if you do not care about reads containing "N", 2 GB is enough.
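
For anyone not fluent in awk: 'NR%4==2' keeps only the second line of every four-line FASTQ record (the sequence line), and -S sets the size of the in-memory buffer that sort is allowed to use. Below is a minimal Python sketch of the same load-into-memory idea, using a hash-based counter instead of an explicit sort; the file names are just placeholders.

from collections import Counter

# Count every distinct sequence in a plain (uncompressed) FASTQ file.
counts = Counter()
with open("in.fastq") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:              # sequence line of each 4-line FASTQ record
            counts[line.rstrip("\n")] += 1

# Write unique sequences with their occurrence counts, most frequent first.
with open("output.txt", "w") as out:
    for seq, n in counts.most_common():
        out.write(str(n) + "\t" + seq + "\n")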

Comment:
Sorry, but I'm not familiar with Unix (I'm just starting to analyze this kind of data) and I don't understand what 'NR%4==2' and -S1900G mean in that command line. It's probably a silly question, but I'd be very grateful if you could add some comments about it. Thanks in advance!

Comment:
At this point output.txt contains each unique sequence and the count of its occurrences; you'll still need some kind of lookup to pull out the full read with the header and quality information.
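
If a collapsed FASTA (sequence plus count, qualities discarded) is all that is needed downstream, the uniq -c output can be converted directly. A small sketch, assuming output.txt holds count/sequence pairs as produced by the pipeline above, and writing headers in a rank-count style similar to what fastx_collapser produces; file names are placeholders.

# Convert "count sequence" pairs into a collapsed FASTA.
records = []
with open("output.txt") as fh:
    for line in fh:
        if not line.strip():
            continue
        count, seq = line.split()   # uniq -c prints the count before the sequence
        records.append((int(count), seq))

records.sort(reverse=True)          # most abundant sequences first

with open("collapsed.fa", "w") as out:
    for rank, (count, seq) in enumerate(records, start=1):
        out.write(">" + str(rank) + "-" + str(count) + "\n" + seq + "\n")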

Comment:
My guess is that when we choose to collapse reads, we deliberately disregard the qualities and read names; otherwise we would not want to collapse in the first place. That said, I do not do ChIP-seq or miRNA-seq myself, so I could be wrong.

Answer • 12.7 years ago by brentp 24k

A while ago, I wrote up one way to do this using a Bloom filter.

The code is essentially:

  • iterate over the reads and put them in a Bloom filter.
  • if a subsequent read is already in the filter, add it to a Python set.
  • that set holds the putative duplicates (a Bloom filter can give false positives).
  • iterate over the reads again and check them against the set (the sketch after this list shows the idea).
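
A rough sketch of that two-pass idea follows. The BloomFilter class here is a stand-in for whichever implementation you install (bloomfaster's real constructor and API may differ), and the capacity, error rate, and file name are illustrative examples only.

from collections import Counter
# Hypothetical import: substitute the Bloom filter package you actually install.
from bloom_library import BloomFilter

def sequences(path):
    """Yield the sequence line of each 4-line FASTQ record."""
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:
                yield line.rstrip("\n")

# Sized for roughly 200M reads at a 0.1% false-positive rate (example numbers).
bloom = BloomFilter(capacity=200_000_000, error_rate=0.001)

# Pass 1: sequences the filter claims to have seen before are putative duplicates.
maybe_dups = set()
for seq in sequences("in.fastq"):
    if seq in bloom:
        maybe_dups.add(seq)
    else:
        bloom.add(seq)

# Pass 2: exact counts for the putative duplicates; everything else occurs once.
dup_counts = Counter(s for s in sequences("in.fastq") if s in maybe_dups)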

The Python package I used for that is here. You can get a tarball of it here. After you untar it, you can run

$ sudo python setup.py install

and it will put bloomfilter on your Python path. (You may need to have Cython installed.)

That will allow you to use a lot less memory than storing the reads in a hash. You could also have a look at khmer, which has a few scripts that might get you started.

Comment:
Thanks a lot for your answer! I've tried to run your script, but it seems I need to install some extra Python packages (bloomfaster) to run it. Could you briefly indicate which packages these are and how to install them so the script works? Thanks in advance!

Comment:
cdiez, I have updated the answer to include a bit on where to get bloomfaster and how to install it. Hope that helps.

Comment:
Thanks, I'm very grateful! Your new comments are really useful!
