Question

Screen reads based on certain sequences. Write code myself or designated tools.

0

Entering edit mode

7.3 years ago

omer.k ▴ 110

Hi Guys, I intend on acquiring transcripts (cDNA) from certain human-cell line culture. I'd be acquiring all the reads using the MinION. The reads are expected to be 2-10kb (don't know yet the distribution).

From all the reads I obtain, I'd like to move forward to analysis with only certain read:

The reads which will have a certain known tagging sequence (45 bases on the 3' end).
The reads which will have and expected, known, open read frame (~2.5kb).

Obviously, I'd have to optimize the balance between sensitivity and specificity when targeting these reads.

Is there a tool designed for these specific types of screening? If not, would you suggest to write a code from scratch or revise an already somewhat-similar code?

I'm a novice to UNIX but have knowledge in coding (MATLAB, C), so keep that in mind please when answering.

Thanks.

sequencing sequence next-gen • 1.5k views

ADD COMMENT • link updated 7.0 years ago by Biostar 20 • written 7.3 years ago by omer.k ▴ 110

0

Entering edit mode

It's probably important to note that the data is noisy and therefore exact matching of the known tagging sequence won't give you the best results. You'll probably need to do some fuzzy matching of your "recognition sequence".

Since you are looking for cDNA sequences I'm not sure why the second requirement would be an issue?

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 1 · Answer 1 · 2017-01-11

I typically use BLAT or cross_match for barcode detection. For ORFs on reads that long, why not just align to the transcriptome fasta? Or, given the read length, align the known ORFs to the reads? ORF prediction on every read will kill your runtime. And the error rate of long reads will kill your ORF detection.

score 0 · Answer 2 · 2017-01-11

0

Entering edit mode

7.3 years ago

omer.k ▴ 110

You'll probably need to do some fuzzy matching of your "recognition sequence".

Yes, that's what I meant by

Obviously, I'd have to optimize the balance between sensitivity and specificity when targeting these reads.

Regarding you second comment

Since you are looking for cDNA sequences I'm not sure why the second requirement would be an issue?

The presence of the ORF is suggestive of having full transcript. It could be the ORF or another ~45 bases on the 5' end. Either way, still we'd like to screen based on these two sequences.

ADD COMMENT • link 7.3 years ago by omer.k ▴ 110

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to earlier posts, as such this thread remains logically structured and easy to follow.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 0 · Answer 3 · 2017-01-11

0

Entering edit mode

7.3 years ago

omer.k ▴ 110

Hi, apa@stowers, thanks for commenting.

I typically use BLAT or cross_match for barcode detection.

Admittedly I'm not familiar with cross_match. I'll read of it.

Or, given the read length, align the transcriptome CDS fasta to the reads?

Did you offer this try and "thin down" to runtime? Not sure if this is what you meant. Either way the coding region is what I refereed to as the ORF and as I mentioned, ~2.5kb. Which tool would you suggest for alignment of the transcriptome CDS to the reads?

I could also unwillingly give up on this 2.5kb read and just suffice with the two 45-bases long tag, on the 5' and 3' ends of the reads.

ADD COMMENT • link 7.3 years ago by omer.k ▴ 110

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to earlier posts, as such this thread remains logically structured and easy to follow.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Actually your ORF-ends-only approach would certainly be faster. I'm not sure what would be the fastest aligner for very long reads. If it were me, I might try STAR first?

ADD REPLY • link 7.3 years ago by apa@stowers ▴ 600

0

Entering edit mode

We have had good results with GMAP for nanopore cDNA reads.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 0 · Answer 4 · 2017-01-12

I suggest BBDuk for barcode detection. It works by matching kmers, so the efficiency is related to the error rate of your reads. But, it can allow an edit distance, and besides that, Nanopore tends to have bursty errors that are correct for a while (hopefully long enough for a kmer) then wrong for a while, rather than random errors like PacBio.

A command like this:

bbduk.sh in=reads.fq outm=matches.fq literal=ACTG.....GACTG k=19 edist=1 restrictleft=50

...will catch all the reads that match any 19-mer in your literal, allowing 1 edit, looking for kmers only in the first 50bp (use restrictright=50 for the last 50bp). To demultiplex multiple barcodes at once you can use Seal which has similar functionality but produces multiple output files, one per barcode (if you supply a fasta file of barcodes).