4.2 years ago
adeline
Does anyone have, or know where I can find, a MinION ONT 1D RNA-seq locally basecalled .fast5 dataset?
I'm working on developing a pipeline for differential expression and I'm starting with .fast5 files. I've gone through the SRA and EMBL-EBI databases. So far, I've only found fastq files or fast5 signal data without the basecalled fastq (https://github.com/nanopore-wgs-consortium/NA12878/blob/master/RNA.md, https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=ERR2721954).
So, if anyone has such a dataset and is willing to share it, or knows where I can find one, it would be very helpful.
So you are looking for fast5 files where the basecalls are inside the fast5 format? Those are increasingly rare. Most commonly people will just create the fastq files and try to avoid using the fast5 as much as possible.
Why would you start from fast5 format? What is the added value for your pipeline?
Since you wrote that you found fast5 files without basecalls, perhaps you could basecall these then?
Thank you very much for your reply.
Yes, I'm looking for .fast5 containing fastq.
Good question. I asked myself the same question. From what I read, the MinION generates fastq files that are above quality threshold 7 (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0216471). I'm using NanoR which also allows the user to extract fastq files with a specified threshold. So I thought this would give the user some flexibility while being simple to perform. I thought it would be good to start with QC, followed by extraction of fastq files according to the desired threshold.
"Since you wrote that you found fast5 files without basecalls, perhaps you could basecall these then?"
Oh, yes, this could work. If the output is fast5, it would be perfect. I'll take a look at the basecallers. Thank you!
That's not accurate. The MinION will generate reads regardless of any quality threshold, but during basecalling it is possible to split the reads into a pass or a fail folder. The arbitrary cut-off of Q=7 is often used for this, but that's absolutely not guaranteed to be the case. We never filter on quality, but that may depend on your application.
That's not wrong at all, but it would be better if you would also allow users to start from fastq files. You can also calculate the quality from a fastq read and filter from that.
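To make the fastq-based filtering concrete, here is a minimal sketch in plain Python. It assumes Phred+33 encoded qualities and a strict four-lines-per-record fastq, and it defines read quality by averaging the per-base error probabilities rather than the raw Phred scores, which is one possible convention and not necessarily what your tool of choice does. The function names (`mean_read_quality`, `read_fastq`, `filter_by_quality`) and the example reads are hypothetical, and the threshold of 7 is just the cut-off discussed in this thread.

```python
import io
import math


def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality of one read, averaged over error probabilities.

    Assumes Phred+33 encoding (offset=33).
    """
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record fastq handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (a blank line also stops the parser)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def filter_by_quality(handle, min_quality=7.0):
    """Yield only the records whose mean read quality is >= min_quality."""
    for header, sequence, quality in read_fastq(handle):
        if mean_read_quality(quality) >= min_quality:
            yield header, sequence, quality


# Two made-up reads: one with Q40 bases ('I'), one with Q3 bases ('$').
example = io.StringIO("@read_high\nACGT\n+\nIIII\n@read_low\nACGT\n+\n$$$$\n")
kept = [header for header, _, _ in filter_by_quality(example, min_quality=7.0)]
print(kept)
```

With a real file you would pass `open("reads.fastq")` (a placeholder path) instead of the `io.StringIO` example. Because the threshold is an ordinary argument, users keep the flexibility of choosing their own cut-off without needing the fast5 files.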
My advice would be to not use the fast5 format if you are not using any information that is not present in the fastq anyway.
Thank you very much for your clarification! I'm new to this and have not been directly involved with the data generation process. Thank you also for your advice. I have been thinking about it and I will look into starting with fastq files instead.