Hello,
I am trying to download a single-cell RNAseq run from NCBI. The run accession is SRR7898910.
I tried to download the data the NCBI-recommended way: fetching the SRA file with prefetch and then running the following command in the terminal:
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-2 --clip SRR7898910
My issue is that this creates a single 16GB fastq file, so I assume it holds the sequences of all the cells. Does anyone know how to get 2 fastq files (forward and reverse) for each cell in the study? That seems like the logical first step, as I want to realign them and analyse them in-house. If that is not possible, a bam file for each cell would also work for counting.
Thanks for any recommendations you may have. Best wishes
Are you sure there are two reads? NCBI record seems to indicate there is only one read. Of course that information has been wrong at times.
ENA seems to offer only the BAM file submitted by the submitters.
It appears to be a uBAM (unaligned BAM). You can download that and convert it to fastq(s) with samtools fastq. Note: This BAM file is ~33G.
Edit: It looks like the BAM file has read groups (RG). There is more than one lane of data. Not sure if there are paired-end reads.
10xGenomics data means that one of the pairs has only the cell barcode and UMI on it. The bam you excerpted above has that information encoded in it in the BC and CR tags. The bam above is aligned. It's got chromosome and position entries.
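To make the point about tags concrete, here is a minimal sketch of pulling the optional tags out of one SAM text record (the kind of line samtools view prints). The record below is made up for illustration, and the UR tag for the raw UMI is an assumption based on typical Cell Ranger output; check which tags your file actually carries before relying on it.

```python
def parse_sam_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # columns 12 and onward hold the optional tags
        # Split at most twice so values that contain ':' stay intact.
        tag, _value_type, value = field.split(":", 2)
        tags[tag] = value
    return tags


# Hypothetical record, not taken from the real file.
example = "\t".join([
    "read1", "0", "chr1", "100", "255", "8M", "*", "0", "0",
    "ACGTACGT", "FFFFFFFF",
    "CR:Z:AAACCTGAGAAACCAT", "UR:Z:GTTACGGATC",
])

tags = parse_sam_tags(example)
print(tags["CR"], tags["UR"])
```

If the file also has a CB tag, that is usually the error-corrected barcode and is probably the better key for grouping reads by cell.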
Thank you for your help, I will have a look at downloading the uBam file and see what the samtools fastq function does to it.
Hello,
I seem not to have the fastq command in my samtools for some reason, even though I'm on the latest version. I've run samtools bam2fq on the bam file and it only generated 1 fastq file. Shouldn't I have two per cell?
bam2fq is not going to be smart enough to understand that the information in the tags should be combined to make a fastq. You are going to have to parse the fastq yourself if you really want to separate that information out. Are you totally sure that you want to go back to the original fastqs?
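As a rough sketch of what "combining the tags to make a fastq" could look like: the function below rebuilds a barcode read and a cDNA read from one SAM text record. It assumes the usual 10x layout (barcode read = cell barcode followed by UMI) and that the record carries CR/CY (raw barcode and its qualities) and UR/UY (raw UMI and its qualities); those tag names and the function name are assumptions, so verify them against your file. Reads aligned to the reverse strand are stored reverse-complemented, so they are flipped back.

```python
# Translation table used to complement a DNA sequence.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_fastq_pair(sam_line):
    """Rebuild (barcode_read, cdna_read) FASTQ records from one SAM line.

    Assumes CR/CY and UR/UY tags are present (hypothetical for your data).
    """
    fields = sam_line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]
    tags = {f[:2]: f.split(":", 2)[2] for f in fields[11:]}

    # FLAG bit 0x10 means SEQ is stored reverse-complemented; undo that.
    if flag & 0x10:
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]

    barcode_read = f"@{name}\n{tags['CR']}{tags['UR']}\n+\n{tags['CY']}{tags['UY']}\n"
    cdna_read = f"@{name}\n{seq}\n+\n{qual}\n"
    return barcode_read, cdna_read


# Hypothetical reverse-strand record for illustration.
example = "\t".join([
    "read1", "16", "chr1", "100", "255", "4M", "*", "0", "0",
    "AACG", "ABCD",
    "CR:Z:AAAC", "CY:Z:FFFF", "UR:Z:GG", "UY:Z:II",
])
r1, r2 = sam_to_fastq_pair(example)
print(r1 + r2, end="")
```

This only illustrates the idea on a single record; for a 33G file you would stream samtools view output through it line by line rather than load anything into memory.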
Realistically I don't mind either way. All I want to do is realign the data to mm10 and generate a count file that has every cell in the experiment, so I can pass it into Seurat etc.
Would you happen to know of a way around this? My biggest question is why all the cells are smushed into 1 file. It looks like the most unhelpful way you can deposit data!
You can try contacting the authors to see if they would be willing to give you data in a different format. Otherwise you are going to have to parse the BC/CR tags to split the bam file. Are you sure the aligned data is not already using mm10?
You have cell barcodes. You have SAM flags, and there should be bam tags indicating gene IDs. You can parse the file you have.
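If the existing alignment turns out to be usable, counting straight from the tags might look something like the sketch below. It counts distinct UMIs per (cell barcode, gene) from SAM text lines. The defaults are assumptions: CR for the barcode (as mentioned above), UR for the UMI, and GX for the gene ID, which is what Cell Ranger typically writes. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def count_umis(sam_lines, cell_tag="CR", umi_tag="UR", gene_tag="GX"):
    """Count distinct UMIs per (cell barcode, gene ID) from SAM text lines."""
    umis = defaultdict(set)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x104:  # skip unmapped (0x4) and secondary (0x100) records
            continue
        tags = {f[:2]: f.split(":", 2)[2] for f in fields[11:]}
        if not all(t in tags for t in (cell_tag, umi_tag, gene_tag)):
            continue  # read has no barcode, UMI, or gene assignment
        umis[(tags[cell_tag], tags[gene_tag])].add(tags[umi_tag])
    return {key: len(values) for key, values in umis.items()}


def _record(name, flag, cell, umi, gene):
    """Build a toy SAM line (illustration only)."""
    return "\t".join([name, str(flag), "chr1", "100", "255", "4M", "*", "0", "0",
                      "ACGT", "FFFF", f"CR:Z:{cell}", f"UR:Z:{umi}", f"GX:Z:{gene}"])


toy = [
    "@HD\tVN:1.6",
    _record("r1", 0, "AAAC", "GG", "GeneA"),
    _record("r2", 0, "AAAC", "GG", "GeneA"),    # same UMI, counted once
    _record("r3", 16, "AAAC", "TT", "GeneA"),
    _record("r4", 0, "TTTG", "GG", "GeneA"),
    _record("r5", 256, "TTTG", "CC", "GeneA"),  # secondary, skipped
]
counts = count_umis(toy)
print(counts)
```

On real data you could feed it sys.stdin with the output of samtools view piped in. Note this sketch does no barcode or UMI error correction and treats a multi-gene tag value as a single key, so a dedicated counting tool will give cleaner numbers.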