Question

Downloading single cell data from NCBI

5.3 years ago

V ▴ 380

Hello,

I am trying to download a single-cell RNA-seq run from NCBI. It is SRR7898910, also found here

I tried to download the data the way NCBI recommends: fetching the SRA file with prefetch and then running the following command in the terminal:

fastq-dump --outdir fastq --gzip --skip-technical  --readids --read-filter pass --dumpbase --split-2 --clip SRR7898910

My issue is that this creates a single 16 GB fastq file, so I assume it holds the sequences of all the cells together. Does anyone know how to get two fastq files (forward and reverse) for each cell in the study? That seems like the logical first step, as I want to realign and analyse them in house. If that is not possible, a BAM file for each cell would do for counting.

Thanks for any recommendations you may have. Best wishes

ncbi rna-seq fastq-dump single-cell • 4.4k views

ADD COMMENT • link updated 8 months ago by Ram 43k • written 5.3 years ago by V ▴ 380

Are you sure there are two reads? The NCBI record seems to indicate there is only one read, although that information has been wrong at times.

ENA seems to offer only the BAM file submitted by the authors. ~~It appears to be a uBAM (unaligned BAM). You can download it and convert it to fastq(s) with samtools fastq.~~ Note: this BAM file is ~33 GB.

Edit: It looks like the BAM file has read groups (RG tag), so there is more than one lane of data. Not sure if there are paired-end reads.

K00135:310:HW73WBBXX:1:1127:22698:20111 16      1       3201566 255     115M    *       0       0       CTTTGTGTCTGTGTCTTTCATTTGCCTATGAAAAGAATGTTAGTTGGCTGTAACCATAAAATTGGCAGTTGTATTTACAAATAATCACATATATGATGCTTACAGAATGATGGGC        AJFFA<7-77JJJFJFFJJJJAFJAJAJJJJJJJJJJFJJAAJA-JFJJJJJFFJFAF<<AF<JJJJJJFFFFF-<JFFJJ<FJFJJJFJFF7JAAJJFJ<JJJJJJJJJFFFAA     NH:i:1  HI:i:1  AS:i:113        nM:i:0  RE:A:I     BC:Z:CTCGCGTA   QT:Z:AAFAFJJJ   CR:Z:TGAAAGATCCATGAAC   CY:Z:AAFFFJJJJJJJJJJJ   CB:Z:TGAAAGATCCATGAAC-1 UR:Z:CCGTCCTATG UY:Z:JJJJJJJJJJ UB:Z:CCGTCCTATG RG:Z:7wo_B6N_males_Pool2_mm10v7b:MissingLibrary:1:HW73WBBXX:1
K00135:310:HW73WBBXX:2:2110:8907:7257   16      1       3201567 255     115M    *       0       0       TTTGTGTCTGTGTCTTTCATTTGCCTATGAAAAGAATGTTAGTTGGCTGTAACCATAAAATTGGCAGTTGTATTTACAAATAATCACATATATGATGCTTACAGAATGATGGGCT        AA7F<A77-A-FFFA-F<F7--AA<AJ<<FJJA<JF-FF<JF7JAF<FFAA-7F<FF-FAJ<<J<FFA<AJF<JJJJJJJJJA<<FF<<AFA-FAAF<-7FAJJJF7777FF<<A     NH:i:1  HI:i:1  AS:i:113        nM:i:0  RE:A:I     BC:Z:AAATGTGC   QT:Z:AAAA-F-F   CR:Z:TGCTACCAGTACGATA   CY:Z:AAFFFJJJJFJJJFJF   CB:Z:TGCTACCAGTACGATA-1 UR:Z:GACCCTCAGG UY:Z:FFJJJJJJFJ UB:Z:GACCCTCAGG RG:Z:7wo_B6N_males_Pool2_mm10v7b:MissingLibrary:1:HW73WBBXX:2
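
One way to settle the paired-end question is to decode the FLAG column: bit 0x1 marks a read as paired, 0x4 as unmapped, and 0x10 as reverse-strand. Below is a minimal Python sketch, assuming tab-delimited SAM text (for example, the output of samtools view). The function names are made up for illustration, and the example record is a shortened stand-in for the first record above.

```python
# Sketch: decode the FLAG and optional tags of one SAM text line.
# Column positions follow the SAM format; helper names are illustrative only.

def parse_sam_line(line):
    """Return (flag, tags) where tags maps each two-letter tag to its value."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    tags = {}
    for item in fields[11:]:
        # Optional fields look like TAG:TYPE:VALUE; the value may contain colons.
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return flag, tags

def describe_flag(flag):
    """Report the three FLAG bits that matter for this question."""
    return {
        "paired": bool(flag & 0x1),
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
    }

# Shortened stand-in for the first record (sequence and qualities truncated).
example = "\t".join([
    "K00135:310:HW73WBBXX:1:1127:22698:20111", "16", "1", "3201566", "255",
    "4M", "*", "0", "0", "CTTT", "AJFF",
    "CB:Z:TGAAAGATCCATGAAC-1", "UB:Z:CCGTCCTATG",
    "RG:Z:7wo_B6N_males_Pool2_mm10v7b:MissingLibrary:1:HW73WBBXX:1",
])

flag, tags = parse_sam_line(example)
print(describe_flag(flag))
print(tags["RG"])
```

A FLAG of 16 has only the reverse-strand bit set, so these two records are stored as single, mapped reads rather than as mates of a pair.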

ADD REPLY • link 5.3 years ago by GenoMax 141k

10x Genomics data means that one read of each pair carries only the cell barcode and UMI. The BAM you excerpted above has that information encoded in its tags (CR/CB for the cell barcode, UR/UB for the UMI; BC is the sample index). The BAM above is also aligned: it has chromosome and position entries.
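
A minimal Python sketch of both observations, assuming tab-delimited SAM text and the Cell Ranger tag convention (CR/CB for the raw/corrected cell barcode, UR/UB for the raw/corrected UMI). The function names are hypothetical and the record is a shortened stand-in for the excerpt.

```python
# Sketch: check that a record is aligned and pull out its cell barcode and UMI.
# Tag meanings assume Cell Ranger output; function names are illustrative only.

def parse_tags(fields):
    """Map each optional TAG:TYPE:VALUE field to {TAG: VALUE}."""
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return tags

def is_aligned(line):
    """True when the unmapped bit (0x4) is clear and a reference name is present."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return not (flag & 0x4) and fields[2] != "*"

def cell_barcode_and_umi(line):
    """Prefer the corrected tags (CB, UB) and fall back to the raw ones (CR, UR)."""
    tags = parse_tags(line.rstrip("\n").split("\t"))
    cell = tags.get("CB", tags.get("CR"))
    umi = tags.get("UB", tags.get("UR"))
    return cell, umi

# Shortened stand-in for the first record in the excerpt.
record = "\t".join([
    "read1", "16", "1", "3201566", "255", "4M", "*", "0", "0", "CTTT", "AJFF",
    "CR:Z:TGAAAGATCCATGAAC", "CB:Z:TGAAAGATCCATGAAC-1",
    "UR:Z:CCGTCCTATG", "UB:Z:CCGTCCTATG",
])

print(is_aligned(record))
print(cell_barcode_and_umi(record))
```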

ADD REPLY • link 5.3 years ago by swbarnes2 14k

Thank you for your help. I will look at downloading the uBAM file and see what samtools fastq does to it.

ADD REPLY • link 5.3 years ago by V ▴ 380

Hello,

I seem not to have the fastq command in my samtools, even though I'm on the latest version. I've run samtools bam2fq on the BAM file and it generated only one fastq file. Shouldn't I have two per cell?

ADD REPLY • link 5.3 years ago by V ▴ 380

bam2fq is not going to be smart enough to understand that the information in the tags should be combined to make a fastq. You are going to have to parse the fastq yourself if you really want to separate that information out. Are you totally sure that you want to go back to the original fastqs?
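
To illustrate what combining the tags would involve, here is a rough Python sketch. It assumes a 10x-style layout in which read 1 is the raw cell barcode (CR, qualities CY) followed by the raw UMI (UR, qualities UY), and read 2 is the cDNA sequence stored in the record. The function names are made up, and a real conversion would also need to skip secondary and supplementary alignments so that each read is emitted only once.

```python
# Sketch: rebuild a barcode+UMI read (R1) and a cDNA read (R2) from one SAM line.
# Assumes 10x-style tags CR/CY/UR/UY; names are illustrative, not a real tool's API.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def sam_to_fastq_pair(line):
    """Return (r1, r2) FASTQ records, or None if the barcode/UMI tags are missing."""
    fields = line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    if not all(key in tags for key in ("CR", "CY", "UR", "UY")):
        return None
    # Reverse-strand alignments store the reverse complement of the original read,
    # so undo that to recover the sequence as it came off the sequencer.
    if flag & 0x10:
        seq, qual = reverse_complement(seq), qual[::-1]
    r1 = f"@{name}\n{tags['CR']}{tags['UR']}\n+\n{tags['CY']}{tags['UY']}\n"
    r2 = f"@{name}\n{seq}\n+\n{qual}\n"
    return r1, r2

# Shortened stand-in record on the reverse strand (FLAG 16).
record = "\t".join([
    "read1", "16", "1", "3201566", "255", "5M", "*", "0", "0", "CTTTG", "AJFFA",
    "CR:Z:TGAAAGATCCATGAAC", "CY:Z:AAFFFJJJJJJJJJJJ",
    "UR:Z:CCGTCCTATG", "UY:Z:JJJJJJJJJJ",
])

r1, r2 = sam_to_fastq_pair(record)
print(r1, r2, sep="")
```

10x Genomics also distributes a bamtofastq utility intended for BAM files produced by its pipelines, which may be a more robust route than hand-written parsing.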

ADD REPLY • link 5.3 years ago by swbarnes2 14k

Realistically I don't mind either way. What I want to do is realign the data to mm10 and generate a count file that covers every cell in the experiment, so I can pass it into Seurat etc.

Would you happen to know of a way around this? My biggest question is why all the cells are smushed into one file. It looks like the most unhelpful way to deposit data!

ADD REPLY • link 5.3 years ago by V ▴ 380

You can try contacting the authors to see if they would be willing to give you the data in a different format. Otherwise you are going to have to parse the cell barcode tags (CR/CB) to split the BAM file. Are you sure the aligned data is not already using mm10?
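
Splitting by barcode could look roughly like the Python sketch below, which groups SAM text lines by their CB tag (the corrected cell barcode in Cell Ranger output). The function name and the example records are invented for illustration.

```python
# Sketch: group SAM records by cell barcode tag.
# Assumes tab-delimited SAM text; records without the tag are skipped.
from collections import defaultdict

def group_by_cell(sam_lines, tag="CB"):
    """Return {barcode: [lines]} for every record carrying the given Z-type tag."""
    groups = defaultdict(list)
    prefix = tag + ":Z:"
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no records
            continue
        for item in line.rstrip("\n").split("\t")[11:]:
            if item.startswith(prefix):
                groups[item[len(prefix):]].append(line)
                break
    return dict(groups)

def _record(name, barcode_tag=None):
    """Build a tiny made-up SAM line for the demonstration."""
    fields = [name, "0", "1", "100", "255", "4M", "*", "0", "0", "ACGT", "FFFF"]
    if barcode_tag:
        fields.append(barcode_tag)
    return "\t".join(fields)

lines = [
    "@HD\tVN:1.4",
    _record("r1", "CB:Z:TGAAAGATCCATGAAC-1"),
    _record("r2", "CB:Z:TGCTACCAGTACGATA-1"),
    _record("r3", "CB:Z:TGAAAGATCCATGAAC-1"),
    _record("r4"),  # no corrected barcode, so it is dropped
]

cells = group_by_cell(lines)
for barcode, records in cells.items():
    print(barcode, len(records))
```

For a file of ~33 GB, holding every record in memory is not realistic; a real run would stream the records and append each one to a per-barcode output instead.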

ADD REPLY • link 5.3 years ago by GenoMax 141k

You have cell barcodes and SAM flags, and there should be BAM tags indicating gene IDs. You can parse the file you have.

ADD REPLY • link 5.3 years ago by swbarnes2 14k