Analyzing digital gene expression (DGE) data from the Drop-seq pipeline with Seurat
14 months ago
chilifan • 70

I am using Seurat to cluster data that has previously been filtered, aligned and turned into a DGE matrix by the Drop-seq alignment pipeline from Drop-seq tools. This created the file sample_DGE.txt.gz. I now want to cluster my data and do a QC analysis by calculating the percentage of mitochondrial genes. I am following the Seurat clustering tutorial found here: https://satijalab.org/seurat/pbmc3k_tutorial.html

In this tutorial they use the files barcodes.tsv, genes.tsv and matrix.mtx generated by 10x Genomics as raw data, and read them with the command Read10X(). I have generated these three files from our DGE data, inspired by this Biostars page: A: Storing a gene expression matrix in a matrix.mtx. It works fine, except that the row-name header "GENE" is treated as a column name and saved into barcodes.tsv. This becomes a problem later, because Seurat uses "GENE" as one of the cell barcodes when calculating the percent mitochondrial DNA per cell. Example below:

[image not available]

This, of course, makes it impossible to use VlnPlot, which fails with the error:

VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
Error in if(all(data[,feature] == data,feature)) { : missing value where TRUE/FALSE needed

Simply removing "GENE" manually from the barcodes.tsv file causes a dimension error at the Read10X step:

pbmc.data <- Read10X(data.dir = "dir/to/barcode_matrix_and_gene_files")
Error in dimnamesGets(x, value) : invalid dimnames given for "dgTMatrix" object
stop(gettextf("invalid dimnames given for %s object", dQuote(class(x))), domain = NA)
dimnamesGets(x, value)
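One plausible reading of this error is that the barcode list and the matrix no longer agree in size once "GENE" is deleted from barcodes.tsv only. The base R sketch below illustrates that kind of mismatch on a toy matrix. The values, gene names and barcodes are made up, and it assumes (this is not confirmed by the post) that the stray "GENE" entry corresponds to an extra first column in the matrix.

```r
# Toy 2 x 3 count matrix standing in for matrix.mtx (values are made up).
counts <- matrix(1:6, nrow = 2)
genes <- c("ACTB", "MT-ND1")
barcodes <- c("GENE", "CELL_A", "CELL_B")  # stray header plus two real barcodes

# Dropping "GENE" from the barcode list alone leaves 2 names for 3 columns,
# so assigning the dimnames fails.
result <- tryCatch(
  {
    dimnames(counts) <- list(genes, barcodes[-1])
    "ok"
  },
  error = function(e) conditionMessage(e)
)
print(result)

# Dropping the matching column as well keeps names and matrix in step.
# ASSUMPTION: the extra column is the first one.
fixed <- counts[, -1, drop = FALSE]
dimnames(fixed) <- list(genes, barcodes[-1])
print(dim(fixed))
```

If the files are regenerated, the same idea applies: the number of lines in barcodes.tsv has to match the number of columns declared in matrix.mtx.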

So my question is: does anyone know a workaround for this problem? Or is there an equivalent to Read10X(), say ReadDGE() or ReadDropseq(), that can be used directly on my DGE file?


Thank you @Igor, that is the answer to a question I've been pondering for a long time. However, ?CreateSeuratObject uses this example:

pbmc_raw <- read.table(
  file = system.file('extdata', 'pbmc_raw.txt', package = 'Seurat'),
  as.is = TRUE
)
pbmc_small <- CreateSeuratObject(raw.data = pbmc_raw)

which in my case would be:

# Load the PBMC dataset
pbmc.data <- read.table(file=system.file("DGE.txt", package = 'Seurat'), as.is =TRUE)

but it yields the error:

Error in read.table(file = system.file("DGE.txt", package = "Seurat"), : no lines available in input

If you are not sure what a function does, you can check by putting a ? in front of it. For example, ?system.file. That will tell you that system.file takes "character vectors, specifying subdirectory and file(s) within some package". In the example, they are using pbmc_raw.txt from the Seurat package. Your file is not stored in the Seurat package. You should specify the exact path where it is. Using system.file is not needed.
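As a rough illustration of the point above, the sketch below writes a tiny made-up DGE file to a temporary directory and reads it back by its path. The file name, contents and the use of the stats package as a stand-in for Seurat are assumptions chosen so the example runs without Seurat installed.

```r
# Stand-in for a DGE file on disk; the path and contents are made up.
dge_path <- file.path(tempdir(), "sample_DGE.txt.gz")
con <- gzfile(dge_path, "w")
writeLines(c("GENE\tCELL_A\tCELL_B", "ACTB\t5\t3", "MT-ND1\t1\t0"), con)
close(con)

# system.file() only searches inside an installed package, so a file that
# is not shipped with that package resolves to an empty string.
missing_path <- system.file("DGE.txt", package = "stats")
print(missing_path)

# Passing the real path works; read.table() can read gzip-compressed
# files directly.
dge <- read.table(file = dge_path, header = TRUE, row.names = 1, sep = "\t")
print(dge)
```

An empty path would explain the "no lines available in input" message, since read.table then has no file to read from.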


Thank you @Igor, got it! :)

21 months ago
igor 7.7k
United States

If you don't have 10x data, then you don't need to use Read10X(). This is a function to make reading 10x data easier, since it's not stored as a simple CSV/TSV table. There is no need to try to recreate that format. In the same tutorial, you can skip to the next step, which is CreateSeuratObject(), and give it your data matrix. When you create the matrix, make sure it has genes as rows and cells as columns, with the gene names as row names and the cell barcodes as column names.

14 months ago
chilifan • 70

Because I like closure, I will answer my own question. You need to read in the DGE data before creating the Seurat object. You also need to tell read.table where the column and row names are (header = TRUE, row.names = 1) and set the data type for both the row names (character) and the remaining data columns (numeric).

# Read DGE file
pbmc.data <- read.table(file = "DGE.txt", header = TRUE, row.names = 1, colClasses = c("character", rep("numeric", 10000)))

I can hardcode 10000 since I chose to include 10000 cells already in the DGE step.
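If the number of cells is not known in advance, one option is to count the columns from the header line before reading the table. This is a minimal sketch on a made-up three-cell file, not part of the original answer; the file contents and the variable names (dge_path, n_cells) are hypothetical.

```r
# Tiny made-up DGE file: first column is the gene name, the rest are cells.
dge_path <- tempfile(fileext = ".txt")
writeLines(
  c("GENE\tCELL_A\tCELL_B\tCELL_C",
    "MT-ND1\t1\t0\t2",
    "ACTB\t5\t3\t0"),
  dge_path
)

# Read only the header line and count the cell columns (all but "GENE").
header <- scan(dge_path, what = "character", nlines = 1, sep = "\t", quiet = TRUE)
n_cells <- length(header) - 1

# Same read.table call as above, with the column count derived from the file.
dge <- read.table(
  file = dge_path, header = TRUE, row.names = 1, sep = "\t",
  colClasses = c("character", rep("numeric", n_cells)),
  check.names = FALSE
)
print(dim(dge))
print("GENE" %in% colnames(dge))
```

Because the first column is consumed as row names, "GENE" no longer appears among the cell barcodes.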

# Initialize the Seurat object with the raw (non-normalized data).
# Keep all genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at least 200 detected genes
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, project = "10X_PBMC")
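Before moving on to the QC step, it can help to sanity-check the mitochondrial fraction on the plain count matrix. The base R sketch below does this on a made-up matrix; the "^MT-" prefix for mitochondrial genes is an assumption that fits human gene symbols, and the variable names are hypothetical.

```r
# Made-up genes x cells count matrix (rows = genes, columns = cell barcodes).
counts <- matrix(
  c(1, 5, 4,
    0, 3, 7,
    2, 0, 8),
  nrow = 3,
  dimnames = list(c("MT-ND1", "ACTB", "GAPDH"), c("CELL_A", "CELL_B", "CELL_C"))
)

# ASSUMPTION: mitochondrial genes are those whose symbol starts with "MT-".
mito_genes <- grep("^MT-", rownames(counts), value = TRUE)

# Fraction of each cell's counts that come from mitochondrial genes.
percent_mito <- colSums(counts[mito_genes, , drop = FALSE]) / colSums(counts)
print(percent_mito)

# A column with zero total counts would give NaN here (0 / 0), so it is
# worth checking for missing values before plotting.
print(anyNA(percent_mito))
```

A non-cell column such as a stray "GENE" barcode is one way such missing values could creep in, which is worth ruling out if VlnPlot complains about missing values.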

Thanks again @igor for giving me the clue to figure this out! :)
