Question

Tutorial:Generating count matrix for STAR counts in GDC v32.0 for RNA-Seq

3

Entering edit mode

2.1 years ago

DareDevil ★ 4.3k

## Load the required library
library('TCGAbiolinks')
project_name <- "TCGA-ACC"

## Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

## Get metadata matrix
metadata <- query[[1]][[1]]

## Download data using api
GDCdownload(query, method = "api")

## Get main directory where data is stored
main_dir <- file.path("GDCdata", project_name)
## Get file list of downloaded files
file_list <- file.path("GDCdata", project_name,list.files(main_dir,recursive = TRUE)) 

## Read first downloaded to get gene names
test_tab <- read.table(file = file_list[1], sep = '\t', header = TRUE)
## Delete header lines that don't contain usefull information
test_tab <- test_tab[-c(1:4),]
## STAR counts and tpm datasets
tpm_data_frame <- data.frame(test_tab[,1])
count_data_frame <- data.frame(test_tab[,1])

## Append cycle to get the complete matrix
for (i in c(1:length(file_list))) {
  ## Read table
  test_tab <- read.table(file = file_list[i], sep = '\t', header = TRUE)
  ## Delete unwanted lines
  test_tab <- test_tab[-c(1:4),]
  ## Column bind of tpm and counts data
  tpm_data_frame <- cbind(tpm_data_frame, test_tab[,7])
  count_data_frame <- cbind(count_data_frame, test_tab[,4])
  ## Print progres from 0 to 1
  print(i/length(file_list))
}

GDC TCGABiolinks STAR • 2.8k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 2.1 years ago by DareDevil ★ 4.3k

0

Entering edit mode

Thank you for the tutorial. I've looked all over the net and couldn't find how to load TPM data (most forums post how to retrieve count). I have one problem though, after getting tpm_data_frame, the header becomes test_tab[,7] for all columns. How do I retrieve the unique sample ID for each column? Thank you.

ADD REPLY • link 16 months ago by bioinfonewbie • 0

score 0 · Answer 1 · 2023-02-15

Please have a look at this: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/493. I think we can get TPM counts directly from TCGAbiolink.

############### download OV data from TCGA ################
# defines the query to the GDC
query <- GDCquery(project = "TCGA-OV",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")


# download query data using api
GDCdownload(query = query, method = "api")

# prepare data
data <- GDCprepare(query)
data

Starting to add information to samples
 => Add clinical information to samples
 => Adding TCGA molecular information from marker papers
 => Information will have prefix 'paper_' 
Available assays in SummarizedExperiment : 
  => unstranded
  => stranded_first
  => stranded_second
  => tpm_unstrand
  => fpkm_unstrand
  => fpkm_uq_unstrand

## generate TPM counts matrix
TPM <- assay(data,4) %>% data.frame()
TPM[1:3,1:3]