Question

TCGA: Using Bash to assign a tumor to its matched normal

8.2 years ago

umn_bist ▴ 390

I am piping my TCGA files through a for loop, and up until variant calling (MuTect2) these RNA-seq samples have been processed individually.

For those unfamiliar, CGHub (GeneTorrent) downloads TCGA files without any indication of which normal matches which tumor. The only lead is that the subdirectory each set of fastq files is downloaded into is named with that sample's unique UUID.

Fortunately, I kept an Excel spreadsheet that lists each tumor next to its matched normal.

My question is: as someone unfamiliar with SQL, how can I code this pseudo-algorithm?

1. Recognize a tumor's UUID and find its matched normal's UUID (in the Excel spreadsheet).
2. Identify the corresponding subdirectories (and their fastq files) and assign the two files to the $TUMOR and $NORMAL variables (I would like to do this for all the files by nesting it inside a for loop).
3. Feed them into MuTect2's options Input:tumor $TUMOR and Input:normal $NORMAL.
4. Assign a filename (unique to this pair of samples) to a variable $SET.
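
The steps above could be sketched in Bash roughly as follows. This is only a sketch under several assumptions that are not in the original post: the spreadsheet has been exported to a CSV with one `tumor_uuid,normal_uuid` pair per line and no header, each UUID subdirectory sits directly under one download directory, and the function names `find_sample_file` and `pair_samples` are made up for illustration. The commented usage at the end only prints a GATK 3 style MuTect2 command; the file names and flags there are placeholders to adapt to your own setup.

```bash
#!/usr/bin/env bash
# Sketch: pair tumor/normal files using a CSV exported from the spreadsheet.
# Assumed (hypothetical) CSV layout, one pair per line, no header:
#   tumor_uuid,normal_uuid

# Print the first file under <data_dir>/<uuid>/ whose name matches a pattern.
find_sample_file() {
    local data_dir="$1" uuid="$2" pattern="$3"
    [ -d "$data_dir/$uuid" ] || return 0          # no such subdirectory: print nothing
    find "$data_dir/$uuid" -type f -name "$pattern" | sort | head -n 1
}

# For each CSV row, print: SET <tab> TUMOR <tab> NORMAL
pair_samples() {
    local csv="$1" data_dir="$2" pattern="${3:-*.fastq}"
    local tumor_uuid normal_uuid TUMOR NORMAL SET
    while IFS=, read -r tumor_uuid normal_uuid _ || [ -n "$tumor_uuid" ]; do
        # Strip the Windows line endings that Excel exports often carry
        tumor_uuid="${tumor_uuid%$'\r'}"
        normal_uuid="${normal_uuid%$'\r'}"
        if [ -z "$tumor_uuid" ] || [ -z "$normal_uuid" ]; then
            continue                               # skip blank or incomplete rows
        fi
        TUMOR=$(find_sample_file "$data_dir" "$tumor_uuid" "$pattern")
        NORMAL=$(find_sample_file "$data_dir" "$normal_uuid" "$pattern")
        if [ -z "$TUMOR" ] || [ -z "$NORMAL" ]; then
            echo "skipping $tumor_uuid: file not found for tumor or normal" >&2
            continue
        fi
        SET="${tumor_uuid}_vs_${normal_uuid}"      # name unique to this pair
        printf '%s\t%s\t%s\n' "$SET" "$TUMOR" "$NORMAL"
    done < "$csv"
}

# Usage sketch (placeholder paths; this only prints the command it would run):
# pair_samples pairs.csv /path/to/downloads | while IFS=$'\t' read -r SET TUMOR NORMAL; do
#     echo java -jar GenomeAnalysisTK.jar -T MuTect2 -R reference.fasta \
#         -I:tumor "$TUMOR" -I:normal "$NORMAL" -o "$SET.vcf"
# done
```

Because `$TUMOR`, `$NORMAL`, and `$SET` are reassigned on every iteration, the same three variable names can be reused for every pair. The third argument to `pair_samples` lets the file pattern be changed (for example to match the processed files from earlier pipeline steps rather than the raw fastq).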

EDIT: I am still trying to solve this since it's an interesting problem. I realize that storing the directory name first would be better, then finding that string in the Excel spreadsheet (is it okay if I have multiple sheets?), and then checking whether the entry is marked NT (normal) or TP (tumor): if TP, assign it to $TUMOR; if not, to $NORMAL.

My problem now is: how do I associate each tumor with its matching normal? Would it be easier to assign everything to $TUMOR_A, $NORMAL_A, $TUMOR_B, $NORMAL_B, and so on before entering the for loop, instead of reusing $TUMOR and $NORMAL inside the loop (which, to be honest, I can't see how to do)?

The only thing I can work with is the line between each set. If there are other ways to attack this problem, please let me know.
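
One possible way to associate tumors and normals without numbered variables such as $TUMOR_A and $NORMAL_A is to group rows by a shared identifier. The sketch below assumes a CSV export with three columns, `uuid,sample_type,pair_id`, where `sample_type` is TP or NT and `pair_id` is a hypothetical column holding whatever value a tumor and its matched normal have in common. Neither that layout nor the function name `pair_by_type` comes from the original post. It uses awk (which has associative arrays) to print one `tumor_uuid,normal_uuid` line per complete pair.

```bash
#!/usr/bin/env bash
# Sketch: turn per-sample rows into tumor,normal pairs.
# Assumed (hypothetical) CSV layout, no header:
#   uuid,sample_type,pair_id      (sample_type is TP or NT)

pair_by_type() {
    awk -F, '
        { sub(/\r$/, "") }                 # drop Windows line endings from Excel exports
        $2 == "TP" { tumor[$3]  = $1 }     # remember the tumor UUID for this pair_id
        $2 == "NT" { normal[$3] = $1 }     # remember the normal UUID for this pair_id
        END {
            # Emit only pair_ids that have both a tumor and a normal
            for (id in tumor)
                if (id in normal)
                    printf "%s,%s\n", tumor[id], normal[id]
        }
    ' "$1" | sort
}

# Usage sketch: reuse the same two variables on every iteration.
# pair_by_type samples.csv | while IFS=, read -r TUMOR_UUID NORMAL_UUID; do
#     echo "tumor=$TUMOR_UUID normal=$NORMAL_UUID"
# done
```

If a pair_id has more than one TP or NT row, this sketch keeps only the last one seen, so repeated samples would need extra handling.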

RNA-Seq TCGA MuTect2 • 1.6k views

updated 21 months ago by Ram 43k • written 8.2 years ago by umn_bist ▴ 390