So I am piping my TCGA files through a for loop, and up until variant calling (MuTect2) these RNA-seq samples have been processed individually.
For those unfamiliar, CGHub (GeneTorrent) downloads TCGA files without any way of knowing which normal is matched to which tumor. The only lead is that the subdirectory each set of fastq files is downloaded into is named with its unique UUID.
Fortunately, I kept an Excel spreadsheet that lists each tumor next to its matched normal.
My question is, for someone unfamiliar with SQL: how can I code this pseudo-algorithm?
- Recognize a tumor's UUID and then find its matched normal's UUID (in Excel).
- Identify the corresponding subdirectories (and their fastq files) and assign the two files to the variables `$TUMOR` and `$NORMAL`. (I would like to use these for all the files, nesting this inside a for loop.)
- Feed them into MuTect's options `Input:tumor $TUMOR` and `Input:normal $NORMAL`.
- Then assign a filename (unique to this set of samples) to a variable `$SET`.
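The steps above could be sketched in bash roughly as follows. This is a minimal sketch under assumptions that are not in the post: the spreadsheet has been exported from Excel as a tab-separated file with one pair per row (tumor UUID, then normal UUID), and each UUID subdirectory holds one file matching a glob such as `*.bam`. The function names `first_file` and `process_pairs`, and the `_vs_` naming used for `$SET`, are hypothetical.

```bash
#!/usr/bin/env bash

# Print the first regular file in directory $1 matching glob $2 (default: *).
# Prints nothing and returns 1 if nothing matches.
first_file() {
    local dir=$1 pattern="${2:-*}" f
    for f in "$dir"/$pattern; do
        if [[ -f $f ]]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Read a tab-separated pairs file (ASSUMED layout: tumor UUID, normal UUID)
# and print one line per usable pair: SET <TAB> TUMOR <TAB> NORMAL.
process_pairs() {
    local pairs_file=$1 data_dir=$2 pattern="${3:-*}"
    local tumor_uuid normal_uuid TUMOR NORMAL SET
    while IFS=$'\t' read -r tumor_uuid normal_uuid _ || [[ -n $tumor_uuid ]]; do
        normal_uuid=${normal_uuid%$'\r'}   # drop a Windows line ending, if any
        # Skip blank or incomplete rows and an (assumed) header row.
        [[ -z $tumor_uuid || -z $normal_uuid || $tumor_uuid == tumor_uuid ]] && continue
        TUMOR=$(first_file "$data_dir/$tumor_uuid" "$pattern") || TUMOR=
        NORMAL=$(first_file "$data_dir/$normal_uuid" "$pattern") || NORMAL=
        SET="${tumor_uuid}_vs_${normal_uuid}"   # hypothetical naming scheme
        if [[ -z $TUMOR || -z $NORMAL ]]; then
            echo "skipping $SET: file not found" >&2
            continue
        fi
        # Placeholder: the real variant-calling command would go here.
        printf '%s\t%s\t%s\n' "$SET" "$TUMOR" "$NORMAL"
    done < "$pairs_file"
}
```

Calling `process_pairs pairs.tsv /path/to/downloads '*.bam'` (both paths are placeholders) prints one line per pair. The same three variables are reassigned on every pass, so the `printf` line is where `$TUMOR`, `$NORMAL` and `$SET` would be handed to the variant caller.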
EDIT: I am trying to solve this since it's an interesting problem. I realize that storing the directory name first would be better. Then I would find that string in the Excel spreadsheet (is it okay if I have multiple sheets?) and look for the string `NT` (normal) or `TP` (tumor): if it is `TP`, assign the file to `$TUMOR`; if not, assign it to `$NORMAL`.
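The lookup described in this edit might look something like the sketch below. It assumes, beyond what the post states, that the spreadsheet has been exported as a tab-separated file with the UUID in column 1 and the sample type (`TP` or `NT`) in column 2; if there are multiple sheets, each could be exported separately and concatenated into one file. `classify_uuid` and `classify_dirs` are hypothetical names.

```bash
#!/usr/bin/env bash

# Print TUMOR, NORMAL or UNKNOWN for one UUID.
# ASSUMED sheet layout: column 1 = UUID, column 2 = TP or NT, tab-separated.
classify_uuid() {
    local uuid=$1 sheet=$2
    awk -F'\t' -v id="$uuid" '
        $1 == id {
            sub(/\r$/, "", $2)              # tolerate Windows line endings
            if ($2 == "TP")      print "TUMOR"
            else if ($2 == "NT") print "NORMAL"
            else                 print "UNKNOWN"
            found = 1
            exit
        }
        END { if (!found) print "UNKNOWN" }
    ' "$sheet"
}

# Walk every subdirectory of data_dir (each named by its UUID) and
# print "UUID <TAB> label" for it.
classify_dirs() {
    local data_dir=$1 sheet=$2 dir uuid
    for dir in "$data_dir"/*/; do
        [[ -d $dir ]] || continue
        uuid=$(basename "$dir")
        printf '%s\t%s\n' "$uuid" "$(classify_uuid "$uuid" "$sheet")"
    done
}
```

This labels each directory, but it does not yet say which normal belongs to which tumor; that needs a shared key or a pair-per-row layout in the spreadsheet.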
My problem now is: how do I associate each tumor with its matching normal? Would it be easier to assign them to `$TUMOR_A`, `$NORMAL_A`, `$TUMOR_B`, `$NORMAL_B`, and so on, all before entering the for loop, instead of reusing `$TUMOR` and `$NORMAL` repeatedly inside the loop? (I can't see how this would be possible, to be honest.)
The only thing I can work with is the line between each set. If there are other ways to attack this problem, please let me know.
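One possible way to tie each tumor to its normal, sketched under an assumption the post does not confirm: the exported sheet has one sample per row with three tab-separated columns (UUID, a shared pair identifier such as a patient ID, and `TP`/`NT`). The pair identifier column and the function name `pair_by_key` are hypothetical.

```bash
#!/usr/bin/env bash

# Group samples by a shared key and print one line per complete pair:
#   pair_id <TAB> tumor_uuid <TAB> normal_uuid
# ASSUMED input columns: UUID, pair_id, type (TP or NT), tab-separated.
# Keys that lack either a tumor or a normal are silently dropped.
pair_by_key() {
    awk -F'\t' '
        { sub(/\r$/, "", $3) }              # tolerate Windows line endings
        $3 == "TP" { tumor[$2]  = $1 }
        $3 == "NT" { normal[$2] = $1 }
        END {
            for (k in tumor)
                if (k in normal)
                    print k "\t" tumor[k] "\t" normal[k]
        }
    ' "$1" | sort
}
```

Each output row is one matched pair, so a loop such as `pair_by_key samples.tsv | while IFS=$'\t' read -r SET tumor_uuid normal_uuid; do ...; done` can reassign the same variables on every pass. Under these assumptions, separate `$TUMOR_A`, `$TUMOR_B` variables are not needed.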