I have recount2 data from breast TCGA RNA-seq. The recount2 data file IDs are TCGA legacy UUIDs. Upon converting these legacy UUIDs to harmonized UUIDs, there are 5 duplicate aliquot UUIDs and aliquot barcodes (after exclusion of FFPE samples).
legacy UUID > harmonized aliquot UUID > aliquot barcode
Each of these 5 duplicated harmonized UUIDs has different FASTQ files and count data from the legacy archive. Does anybody have recommendations on how to handle these, or know why the same aliquot may have been sequenced twice?
It is difficult to know without speaking to the people who were actually involved in sequencing this particular patient's samples, which, as far as I can see, was performed at UNC. The likelihood is that they simply had money left to spend toward the end of the project and decided to sequence whatever else they could. Many (or all?) funding bodies prefer that you spend all of the money they have invested in you.
My preference would be to include them all in your analysis and check what happens when you, for example, perform PCA and generate a biplot. If the duplicates line up on top of each other in the plot space, then you can justify keeping one or all of them. If they do not group together, then there is an issue.
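As a rough illustration of that PCA check, here is a minimal sketch in Python using only NumPy. Everything in it is an assumption for demonstration purposes: the function names `pca_scores` and `duplicate_spread` are hypothetical, the barcodes are placeholders, the counts are simulated rather than taken from recount2, and the log2(count + 1) transform is just one reasonable choice. The idea is to project all samples onto the first two principal components and, for each aliquot barcode that appears more than once, compare the distance between its runs with the typical distance between any two samples.

```python
import numpy as np

def pca_scores(counts, n_components=2):
    """Project samples (rows) onto the top principal components of log2(count + 1)."""
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    centered = logged - logged.mean(axis=0)  # center each gene (column)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def duplicate_spread(scores, barcodes):
    """For each barcode seen more than once, return the largest PC-space
    distance between its runs, divided by the median distance between all
    pairs of samples. Values well below 1 suggest the runs sit together."""
    scores = np.asarray(scores)
    dists = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)
    typical = np.median(dists[np.triu_indices(len(scores), k=1)])
    result = {}
    for bc in sorted(set(barcodes)):
        idx = [i for i, b in enumerate(barcodes) if b == bc]
        if len(idx) > 1:
            result[bc] = dists[np.ix_(idx, idx)].max() / typical
    return result

# Simulated toy data (NOT real TCGA values): 4 distinct expression profiles,
# with the first profile sequenced twice to mimic a duplicated aliquot.
rng = np.random.default_rng(0)
profiles = rng.gamma(shape=2.0, scale=50.0, size=(4, 200))
counts = rng.poisson(profiles[[0, 0, 1, 2, 3]])
barcodes = ["ALIQUOT-A", "ALIQUOT-A", "ALIQUOT-B", "ALIQUOT-C", "ALIQUOT-D"]

scores = pca_scores(counts)
spread = duplicate_spread(scores, barcodes)
for bc, ratio in spread.items():
    print(f"{bc}: duplicate distance / typical distance = {ratio:.3f}")
```

With real data you would replace `counts` with your samples-by-genes count matrix and `barcodes` with the aliquot barcode for each row. There is no universal cutoff for the ratio; it is only a numeric companion to eyeballing the biplot.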
There are many cases like this in TCGA. Most just involve 'executive' decisions by you as the analyst as you work through everything (and obviously you should make a note of each one).