I was trying to extract novel junctions from the TopHat junctions.bed file, so I wrote a python script and I did that in two steps: first, I parsed GTF file and grouped all the exons (after filtering out CDS and start/stop codons) by gene_id and transcript_id and created a list of all possible junctions for each transcript by taking the ending and starting coordinates of adjacent exons. Second, I made a list of all true junction coordinates from the TopHat bed file by taking into account overhangs (refer to this post). Then, when I finally did the intersection of these lists, I've got an empty list. Does anyone have any idea what was wrong in my approach?
Thanks in advance, Ana