Question

How do I filter out transcripts with less than two exons

0

Entering edit mode

5.9 years ago

alex.v.nesta ▴ 50

Hi,

I have a GFF file containing MCF-7 cell transcript data. There is also a fastq file if that can be helpful.

chr1 PacBio transcript 27567 29338 . - . gene_id "PB2015.1"; transcript_id "PB2015.1.1"; chr1 PacBio exon 27567 29338 . - . gene_id "PB2015.1"; transcript_id "PB2015.1.1";

These are the first two lines.

You can see that it identifies both transcripts and exons. I want to filter out any transcripts containing less than two exons for each transcript.

Additional Info:

I have attempted doing this with UCSC Table Browser and Galaxy. They both end up throwing errors.

Galaxy Error: An error occurred with this dataset: Traceback (most recent call last): File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/gff/gff_filter_by_feature_count.py", line 182, in <module> __main__() File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/gff/gff_filter_by_feature_c

File "/cvmfs/main.galaxyproject.org/galaxy/lib/galaxy/datatypes/util/gff_util.py", line 191, in __next__ self.seed_interval = GenomicIntervalReader.next(self) RuntimeError: maximum recursion depth exceeded while calling a Python object

Filter 18: MCF7 hg19.gff
Using feature name exon
With following condition >1

Table browser doesn't list exon as a possible filter option when I upload this dataset as a custom track.

I am very new to this. Does anyone have any suggestions for me here? I can use R pretty fluently and I have a little bit of python ability. I also have bedtools set up but I don't know how to use it very well.

Please point me in the right direction!

Thanks, Alex.

bedtools gff bed galaxy • 2.7k views

ADD COMMENT • link updated 5.9 years ago by cpad0112 21k • written 5.9 years ago by alex.v.nesta ▴ 50

1

Entering edit mode

awk '{ if ($3 == "exon") print $0; }'your_gtf_file | grep -o "transcript_id [^;]\+;" | awk -v FS=" " '{ arr[substr($2, 2, length($2)-3)] += 1; } END { for (t in arr) { if (arr[t] > 1) { print t;} } }'

will yield a list of the ids of all multi-exon transcripts. Is that what you're looking for? (I noticed the line format you provided above is gtf not gff, btw.)

ADD REPLY • link 5.9 years ago by cschu181 ★ 2.8k

0

Entering edit mode

I just typed out a nice comment and accidentally pressed cancel instead of add comment. Lost everything. Super pissed.

Summary: Unfortunately your script identified all transcripts as multi exon. Thanks for your efforts. I think I can figure this out in R now though.

Thanks for pointing out this is GFFv2/GTF format. Now I am able to import and export win R without corruption.

here is some toy data. 2 transcripts to keep and 2 to eliminate. In case you can figure out what went wrong in your script. I will update if I can get this sorted in R. Galaxy still errors out even though I said it was GTF format.

chr1    PacBio  transcript  27567   29338   .   -   .   gene_id "PB2015.1"; transcript_id "PB2015.1.1";
chr1    PacBio  exon    27567   29338   .   -   .   gene_id "PB2015.1"; transcript_id "PB2015.1.1";
chr1    PacBio  transcript  335264  336700  .   +   .   gene_id "PB2015.2"; transcript_id "PB2015.2.1";
chr1    PacBio  exon    335264  336700  .   +   .   gene_id "PB2015.2"; transcript_id "PB2015.2.1";
chr1    PacBio  transcript  661285  663852  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  exon    661285  662523  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  exon    662866  663852  .   +   .   gene_id "PB2015.3"; transcript_id "PB2015.3.1";
chr1    PacBio  transcript  661268  663688  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.1";
chr1    PacBio  exon    661268  662168  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.1";
chr1    PacBio  exon    663125  663688  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.1";
chr1    PacBio  transcript  661290  663352  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.2";
chr1    PacBio  exon    661290  662469  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.2";
chr1    PacBio  exon    662928  663352  .   -   .   gene_id "PB2015.4"; transcript_id "PB2015.4.2";

ADD REPLY • link 5.9 years ago by alex.v.nesta ▴ 50

0

Entering edit mode

Hmm, when I run this on your toy data, it returns only the three real multi-exon transcripts, i.e. PB2015.3.1, PB2015.4.1, PB2015.4.2.

ADD REPLY • link 5.9 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Sorry I wasn't able to make your script work. I can take another look at it in a bit. But I still have good news! I managed to sort out the issue in R! verified elimination of single exon transcripts comparing the two tracks using IGV.

This honestly wouldn't have been possible without your noticing the format.

library(rtracklayer)
library(tidyverse)
MCF7 <- as.tibble(import("MCF7hg19.gtf"))

MultiExons <- MCF7 %>% group_by(transcript_id) %>% filter(n() > 2)

nrow(MultiExons)
nrow(MCF7)

export(MultiExons, "MCFMultiTranscripts.gtf", format="GTF")

ADD REPLY • link 5.9 years ago by alex.v.nesta ▴ 50

0

Entering edit mode

Well, that is good that you got a solution and that's what counts. I really need to get my R sorted these days...

ADD REPLY • link 5.9 years ago by cschu181 ★ 2.8k

score 0 · Answer 1 · 2018-06-03

output:

$ sed 's/\s/\t/g' ./gencode_v22_grch38_annotation/gencode.v28.annotation.gtf | awk '$3 == "exon"'| datamash -s  -g 12 count 3 | awk '$2 < 2' | head
"ENST00000194152.3";    1
"ENST00000194155.6";    1
"ENST00000209540.2";    1
"ENST00000216407.4";    1
"ENST00000216743.6";    1
"ENST00000218249.7";    1
"ENST00000222033.5";    1
"ENST00000223076.2";    1
"ENST00000225587.1";    1
"ENST00000229030.4";    1

input:

$ head ./gencode_v22_grch38_annotation/gencode.v28.annotation.gtf

##description: evidence-based annotation of the human genome (GRCh38), version 28 (Ensembl 92)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2018-03-23
chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "RP11-34P13.1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";