Question

Getting these files from different parts of genome

1

5.2 years ago

zizigolu ★ 4.3k

Hi,

To run ActiveDriverWGS I need the coding or non-coding parts of the genome in BED12 format. I have found the coding part of the genome (although in txt format), but I don't know how to get the non-coding part in BED12 format, or transcription factor binding sites in BED4 format. I have contacted the developer but got no response. Any suggestions, please?

BED WGS genome SNP R • 2.4k views

ADD COMMENT • link 5.1 years ago by zizigolu ★ 4.3k

3

I will need coding or non coding parts of genome in BED12 format. I

Are you sure that's what you want? The documentation says: "Regions of interest can be coding or noncoding should be in a BED12 format", so you basically need a BED file of the regions for which you want to do the analysis.

I also did not get the impression that TF binding sites are required. They might be nice to have, but for that you would have to identify the TF of interest first (and search e.g. ENCODE for the respective binding sites).
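
As a rough illustration of "a BED file of the regions", the sketch below pads a BED4 file of regions out to the 12 BED columns, treating each region as a single block. The region names and coordinates are made up, and whether ActiveDriverWGS accepts single-block records like these is an assumption to check against its documentation.

```shell
# Hypothetical regions of interest in BED4 (chrom, start, end, name).
regions=$(printf 'chr1\t1000\t1500\tpromoter_A\nchr2\t200\t900\tenhancer_B\n')

# Emit one-block BED12 records. Column order:
# chrom, start, end, name, score, strand, thickStart, thickEnd,
# itemRgb, blockCount, blockSizes, blockStarts
bed12=$(printf '%s\n' "$regions" |
  awk 'BEGIN { FS = OFS = "\t" }
       { print $1, $2, $3, $4, 0, ".", $2, $3, 0, 1, $3 - $2, 0 }')

printf '%s\n' "$bed12"
```

The block size is simply end minus start and the single block starts at offset 0, so each record describes one contiguous region.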

ADD REPLY • link 5.2 years ago by Friederike 8.9k

1

I don't know a lot about this software, but it appears to take "regions of interest" rather than whole-genome information: https://github.com/reimandlab/ActiveDriverWGS

I would also recommend using the UCSC Table Browser to get BED output for these regions.

ADD REPLY • link 5.2 years ago by cmdcolin ★ 3.8k

0

If you mean GitHub issue #10, it looks like the developer has responded?

ADD REPLY • link 5.2 years ago by zx8754 11k

0

Thank you. How can I find the non-coding part of the genome?

For example, when downloading this software we get the coding part of the genome (although in txt format):

wget https://bitbucket.org/bbglab/oncodrivefml/downloads/oncodrivefml-examples_v2.0.tar.gz

But I don't know where I can find the non-coding part of the genome.

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

0

The non-coding part of the genome is everything which is not... coding. So it would essentially be the complement of the BED file of the coding sequences. But that is unlikely to be what you need for your tool. See also Friederike's comment: you just need regions of interest.

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

Thank you, but I have already identified driver genes in the coding part of the genome with another tool. Now I need to do the same for the non-coding part, for which I need a file containing the non-coding regions of the human genome, and I don't know how to get it.

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

No, it is unlikely that your tool just expects a BED file of all non-coding regions in the human genome. But anyway, if you insist: the answer is bedtools complement.
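
To make the idea of a complement concrete, here is a minimal sketch on toy data. It assumes a BED file of coding intervals and a two-column chromosome sizes file; the file names and coordinates are hypothetical. The commented line shows the usual form of the bedtools call, and the awk part is a plain re-implementation for illustration, not bedtools itself.

```shell
tmp=$(mktemp -d)

# Toy inputs (hypothetical): three coding intervals and one chromosome size.
printf 'chr1\t100\t200\nchr1\t150\t300\nchr1\t500\t600\n' > "$tmp/coding.bed"
printf 'chr1\t1000\n' > "$tmp/genome.txt"

# With bedtools installed, the equivalent call on a sorted BED file would be:
#   bedtools complement -i coding.sorted.bed -g genome.txt

# Plain awk version: read chromosome sizes first, then walk the sorted
# intervals and print every stretch that no interval covers.
noncoding=$(sort -k1,1 -k2,2n "$tmp/coding.bed" |
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { size[$1] = $2; next }   # first input: chromosome sizes
       $1 != chrom {                        # first interval on a new chromosome
         if (chrom != "" && pos < size[chrom]) print chrom, pos, size[chrom]
         chrom = $1; pos = 0
       }
       $2 > pos { print chrom, pos, $2 }    # uncovered gap before this interval
       $3 > pos { pos = $3 }                # extend the covered position
       END { if (chrom != "" && pos < size[chrom]) print chrom, pos, size[chrom] }
      ' "$tmp/genome.txt" -)

printf '%s\n' "$noncoding"
rm -r "$tmp"
```

Note that the result is a three-column BED of gaps, not BED12, and that this sketch skips chromosomes with no intervals at all.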

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

Sorry, what is the input here when the expected output is non coding in BED12?

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

0

Spend some time reading our comments here and the documentation of bedtools complement. I'm not coming to sit next to you and do your work.

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

:(

The same story

You only once sat next to me and did my work, when I was in Germany for interview

you and Genomax

Thank you

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Well, I'm sure you can figure this out :-)

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

Sorry,

The coding and non-coding regions of the human genome are likely here:

https://www.gencodegenes.org/human/release_19.html

I converted the GTF to BED with BEDOPS,

so I have this:

chr1    29553   30039   ENSG00000243485.2   .   +   HAVANA  exon    .   gene_id "ENSG00000243485.2"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-001"; exon_number 1;  exon_id "ENSE00001947070.1";  level 2; tag "not_best_in_genome_evidence"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

How can I extract the following information from this BED file for every line, as shown here for the first line?

chr1    29553   30039   ENSG00000243485.2   +  gene_name "MIR1302-11"

I asked my question in another forum, but they closed my post :(

I tried reading my BED file as txt to extract what I want, but I got an error:

> paste(strsplit(regions.txt, "\\s+|\t|\\\"")[[1]][c(1,2,3,4,5,6,26,28)],collapse="\t")
Error in strsplit(a, "\\s+|\t|\\\"") : non-character argument
>
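
The R error itself only says that strsplit was given something that is not a character vector. As an alternative, the requested columns can be pulled out with awk. The sketch below assumes the BEDOPS output is tab-delimited with the GTF attributes in column 10, as the example line suggests; the attribute string is shortened here and the "NA" fallback is a choice made for this example.

```shell
# One record in the layout shown above (attributes abbreviated).
record=$(printf 'chr1\t29553\t30039\tENSG00000243485.2\t.\t+\tHAVANA\texon\t.\tgene_id "ENSG00000243485.2"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "MIR1302-11"; transcript_type "lincRNA";')

# Keep chrom, start, end, gene ID and strand, then append the gene_name
# attribute found by key rather than by position.
extracted=$(printf '%s\n' "$record" |
  awk 'BEGIN { FS = OFS = "\t" }
       {
         name = "gene_name \"NA\""            # fallback if the key is missing
         if (match($10, /gene_name "[^"]*"/)) name = substr($10, RSTART, RLENGTH)
         print $1, $2, $3, $4, $6, name
       }')

printf '%s\n' "$extracted"
```

Looking the attribute up by name avoids counting whitespace-separated tokens, which breaks as soon as a record has a different number of attributes.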

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

You are mixing up terminology.

'Coding' and 'non-coding' are confusing terms, because they can mean multiple things. In transcriptomics, people subgroup transcripts into coding and non-coding transcripts, meaning "do these RNA molecules get translated to a protein?". Here, a non-coding transcript means every transcript that does not lead to a protein (as far as we know!).

In genomics, however, regions of the DNA are subgrouped into coding and non-coding, roughly meaning "does this sequence get transcribed to an RNA molecule?". Here, a non-coding fragment means every piece of DNA that does not lead to a transcript (as far as we know!).

I'd suggest being complete with regards to what you are looking for. I don't like the term "non-coding transcript". For me it is a "non-protein-coding transcript". The transcript is coding (=has a functional product) but it just doesn't create a protein.

It seems to me you are looking for non-coding DNA regions, while what you found on Gencode are non-protein-coding transcripts.

(Note that my comment here ignores biological noise: random transcription without function. The extent of this phenomenon is an open debate.)

ADD REPLY • link 5.2 years ago by WouterDeCoster 47k

0

I second everything Wouter wrote. I think we need to clarify what types of regions you actually want to look at using the tool (not what the tool says it needs, tell us what the goal of your analysis is).

ADD REPLY • link 5.2 years ago by Friederike 8.9k

0

Thank you

Does the long non-coding RNA gene annotation correspond to non-coding DNA regions?

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

0

Wouter has addressed precisely that question.

DNA:

- coding DNA = genes = the basis for RNA transcripts (in mammals, this is a small fraction of the genome!)
  - non-coding genes (a misnomer!) encode RNAs that do not give rise to proteins, e.g. snoRNA, miRNA, rRNA, tRNA...
  - protein-coding RNA genes
- non-coding DNA = not genic = intergenic = no RNA transcripts (except for the transcriptional noise mentioned by Wouter)

ADD REPLY • link 5.2 years ago by Friederike 8.9k

score 2 · Accepted Answer · 2019-02-24

Sorry, I finally worked out what I want.

I actually need enhancers, promoters, or other regulatory elements of the human genome, to find cancer driver genes located in these regions.

People say that

 awk '$3=="transcript" && 
     $20!="\"protein_coding\";" &&
     $20!="\"translated_processed_pseudogene\";"' gencode.gtf

will return the non-coding transcripts, for example:

awk '$3=="transcript" && $20!="\"protein_coding\";"{print $20}' gencode.gtf  | sort | uniq -c | sort -nk1
      1 "translated_processed_pseudogene";
      2 "Mt_rRNA";
      3 "IG_J_pseudogene";
      3 "TR_D_gene";
      4 "TR_J_pseudogene";
      5 "TR_C_gene";
     10 "IG_C_pseudogene";
     18 "IG_C_gene";
     18 "IG_J_gene";
     22 "Mt_tRNA";
     25 "3prime_overlapping_ncrna";
     27 "TR_V_pseudogene";
     37 "IG_D_gene";
     58 "non_stop_decay";
     59 "polymorphic_pseudogene";
     74 "TR_J_gene";
     97 "TR_V_gene";
    144 "IG_V_gene";
    182 "unitary_pseudogene";
    196 "IG_V_pseudogene";
    330 "sense_overlapping";
    387 "pseudogene";
    442 "transcribed_processed_pseudogene";
    531 "rRNA";
    802 "sense_intronic";
    860 "transcribed_unprocessed_pseudogene";
   1529 "snoRNA";
   1923 "snRNA";
   2050 "misc_RNA";
   2549 "unprocessed_pseudogene";
   3116 "miRNA";
   9710 "antisense";
  10623 "processed_pseudogene";
  11780 "lincRNA";
  13052 "nonsense_mediated_decay";
  25955 "retained_intron";
  28082 "processed_transcript";
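
The `$20` test above relies on transcript_type always being the 20th whitespace-separated token. A hedged variant that looks the key up by name is sketched below on a few made-up GTF records; the gene and transcript IDs are placeholders, not real GENCODE entries.

```shell
# Toy GTF records (hypothetical IDs); the fourth one lists its attributes
# in a different order on purpose.
gtf=$(
  printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; transcript_type "lincRNA";\n'
  printf 'chr1\tHAVANA\ttranscript\t2000\t5000\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; transcript_type "protein_coding";\n'
  printf 'chr1\tHAVANA\texon\t2000\t2500\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; transcript_type "protein_coding";\n'
  printf 'chr2\tENSEMBL\ttranscript\t300\t400\t.\t+\t.\ttranscript_type "miRNA"; gene_id "G3"; transcript_id "T3";\n'
  printf 'chr2\tHAVANA\ttranscript\t700\t1500\t.\t+\t.\tgene_id "G4"; transcript_id "T4"; transcript_type "lincRNA";\n'
)

# Tally transcript types other than protein_coding, finding the attribute
# by its key in column 9 instead of by token position.
counts=$(printf '%s\n' "$gtf" |
  awk -F'\t' '$3 == "transcript" && match($9, /transcript_type "[^"]*"/) {
         # skip the 17-character prefix (key, space, opening quote)
         type = substr($9, RSTART + 17, RLENGTH - 18)
         if (type != "protein_coding") print type
       }' | sort | uniq -c | sort -k1,1n)

printf '%s\n' "$counts"
```

These are still transcript biotypes, so a tally like this says nothing about where enhancers or promoters lie.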

But I am not sure which of these regions correspond to enhancers, promoters, or other regulatory elements.

score 1 · Accepted Answer · 2019-02-24

1

5.1 years ago

zizigolu ★ 4.3k

https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMjE3NzgvZWxpZmUtMjE3Nzgtc3VwcDQtdjMueGxzeA==/elife-21778-supp4-v3.xlsx?_hash=KQi5jfO3kT2c4Qw44j4Rg6YAyCBQilYuWHVYXcRDuuo%3D

ADD COMMENT • link 5.1 years ago by zizigolu ★ 4.3k