Question

Gene list from .bed file needed, please help!

0

6.9 years ago

o.hickman ▴ 10

I have bed files (from an ENCODE eCLIP experiment) in the format below.

I need to obtain a gene list from the chromosomal coordinates.

I have tried Galaxy, using the UCSC Table Browser knownGene and kgXref tables and join operations, but the gene list I get has clearly been duplicated in some way: some genes have the correct number of eCLIP tags, while others have thousands more than are evident when I view the .bed file in IGV.

Has anybody got a simple, up-to-date way of solving this? I do not code, so simple explanations please, if possible. Previous workflows in Galaxy have not worked.

Thanks in advance!!

Oliver

chr7 155100450 155100506 rep02 1000 + 4.49254608837777 22.7294143201152 -1 -1

chr7 155100424 155100441 rep02 1000 + 3.74937915504325 15.3042207236355 -1 -1

alignment RNA-Seq eCLIP iCLIP Galaxy • 2.3k views

ADD COMMENT • link updated 6.9 years ago by EagleEye 7.5k • written 6.9 years ago by o.hickman ▴ 10

1

For your next post, don't forget to specify that you don't use Linux. You are making things harder for yourself, because many bioinformatics tools are made for Linux. Some might be available on Windows as well, but that is not optimal.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

1

don't forget to specify that you don't use Linux

ADD REPLY • link 6.9 years ago by Pierre Lindenbaum 161k

0

If you happen to use the right Win10 version, you can use the Unix bash shell available there. But I do concur with @Wouter.

ADD REPLY • link 6.9 years ago by GenoMax 141k

score 0 · Answer 1 · 2017-06-07

0

6.9 years ago

Pierre Lindenbaum 161k

$ cat input |\
awk '{printf("select K.chrom,MIN(K.txStart),MAX(K.txEnd),X.geneSymbol from knownGene as K,kgXref as X where K.chrom=\"%s\" and NOT(K.txEnd <= %s or K.txStart >= %s) and K.name=X.kgId group by K.chrom,X.geneSymbol;\n",$1,$2,$3);}' |\
mysql -N --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19  |\
sort | uniq

ADD COMMENT • link 6.9 years ago by Pierre Lindenbaum 161k

0

Hi Pierre, Would you mind explaining that post? Thanks, Oliver

ADD REPLY • link 6.9 years ago by o.hickman ▴ 10

0

'input' is your BED file;
awk builds one MySQL query per BED line, fetching the chrom/start/end/geneSymbol of the overlapping genes from UCSC;
those queries are piped into mysql;
the duplicates are removed with sort | uniq.
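
To see what awk actually sends to the server, here is a minimal sketch that builds the query for one of the BED lines from the question and prints it instead of running it. The variable names bed_line and query are only for illustration, and the overlap test assumes that both the BED file and the UCSC tables use 0-based, half-open coordinates.

```shell
# One BED line copied from the question; only columns 1-3 (chrom, start, end) are used.
bed_line='chr7 155100450 155100506 rep02 1000 + 4.49254608837777 22.7294143201152 -1 -1'

# Build the SQL text with awk, as in the answer, but keep it in a variable
# instead of piping it into mysql. Assumption: half-open intervals, so a gene
# overlaps unless it ends at or before the start, or starts at or after the end.
query=$(printf '%s\n' "$bed_line" | awk '{printf("select K.chrom,MIN(K.txStart),MAX(K.txEnd),X.geneSymbol from knownGene as K,kgXref as X where K.chrom=\"%s\" and NOT(K.txEnd <= %s or K.txStart >= %s) and K.name=X.kgId group by K.chrom,X.geneSymbol;\n",$1,$2,$3);}')

# Show the generated query; nothing is sent to UCSC here.
printf '%s\n' "$query"
```

Nothing is contacted over the network in this sketch. Piping the printed text into the mysql command from the answer is what would actually run the query.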

ADD REPLY • link 6.9 years ago by Pierre Lindenbaum 161k

0

Hi, thanks Pierre. Is this using R? Thanks, Oliver

ADD REPLY • link 6.9 years ago by o.hickman ▴ 10

0

No, it is not. It uses cat/awk/sort/uniq, which are built into UNIX, plus the mysql client.
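
If only the gene names are wanted, the same UNIX tools can reduce the result to a unique list. This is a rough sketch, assuming the mysql output is tab-separated with the gene symbol in the fourth column; the rows below are made up for illustration, and GENE_A and GENE_B are placeholder names, not real results.

```shell
# Made-up example of the four tab-separated columns the pipeline prints
# (chrom, start, end, gene symbol). GENE_A and GENE_B are placeholders.
pipeline_output=$(printf 'chr7\t100\t900\tGENE_A\nchr7\t100\t900\tGENE_A\nchr7\t2000\t3000\tGENE_B\n')

# Keep only column 4 and collapse repeated symbols into a unique, sorted list.
gene_list=$(printf '%s\n' "$pipeline_output" | cut -f4 | sort -u)

printf '%s\n' "$gene_list"
```

Each gene then appears once, however many BED lines overlapped it.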

ADD REPLY • link 6.9 years ago by GenoMax 141k

score 0 · Answer 2 · 2017-06-07

0

6.9 years ago

EagleEye 7.5k

I assume you would like to associate your binding sites (from ENCODE eCLIP) with genes or different genomic locations. In that case you can use GREAT or HOMER's annotatePeaks (HOMER will provide more detailed results when you use your own GTF annotation).

ADD COMMENT • link 6.9 years ago by EagleEye 7.5k

0

Thanks for that, I don't have a Unix OS but if I get access to one I will try downloading HOMER and give it a try. O.