Biostar Beta. Not for public use.
Question: How to annotate gene names for genomic coordinates?
4
Entering edit mode

I want to annotate gene names for a segmented file by finding overlap with it with a reference file. Would like to know how can it be done using GenomicRanges Package or any other bioconductor packages.

Segmented_file

Chromosome Start End
1 5930000 11730000
1 16850000 18010000

reference_file

Chr Start End Gene
1 5930500 6230500 SPSB1
1 6930500 7340500 SPSB2
1 16854500 16950000 TAS1R1
1 17810032 17910064 ENO1

Expected results

Chromosome Start End Gene
1 5930000 11730000 SPSB1,SPSB2
1 16850000 18010000 TAS1R1,ENO1
ADD COMMENTlink 4.8 years ago ifudontmind_plzz • 120 • updated 4.8 years ago seidel 6.8k
Entering edit mode
2

If you can use the command line, you can do this in a simple one-liner with BEDOPS bedmap:

$ bedmap --echo --echo-map-id-uniq segmented_file reference_file > answer.bed

The file answer.bed contains results in the format you expect, except that gene names are delimited with semi-colons. You can change this to commas, if you like, by specifying the --multidelim option.

Just make sure first that segmented_file and reference_file are sorted BED files:

$ sort-bed segmented_file.unsorted > segmented_file

Etc.

ADD REPLYlink 4.8 years ago
Alex Reynolds
28k
7
Entering edit mode

Yes, you can use the GenomicRanges library in R to do this. Here is an example:

library(GenomicRanges)
# generate 10 random segments
regions.gr <- GRanges(seqnames=Rle(rep(1,10)), ranges=IRanges(start=sample(1:1000,10), width=100), strand=Rle(rep("*",10)))

# generate 5 random genes
fiveGeneNames <- paste(sample(letters,5), sample(1:100,5), sep="")
genes.gr <- GRanges(seqnames=Rle(rep(1,5)), ranges=IRanges(start=sample(1:1000,5), width=100, names=fiveGeneNames), strand=Rle(sample(c("+","-"),5, replace=T)))

# what regions overlap what genes?
overlapGenes <- findOverlaps(regions.gr, genes.gr)

# Return any genes with an overlap.
# Convert the resulting "Hits" object to a data frame
# and use it as an index
overlapGenes.df <- as.data.frame(overlapGenes)
names(genes.gr[overlapGenes.df$subjectHits])

# extract the regions that hit genes
regionsWithHits <- as.data.frame(regions.gr[overlapGenes.df$queryHits])
# add the names of the genes
regionsWithHits$genes <- names(genes.gr)[overlapGenes.df$subjectHits]

# Some segments might not overlap any gene.
# Find the distance to the nearest gene
distanceToNearest(regions.gr, genes.gr)

# if your genes and segments were in a bed file you could easily import them
library(rtracklayer)
regions <- import.bed("myRegions.bed",asRangedData=FALSE)
genes <- import.bed("myGenes.bed",asRangedData=FALSE)
ADD COMMENTlink 3.3 years ago seidel 6.8k
Entering edit mode
0

Nice one. I tried to use R GRanges for a while, but it was unintuitive as hell. I wanted to merge a de novo annotation to a reference one.

ADD REPLYlink 4.8 years ago
cyril-cros
• 890
Entering edit mode
0

And if you have several genes in a cnv range??? How do you solve this?? thanks a lot

ADD REPLYlink 3.0 years ago
julrodr80
• 0
Entering edit mode
0

You mean instead of segments to genes, you want to know what genes appear within a given segment? The answer is above. Your cnv range (i.e. a segment) will return the index of the genes it overlaps. However, it can also be useful to turn the problem around, and use your genes as the query. i.e. for all genes, which ones overlap which segments. If neither of these answers your question - then I don't understand your question.

ADD REPLYlink 2.9 years ago
seidel
6.8k
4
Entering edit mode

In R you can use the genomeIntervals package and the interval_overlap function.

Outside of R I would recommend ' bedtools intersect'.

ADD COMMENTlink 4.8 years ago David Langenberger 8.9k
Entering edit mode
0

@david, could you specify an example for how to use genomeIntervals package in this case ?

ADD REPLYlink 4.8 years ago
ifudontmind_plzz
• 120
Entering edit mode
3

For two files in bed format, you could do something like this (untested):‚Äč

library(genomeIntervals);
args <- commandArgs(TRUE)

genes.file = args[1]
segmented.file  = args[2]
output.file = args[3]

segments <- read.table(segmented.file, stringsAsFactors=F)
segments.ivals <- new("Genome_intervals_stranded", as.matrix(segments[,2:3]), closed=T, annotation=data.frame(seq_name=segments[,1], inter_base=F, strand=factor(segments[,6], levels=c("+", "-"))))

genes <- read.table(genes.file, stringsAsFactors=F)
genes.ivals <- new("Genome_intervals_stranded", as.matrix(genes[,2:3]), closed=T, annotation=data.frame(seq_name=genes[,1], inter_base=F, strand=factor(genes[,6], levels=c("+", "-"))))

n <- interval_overlap(segments.ivals, genes.ivals)
n_idx <- which(unlist(lapply(n, length)) > 0)

write.table(segments[n_idx,], output.file,row.names=F,col.names=F,append=T,quote=F,sep="\t")
ADD REPLYlink 4.8 years ago
David Langenberger
8.9k
3
Entering edit mode

EDIT: this answer does not use R, but R is not so good at building new annotations. You can do a lot with _an existing annotation, and importing a gtf file into a TxDB object is easy._ I offer a solution using bedtools instead, which is simpler:

bedtools intersect -wa -wb -a segmented_file -b reference_file | awk -F'\t' {$1,$2,$3,$7} | bedtools merge -c 4 -o distinct -i

Bedtools intersect gets your data in the format chrS startS EndS chrR startR endR geneR, for each overlap. This is not a bed file, so we use awk to discard unwanted columns and keep chrS startS EndS geneRfor each overlap. We now need to merge all the genes belonging to a given segmented file interval, which is done with merge.
You also need to have your data in bed format (tab separated), and sorted
Also, see https://www.biostars.org/p/57720/ where they use information on genes strand. You can also find how to convert from bed format to gtf. Merging with a reference annotation can give you the CDS and START/STOP codon positions.

ADD COMMENTlink 4.8 years ago cyril-cros • 890

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.0