I have a list of ~1500 (over a few samples) chromosome:peakstart:peakend from my recent CLIP-seq experiment. I want to know:
1) What genes each peak maps to
2) What sequences are included in those peaks. Is there a quick way to do this? I only could figure out how to the full CDS, not just the peak.
I was able to find all the genes by inputting my peaks as a csv into UCSC Table Browser, but when I get an output it puts every transcript possible on different lines (ex if my peak mapped to gene X, I would get gene X transcript 1, gene X transcript 2, etc). I would like to copy back this list to my original data in excel, but now the lines don't match up. Does anyone have a recommendation?
I'd recommend converting your file into bed format then using bedtools to intersect with annotated genes (this intersection will give you the full gene length but it will include gene names). You can also run the bedtools getfasta command to convert the bed file into a FASTA file.
If this is CLIP-seq then you should also have strand information (if you performed a strand specific library prep, if you DO have strand information add a -s flag to the two bedtools commands), otherwise there will be ambiguity as to which gene your peak intersects with.
Convert from Chr:start:end to a bed format:
sed 'y/\:/\t/' peakFile.txt > peakFile.bed
Intersect with some annotated gene file (you should be able to get these from UCSC:
bedtools intersect -a annotatedGeneFile.bed -b peakFile.bed -wa -u > annotatedGenesThatOverlapPeaks.bed
Yes, the 6th column has + and - for both files. However I converted my hg19 file from a gtf to a bed to use for this. the hg19 bed file has 9 column. The first 6 looks as they should, but the last three saw things like "unknown exon" and "Parent=NR_204540" rather than "thickStart" , "thickEnd" and "itemRgb." DO you think that makes a difference.
The format looks right. Did you perform a strand-specific library prep and strand specific mapping of the libraries? It's possible that you mapped the libraries to the incorrect strand, try running with the -S flag (capital s), this will find peaks that map to the opposite strand.
If you do a line count of the intersectBed -s and intersectBed -S they should sum to intersectBed with no strand flag. if the -S flag works I'd wonder if there was an error in your original mapping step, if you still get no lines then there must be a formatting problem.
Thank you for your help! Very useful.
I got bedtools to work with your suggestions and ran the following command sucessfully:
bedtools intersect -wa -wb -a hg19.bed -b Peaks18.3.bed
However, I tried to force strandedness with "-s" and got nothing.
Do you know why adding -s would prevent it from working? I'm using bedtools/2.26.0.
Do you have strand information (either a
+
or a-
) in the 6th column for both bed files?Yes, the 6th column has + and - for both files. However I converted my hg19 file from a gtf to a bed to use for this. the hg19 bed file has 9 column. The first 6 looks as they should, but the last three saw things like "unknown exon" and "Parent=NR_204540" rather than "thickStart" , "thickEnd" and "itemRgb." DO you think that makes a difference.
It could, I would cut the file so that it's six columns. If that doesn't work then paste a line from each input and I'll have a look.
I changed hg19.bed to a six column file, but still couldn't get the -s option to work. Here's a line from each of my inputs:
hg19.bed: chr1 11873 14409 eID=DDX11L1 . +
Peaks18.3.bed: chr9 75785164 75785290 18.3 1 +
The format looks right. Did you perform a strand-specific library prep and strand specific mapping of the libraries? It's possible that you mapped the libraries to the incorrect strand, try running with the
-S
flag (capital s), this will find peaks that map to the opposite strand.If you do a line count of the
intersectBed -s
andintersectBed -S
they should sum tointersectBed
with no strand flag. if the-S
flag works I'd wonder if there was an error in your original mapping step, if you still get no lines then there must be a formatting problem.