closest-Features or closestBed Output to find genes downstream of coordinates
1
0
Entering edit mode
7.0 years ago

Hi all,

I have a set of coordinates in sorted bed format (created by me) along with the set of Ensembl and Refseq genes, also in sorted bed format (from Galaxy).

I would like to find the genes that are immediately downstream of these coordinates. Basically, I believe that the sequence coordinates I have are promoter sequences and I would like to find the genes that are controlled by these elements. I have almost 300,000 sequences, so there may be multiple sequences that map to the same gene. The sequences range in length from 40-150 bases.

Based on my prior search on Biostars, I found that the closest-Features and closestBed packages could give me the genes. N Closest Genes To A Given Location

I have the following questions:

  1. There is no documentation regarding the output of the program. An example of my output is below:

chr1 9999 10524 chr1 11868 14409 ENST00000456328 0 + 14409 14409 0 3 359,109,1189, 0,744,1352, chr1 15058 15238 chr1 14410 29806 ENST00000538476 0 - 29806 29806 0 13 92,39,106,44,139,18,198,132,141,147,99,155,273, 0,589,1385,1493,2196,2337,2447,2822,3191,3504,3857,10326,15123, chr1 21143 21593 chr1 14410 29806 ENST00000538476 0 - 29806 29806 0 13 92,39,106,44,139,18,198,132,141,147,99,155,273, 0,589,1385,1493,2196,2337,2447,2822,3191,3504,3857,10326,15123,

Due to the lack of headers, I cannot make out what the last 3 columns mean. I'm assuming the 4,5,6,and 7th columns indicate the closest transcript

  1. Is there a way I can directly map the sequences to the genes only directly without having to additionally map the transcript IDs to gene IDs and names?

I hope you can help me understand this. Thank you in advance! Sub

bedops gene sequence bedtools • 2.6k views
ADD COMMENT
0
Entering edit mode
7.0 years ago

BEDOPS bedmap will help you map intervals to annotations. BEDOPS closest-features will help you find the closest element(s) to your intervals of interest. You could use either to solve your problem by piping to awk, probably, but I don't really understand your question. Define a clear input and output and I'd be happy to take a stab at it.

ADD COMMENT
0
Entering edit mode

Those are links to documentation, by the way. So if there is something unclear about what you're getting, please do include your inputs and your runtime options. Otherwise, it's rather difficult to help you.

ADD REPLY
0
Entering edit mode

I have a set of coordinates in a bed file seqlist.bed, as follows:

chr1 9999 10524

chr1 15058 15238

chr1 21143 21593

I would like to find the genes that are downstream of this sequence. I suspect that the sequence coordinates are in the promoter region, so I would like to find the genes that are controlled by them.

I downloaded a bed file (genes.bed) containing transcript information from UCSC table browser (hg19- Ensembl genes)

Group: Genes and Gene Predictions

Track: UCSC Genes

Region: Genome

I have used closestbed to compare the coordinates in the seqlist.bed file with the genes.bed file using the following command: closestbed -a seqlist.bed -b genes.bed > closestBedOut.txt

The output for this command is (pipe | added to separate columns):

chr1 | 9999 | 10524 | chr1 | 11868 | 14409 | ENST00000456328 | 0 | + | 14409 | 14409 | 0 | 3 | 359,109,1189, 0,744,1352,

chr1 | 15058 | 15238 | chr1 | 14410 | 29806 | ENST00000538476 | 0 | - | 29806 | 29806 | 0 | 13 | 92,39,106,44,139,18,198,132,141,147,99,155,273, | ,589,1385,1493,2196,2337,2447,2822,3191,3504,3857,10326,15123,

Please let me know if you need further information.

Thanks, Sub

ADD REPLY
1
Entering edit mode

A "closest element" might be very near (likely to be in a proximal promoter) or very far away (within a distal promoter, perhaps, but more ambiguous).

Instead of the closest element, you might define a promoter region to be "proximal", say 500 or 1000 bases upstream of the interval's start ("TSS" for a forward-strand oriented interval) or downstream of its stop ("TSS" for a reverse-strand oriented interval). Then you would do set operations over that range.

The following example returns all genes that overlap the 1000nt region upstream of the intervals, assuming strand information is available in the sixth column of a UCSC-formatted BED file:

$ awk '$6=="+"' intervals.bed | sort-bed - > intervalsFs.bed
$ bedops --range -1000:0 --everything intervalsFs.bed | bedmap --echo-map - genes.bed | sort-bed - | bedops --not-element-of 1 - intervalsFs.bed > answerFs.bed

The following example does the inverse, for reverse-stranded intervals:

$ awk '$6=="-"' intervals.bed | sort-bed - > intervalsRs.bed
$ bedops --range 0:1000 --everything intervalsRs.bed | bedmap --echo-map - genes.bed | sort-bed - | bedops --not-element-of 1 - intervalsRs.bed > answerRs.bed

Then you can union the two answers, for convenience:

$ bedops --everything answerFs.bed answerRs.bed > answer.bed

You could do something similar with closest-features to get the nearest element to the TSS, but I suspect it may be more useful to define a promoter explicitly (i.e., as some range upstream of the TSS) and query for all the genes that overlap an interval's promoter region.

ADD REPLY
0
Entering edit mode

Thank you very much for the reply.

Yes, explicitly setting the promoter distance sounds like a perfect solution. However, I have a question regarding the awk statement.

I assume that this statement adds "+" or "-" as the 6th column to my seqlist.bed (interval.bed) file, in which I have 3 columns as mentioned in my previous post. Upon execution, I do not see any addition to the seqlist.bed file, which still has 3 columns (chromosome, start pos and end pos). I checked my awk and it seems to work okay (I am able to print the columns, etc.).

Can you please elaborate a little on what might be wrong?

EDIT: I found that there may be an issue with using pipe with the output of awk in the bash environment. Could this be what is happening in my case? I found several references to problems with using grep and awk together in the bash environment.

Thanks, Sub

ADD REPLY
0
Entering edit mode

If an interval has no strand information, I would be unsure where the TSS is of that interval (and how to define the promoter region). I always thought a promoter would be upstream of a TSS, but I may be wrong about that. So if your intervals lack strand information, how are you defining a promoter?

ADD REPLY
0
Entering edit mode

I have defined some short sequences which have been taken from several regions in the genome in the seqlist.bed file. Some of these sequences may fall within the promoter region, and some may not.

Like you have suggested, I would consider 1000 bp upstream of the TSS and 100 bp downstream to constitute the putative promoter region. So, I would like to find the sequences in the seqlist.bed file that fall within the -1000 to +100 region of any gene.

The genes.bed file has the gene start/end position and TSS position information. So, I assume that the pipeline for searching would be to compare each row in the seqlist.bed file with the TSS information of every row in the genes.bed file. Then, I can retrieve the sequence coordinates from the seqlist.bed file and the corresponding gene name and TSS information from the genes.bed file.

Please let me know if this sounds good.

Thanks, Sub

ADD REPLY
0
Entering edit mode

I have added the strand information manually to the intervals. I have already categorized the intervals based on the strands, so I have a set of coordinates for the + strand and another for the - strand.

I was asking the previous question specifically regarding awk, because I could not add the column to the bed file using the awk command. Instead, I used excel to add the strand information, saved it as a .bed file and then used dos2unix to convert it to Linux compatible format.

Thanks, Sub

ADD REPLY

Login before adding your answer.

Traffic: 1335 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6