Tutorial:Using Bedtools Closest (closestBed) to make distance histograms of protein binding frequency near all types of pre-RNA junctions (e.g. exon-intron and intron-exon)
0
1
Entering edit mode
5.2 years ago
linuxborg2 ▴ 10

To plot distance histograms of protein binding sites to all types of pre-RNA junctions (intron-exon and exon-intron, etc.), bedtools closest requires post-processing to isolate pertinent information. A batch script that does that post-processing is linked below, and it also uses ngs.plot.r to create exon and genebody distance graphs.

EDIT: I have moved the source code to GitHub: https://github.com/ProfHariSeldon/closestBedPAR-CLIP

For instructions and a one-command batch script working example, download from directory: https://drive.google.com/open?id=1PnudvWR_xbyfqaJJve5R44PHeCyQBBRr file preRNAbinding.tar by right-clicking the TAR file and selecting "Download". Untarring that file creates a subdirectory preRNAbinding, which contains the contents of the TAR file.

To untar:

The TAR should be 1.5 GB.

tar -xvf preRNAbinding.tar
cd preRNAbinding
ls

Expect to see a working directory batchFile and an example of a working directory after a completed run "afterRunningBatchFile".

cd batchFile
ls
vi README.txt

This readme contains the contents of this forum post plus more background information.

batchFile contains the batch script preRNAbinding.s and the input files needed to generate output data files, after which the contents of directory batchFile should be the same as the contents of afterRunningBatchFile.

Step by step instructions to run preRNAbinding.s are self-documented in comments in the SYNOPSIS and PREREQUISITES sections. Batch file preRNAbinding.s post-processes the files in directory "batchFile" to create the data files that it passes to R Studio to plot the distance histograms. preRNAbinding.s also runs ngs.plot.r to create exon and genebody RNA binding protein binding frequency graphs.

To create the histograms: After running preRNAbinding.s, R Studio must be manually run to use the preRNAbinding.s output files to produce the HTML file containing the PNG histograms. Step by step instructions are documented in preRNAbinding.s.

Background: The plots of these calculations are included and explained in my masters thesis Lipscomb_Thomas_Thesis.pdf included in directory batchFile. To understand the structure of pre-RNA see Gene_structure_eukaryote_2_annotated.png in batchFile.

BEDTOOLS LIMITATIONS AND HOW SBATCH FILE preRNAbinding.s OVERCOMES THEM:

  • BEDTOOLS CLOSEST LIMITATIONS:

Limitation 1: Bedtools closest returns negative numbers for RNA binding protein binding sites upstream of a region and positive numbers for RNA binding protein binding sites downstream of a region (e.g. an exon and the regions upstream and downstream), but does not limit the results to only RNA binding protein binding sites within the regions flanking the center region (e.g. intron-exon-intron, the intronic regions flank the center exon region), which is what is needed. If the nearest RNA binding protein binding site is beyond the flanking region then that protein binding site is erroneously mixed in with the flanking region results (see Figure 7A of the thesis PDF Lipscomb_Thomas_Thesis.pdf in the batchFile folder). Also, bedtools closest returns a zero for distance if the RNA binding protein site and preRNA region overlap, instead of the needed distance to either the start or end of region, whichever is closer.

Limitation 2: Erroneous double-counting of QKI binding sites inside the region can happen if the region is short (e.g. exons can be short), causing the two junction histograms to overlap (see Figure 8A of the thesis PDF in the batchFile folder).

  • HOW BEDTOOLS CLOSEST LIMITATIONS ARE OVERCOME by preRNAbinding.s:

Solution to Limitation 1: When bedtools closest returns zero, signifying that the RNA binding protein bound within the region (e.g. the exon or the flanking introns), AWK code provided in preRNAbinding.s measures the distance to the start or the end of the region, whichever is closer (see Figure 7B of the thesis PDF in the batchFile folder). Looking within adjacent regions prevents erroneous inclusion of more distant protein binding sites.

Solution to Limitation 2: Split the center region in half and split the flanking regions in half. This prevents erroneous double-counting, because no matter what the nucleotide distance on the x-axis of the junction histogram is, the junction histogram is given only the distances that were closer to either the start or the end of the center region, whichever the histogram includes (see Figure 8B of the thesis PDF in the batchFile folder). This solution fully applied to intron-exon and exon-intron histograms. This solution only partially applied to the other histograms, which split only the center region in half because the size of the flanking regions is unknown.

AUTHOR:

Thomas H. Lipscomb In partial fulfillment of New York University 2018 MS thesis.

REPORTING BUGS:

Please report bugs to this forum post.

I may not be able to assist in debugging because I will no longer have access to the server I used to generate the example output directory afterRunningBatchFile.

COPYRIGHT: freeware copyleft

preRNA closestBed bedtools-closest bedtools • 2.3k views
ADD COMMENT
4
Entering edit mode

I see a few factors that make the code look suspicious/asks for a lot more than usual:

  1. Source code is on Google drive, not on GitHub or any other Git repo
  2. Source code is >1 GB
  3. Needs R Studio and not just R
  4. Unknown .s file
  5. Author explicitly states they'd not be able to assist in debugging
ADD REPLY
0
Entering edit mode

1: I have moved the source code to GitHub: https://github.com/ProfHariSeldon/closestBedPAR-CLIP

2: Source code is 129 KB (.s, .awk, .py, and .rmd), input files are 291 MB (batchFile folder). The output of a sample run is included so that you may verify that your run on the sample input is running correctly. Output of the sample run including temporary files that trace the computation's progress is 1.12 GB (afterRunningBatchFile folder).

3: To run R Studio in the command line:

$ module load rstudio/1.1.383

$ Rscript -e "knitr::stitch_rmd('Hg38_QKI_closest_dist_freqV9.Rmd')"

4: The .s file is the main batch shell script. The .sh suffix is the more popular suffix, and many people choose to use no suffix.

5: I no longer have access to the high performance computing center at NYU because I graduated, so I am unable to test any code changes. I can answer questions.

ADD REPLY

Login before adding your answer.

Traffic: 2199 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6