Hello community,
I am new to R. I understand the basics of how objects work, but to analyse my data statistically I need to know much more than that. I have 75 bp reads that include hydroxymethylated cytosines (5hmC). From these reads I extracted the ones that start with CCGG, because the protocol is based on restriction digestion with MspI:
cat bsz_S2_R1_001.fastq | paste - - - - | awk -F '\t' '(substr($2,1,4)=="CCGG")' | tr "\t" "\n" > bsz_tagged.fastq
I then trimmed the adapters using trim_galore.
I aligned the trimmed reads to the reference genome using Bowtie2 (v2.3.1).
From the SAM file, using grep, I got a text file with the scaffold and the genomic position of each read.
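For reference, here is a minimal sketch of an equivalent extraction with awk (not necessarily the exact command I ran). It assumes a standard SAM layout, where column 3 (RNAME) is the scaffold and column 4 (POS) is the 1-based leftmost mapped position. The function name and the file names are made up for illustration.

```shell
# Print scaffold (RNAME, column 3) and leftmost mapped position (POS, column 4)
# for every aligned read in a SAM file.
# Header lines start with "@"; unaligned reads have "*" in column 3.
extract_positions() {
    awk -F '\t' '!/^@/ && $3 != "*" { print $3 "\t" $4 }' "$1"
}

# Hypothetical usage:
# extract_positions bsz_aligned.sam > bsz_positions.txt
```

Note that POS is the leftmost coordinate of the alignment, so for reads mapped to the reverse strand it does not correspond to the CCGG end of the read.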
Now I need to plot the distribution of the reads within the scaffolds. The same position value can occur on more than one scaffold, but it refers to a different location in the genome each time. So I need to order my data by scaffold, then by genomic position within each scaffold, and then count how often each position occurs within each scaffold.
Basically, I have a huge table with two columns and millions of rows. The image linked below shows two tables as an example of how I want to handle the data:
https://ibb.co/fBEitv (could not make the image work, sorry)
I would appreciate it if anyone could tell me which commands are essential, or point me to any package that can handle data in this way.
Cheers, Ioannis
Sounds like you wanna take a look at the GenomicRanges package and Rsamtools.
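If you only need the per-scaffold counts before moving to R, the ordering and counting you describe can also be sketched on the command line. This is a rough sketch, assuming your text file has two tab-separated columns (scaffold, position) and that scaffold names contain no whitespace. `count_positions` and the file names are made-up names.

```shell
# Input:  two tab-separated columns (scaffold, position), one row per read.
# Output: scaffold, position, number of reads at that position,
#         ordered by scaffold name and then numerically by position.
count_positions() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n "$1" |
        uniq -c |
        awk '{ print $2 "\t" $3 "\t" $1 }'
}

# Hypothetical usage:
# count_positions bsz_positions.txt > bsz_position_counts.txt
```

The resulting three-column table is much smaller than the raw list and can be read into R with read.table for plotting.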
I will have a look at the instructions! Thank you!