Question

Managing data in R

7.1 years ago

ioannis ▴ 50

Hello community,

I am new to R. I understand the basics of how objects work, but to analyse the statistics of my data I need to know much more than that. I have 75 bp reads that include hydroxymethylated cytosines (5hmC). From these reads I extracted the ones that start with CCGG, because the protocol is based on restriction digestion with MspI.

cat bsz_S2_R1_001.fastq | paste - - - - | awk -F '\t'  '(substr($2,1,4)=="CCGG")' | tr "\t" "\n" > bsz_tagged.fastq

I then trimmed the adapters with trim_galore and aligned the reads to the reference genome with Bowtie2-2.3.1. From the SAM file, using grep, I produced a text file with the scaffold and the genomic position of each read.
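
A minimal sketch of that extraction step, assuming a standard SAM file in which column 3 holds the scaffold name and column 4 the 1-based leftmost aligned position. It uses awk rather than grep, which is a swapped-in approach and not necessarily the command used here. The function name `sam_to_positions` and the demo file are made up for illustration.

```shell
# Print "scaffold<TAB>position" for every aligned read in a SAM file.
# In SAM, column 3 (RNAME) is the reference name and column 4 (POS) is the
# 1-based leftmost aligned position. Header lines start with "@", and
# unaligned reads have "*" in column 3.
sam_to_positions() {
    awk -F '\t' '!/^@/ && $3 != "*" { print $3 "\t" $4 }' "$1"
}

# Made-up three-read SAM file so the sketch runs on its own.
tmpdir=$(mktemp -d)
{
    printf '@SQ\tSN:scaffold_1\tLN:1000\n'
    printf 'r1\t0\tscaffold_1\t100\t42\t5M\t*\t0\t0\tCCGGA\tIIIII\n'
    printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tCCGGT\tIIIII\n'
    printf 'r3\t16\tscaffold_2\t100\t42\t5M\t*\t0\t0\tTCCGG\tIIIII\n'
} > "$tmpdir/demo.sam"

# Keeps r1 and r3, drops the header line and the unaligned read r2.
sam_to_positions "$tmpdir/demo.sam" > "$tmpdir/positions.txt"
cat "$tmpdir/positions.txt"
```

One thing to keep in mind: for reads aligned to the reverse strand (flag 16), column 4 is still the leftmost aligned base, not the first base of the read, which may matter if read starts are meant to mark MspI sites.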

Now I need to plot the distribution of the reads within the scaffolds. However, the same position value can occur in more than one scaffold, where it refers to a different location in the genome. So I need to sort my data by scaffold, then by genomic position within each scaffold, and then count how often each position value occurs within each scaffold.
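
The sort-then-count steps can also be sketched on the command line before anything is loaded into R, which shrinks the table to one row per distinct scaffold and position. This is only a sketch: it assumes a tab-separated file with the scaffold in column 1 and the position in column 2, and the function name `count_positions` and the demo data are hypothetical.

```shell
# Sort by scaffold (column 1), then numerically by position (column 2),
# and count identical scaffold/position pairs.
# Output columns: scaffold, position, count (tab-separated).
count_positions() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n "$1" |
        uniq -c |
        awk '{ print $2 "\t" $3 "\t" $1 }'
}

# Made-up input: position 100 appears in both scaffolds and twice in scaffold_1.
workdir=$(mktemp -d)
{
    printf 'scaffold_2\t50\n'
    printf 'scaffold_1\t300\n'
    printf 'scaffold_1\t100\n'
    printf 'scaffold_2\t100\n'
    printf 'scaffold_1\t100\n'
} > "$workdir/positions.txt"

count_positions "$workdir/positions.txt" > "$workdir/counts.txt"
cat "$workdir/counts.txt"
# scaffold_1	100	2
# scaffold_1	300	1
# scaffold_2	50	1
# scaffold_2	100	1
```

Because the scaffold is part of the sort key and of the line that `uniq -c` compares, position 100 in scaffold_1 and position 100 in scaffold_2 are counted separately. The resulting three-column file should be small enough to read into R with `read.table` for plotting.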

Basically, I have a huge table with two columns and millions of rows. The image linked below shows two tables as an example of how I want to transform the data:

https://ibb.co/fBEitv (could not make the image work, sorry)

I would appreciate it if anyone could tell me which commands I need, or point me to a package that can handle data in this way.

Cheers, Ioannis

R next-gen sequencing • 1.2k views

7.1 years ago by ioannis ▴ 50

Sounds like you want to take a look at the GenomicRanges and Rsamtools packages.

7.1 years ago by Benn 8.3k

I will have a look at the instructions! Thank you!

7.1 years ago by ioannis ▴ 50