R: How to loop to extract all sequences within specific positions?

7.4 years ago

francois ▴ 80

In RStudio, I want to extract every sequence on a chromosome that lies between 2000 bp upstream (-2000) and 500 bp downstream (+500) of a TSS.

I have already computed the start/end positions of all the sequences I want (let's only consider the plus strand for now):

promoters_plus <- data.frame (chr = 'chr21', prom_start = plus$start - 2000, prom_end = plus$start + 500, strand = '+', id = plus$id)

I am able to extract specific sequences manually like this (for example, with the 5th row):

start = promoters_plus$prom_start [5]
stop = promoters_plus$prom_end [5]
subseq (chr21, start, stop)

But I do not know how to loop correctly to get all of them (ideally into a data frame).

I have tried naively doing this:

for (i in 1:nrow (promoters_plus)) {
  start = promoters_plus$prom_start[i]
  stop = promoters_plus$prom_end[i]
  seq[i] = subseq (chr21, start, stop)
}

But it returns this error:

Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript contains NAs or out-of-bounds indices

Can you help?

R sequence • 3.0k views

ADD COMMENT • link updated 6.7 years ago by megha.lal • 0 • written 7.4 years ago by francois ▴ 80

What is chr21? Is it a GRanges object?

ADD REPLY • link 7.4 years ago by poisonAlien ★ 3.2k

No, a DNAString object (the sequence of chromosome 21).

ADD REPLY • link 7.4 years ago by francois ▴ 80
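
Since chr21 is a DNAString, one possible way to avoid the explicit loop is to build all the windows at once with Views() from Biostrings and convert them to a DNAStringSet. The sketch below is only an illustration under assumptions: the toy chr21 and the promoters_plus values are made up so the example runs on its own, and clamping the coordinates to the chromosome bounds is my own addition (windows that start before position 1 or end past the chromosome end are a plausible, but unconfirmed, source of the out-of-bounds error).

```r
library(Biostrings)

# Toy stand-ins for the objects in the question (hypothetical values).
chr21 <- DNAString(paste(rep("ACGT", 1000), collapse = ""))  # 4000 bp
promoters_plus <- data.frame(
  chr        = "chr21",
  prom_start = c(1, 501, 3801),
  prom_end   = c(2501, 3001, 6301),  # last window runs past the end
  strand     = "+",
  id         = c("geneA", "geneB", "geneC")
)

# Assumption: clamp windows so they stay inside the chromosome.
chr_len <- length(chr21)
promoters_plus$prom_start <- pmax(promoters_plus$prom_start, 1)
promoters_plus$prom_end   <- pmin(promoters_plus$prom_end, chr_len)

# Build all windows in one call, then turn the views into a DNAStringSet.
v <- Views(chr21,
           start = promoters_plus$prom_start,
           end   = promoters_plus$prom_end)
seqs <- DNAStringSet(v)
names(seqs) <- promoters_plus$id

# Optionally keep the sequences as a character column in the data frame.
promoters_plus$sequence <- as.character(seqs)
```

Here seqs holds one sequence per row of promoters_plus, and it can be written out with writeXStringSet() if a FASTA file is needed. For minus-strand promoters the extracted sequences would additionally need reverseComplement().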

This would be so easy using bedtools getfasta, so if you don't absolutely want to use R for it...

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k
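
If the bedtools route is acceptable, the coordinates already computed in R can be exported as a BED file first. The following is a minimal sketch, not a tested pipeline: the promoters_plus values and the file names are placeholders, and it assumes prom_start and prom_end are 1-based inclusive positions, so the start is shifted by one because BED starts are 0-based.

```r
# Toy stand-in for the data frame in the question (hypothetical values).
promoters_plus <- data.frame(
  chr        = "chr21",
  prom_start = c(1, 7501, 30001),
  prom_end   = c(2501, 10001, 32501),
  strand     = "+",
  id         = c("geneA", "geneB", "geneC")
)

# BED6 layout: chrom, chromStart (0-based), chromEnd, name, score, strand.
# as.integer() keeps write.table() from using scientific notation.
bed <- data.frame(
  chrom      = promoters_plus$chr,
  chromStart = pmax(as.integer(promoters_plus$prom_start) - 1L, 0L),
  chromEnd   = as.integer(promoters_plus$prom_end),
  name       = promoters_plus$id,
  score      = 0,
  strand     = promoters_plus$strand
)

bed_file <- file.path(tempdir(), "promoters_plus.bed")
write.table(bed, bed_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Then, outside R (chr21.fa and the output name are placeholders):
#   bedtools getfasta -fi chr21.fa -bed promoters_plus.bed -s -name -fo promoters_plus.fa
```

The -s option makes bedtools reverse-complement minus-strand intervals, which would matter once the minus strand is included.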

Hi,

I'm facing a similar issue. Were you able to resolve the error?

ADD REPLY • link 6.7 years ago by megha.lal • 0

You could use samtools faidx-indexed FASTA files with a BED-to-FASTA script. For an example, see: A: Searching promoter consensus in a gene

ADD REPLY • link 6.7 years ago by Alex Reynolds 35k
