Question

How to process reverse sequences?

8.7 years ago

fr ▴ 210

Hello!

I'm processing a bunch of .gff and genbank files to extract some intergenic sequences. Some are forward, others are reverse. But I don't know if the reverse sequences extracted should be inverted? Or something else? How can I process them in order to then extract equivalent subsequences between the leading strand and the reverse strand?

(If they were only coding sequences, what would you need to do?

Note, my question is not so much on how to do something, I just don't know what kind of approach is taken in these sequences).

Thanks

genome sequence • 2.7k views

updated 18 months ago by Ram 43k • written 8.7 years ago by fr ▴ 210

Ram · Answer 1 · 2015-07-28

Hi r.b.,

As far as I know, genes on the reverse strand can be treated as running in the opposite direction along the reference (3' <----- 5'), and their sequence is the reverse complement of what the reference shows:

Ref:                           ATGCCCTAGTAACCGCTAGGGCAT
upstream gene (ATGCCCTAG)      ATGCCCTAG ->
downstream gene (ATGCCCTAG)                <- CTAGGGCAT

So then it kinda depends on what you want to do with the intergenic sequences.

If you just want the sequences, you can take Ref[1-9] for gene 1 and Ref[24-16] for gene 2 (1-based, inclusive). Extract those from your strands and you should be done. If you want, you can add a flag to the header that tells you the gene is on the reverse strand, or you could just reverse-complement the sequence so that it starts with ATG again.

When I did something like this, I just reverse-complemented the reverse-strand sequences before writing them to a FASTA file. As long as you keep the same names, you can always find out what the original orientation of the gene was.
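
A minimal sketch of that approach in plain Python, assuming a made-up toy reference in which the second gene sits on the reverse strand. The names `REF`, `reverse_complement` and `fasta_record`, and the header layout, are illustrative assumptions rather than any standard tool or format.

```python
# Hypothetical toy reference: gene 1 is on the forward strand (1-9),
# gene 2 is encoded on the reverse strand (16-24).
REF = "ATGCCCTAGTAACCGCTAGGGCAT"

# Translation table that maps each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def fasta_record(name: str, start: int, end: int, strand: str, ref: str = REF) -> str:
    """Build a FASTA record for ref[start..end] (1-based, inclusive).

    Reverse-strand features are reverse-complemented, and the original
    strand and coordinates are kept in the header (an assumed layout)
    so the orientation can always be recovered later.
    """
    seq = ref[start - 1:end]
    if strand == "-":
        seq = reverse_complement(seq)
    return f">{name} strand={strand} coords={start}-{end}\n{seq}"


print(fasta_record("gene1", 1, 9, "+"))
print(fasta_record("gene2", 16, 24, "-"))
```

Both records come out as `ATGCCCTAG`, and the `strand=` field in the header still records which one came from the reverse strand.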

Ram · Answer 2 · 2015-07-28

If you extract a sequence by genomic coordinates (e.g. using a GFF to pull sequence from a FASTA) and the feature is on the negative strand with respect to the reference genome, you need to reverse-complement it. This doesn't depend on what the region is annotated as. However, coding sequences (and sometimes transcript sequences) are normally provided already in the 'correct' orientation ('spliced and ready to translate'), and they do not contain intergenic or intronic sequence.
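
As a rough illustration of coordinate-based extraction, the sketch below reads the start, end and strand columns of GFF-style lines (tab-separated, 1-based inclusive coordinates, strand in column 7) and reverse-complements anything on the `-` strand. The genome dictionary, the GFF lines and the helper names are invented for the example; real files would need comment-line handling and a proper FASTA reader.

```python
from typing import Dict, NamedTuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


class Feature(NamedTuple):
    seqid: str
    start: int   # 1-based, inclusive (GFF convention)
    end: int     # 1-based, inclusive
    strand: str  # "+", "-" or "."


def parse_gff_line(line: str) -> Feature:
    """Pick seqid, start, end and strand out of one tab-separated GFF line."""
    cols = line.rstrip("\n").split("\t")
    return Feature(cols[0], int(cols[3]), int(cols[4]), cols[6])


def extract(feature: Feature, genome: Dict[str, str]) -> str:
    """Return the feature's sequence, read 5'->3' on its own strand."""
    seq = genome[feature.seqid][feature.start - 1:feature.end]
    if feature.strand == "-":
        # Same rule for genes, intergenic regions or anything else.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


# Hypothetical toy data.
GENOME = {"chr_toy": "ATGCCCTAGTAACCGCTAGGGCAT"}
GFF_LINES = [
    "chr_toy\texample\tgene\t16\t24\t.\t-\t.\tID=gene2",
    "chr_toy\texample\tintergenic_region\t10\t15\t.\t-\t.\tID=ig1",
]

for line in GFF_LINES:
    print(extract(parse_gff_line(line), GENOME))
```

The second line is an intergenic region handled exactly like the gene, which matches the point that the rule does not depend on the annotation type.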

A lot of tools search the reverse complement in addition to the input by default, so you often don't need to worry about this. For sequence similarity searches using e.g. blastn, the reverse complement should always be included.
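
To make the "search both strands" idea concrete, here is a small exact-match sketch. It is not how blastn works internally; it only shows that a query can be found on either strand by also looking for its reverse complement. The function names and the toy subject sequence are assumptions made for the example.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_both_strands(query: str, subject: str) -> List[Tuple[int, str]]:
    """Exact-match search on both strands of `subject`.

    Returns (start, strand) pairs, where start is 1-based in forward
    coordinates of the subject. A query equal to its own reverse
    complement would be reported once per strand.
    """
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        pos = subject.find(pattern)
        while pos != -1:
            hits.append((pos + 1, strand))
            pos = subject.find(pattern, pos + 1)
    return hits


# Hypothetical toy subject with the same gene on both strands.
SUBJECT = "ATGCCCTAGTAACCGCTAGGGCAT"
print(find_both_strands("ATGCCCTAG", SUBJECT))
```

On this toy subject the query is reported once on the `+` strand at position 1 and once on the `-` strand at position 16.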