Getting RNA Sequences From GFF And FA Files
10.7 years ago
MarkR • 0

Hi. I have a folder full of .fa files, and a .gff. The gff file contains information about which loci look like they code for RNA sequences. The .fa files contain the DNA sequences for a set of human chromosomes. I want to get all the sequences which code for RNA, as defined by the gff file, out of the DNA in the fasta files. I also have a file telling me which RNA types have higher priority (lincRNA is higher priority than miRNA, for example); this tells me which are more important and how I should decide between RNAs for overlapping reads in the gff.

I have been trying to code my own little program in F# that will read these files and give me each RNA read defined in the gff, and its corresponding DNA. However, I am a bit confused about how it works. Do the start and end of each feature in the gff file refer to character positions in the corresponding .fa file? Are they 1- or 0-indexed? Does it matter which strand they are on ('+' or '-') for my purposes?

Ultimately my goal is to get a bunch of RNAs with their corresponding types (miRNA, lincRNA, snRNA... etc) to do some computations on.

My question is this: what is the easiest way to get it out of the data I have?

The data I am using is freely available here: http://wanglab.pcbi.upenn.edu/coral/ under the heading "Annotation packages" if anyone is interested or needs specifics.

Thank you!

gff rna samtools bedtools • 3.2k views
10.7 years ago
Malcolm.Cook ★ 1.5k

Perhaps my answer to "A: Extract cds fastas from a gff annotation + reference sequence" will serve your purposes?


Thank you! But I don't think this will break ties between overlapping reads? Also, I would prefer to stick with F#. I'm not looking for a pre-baked solution; I want to know the structure of fa and gff files so I can code it myself.
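For reference on the file structure: GFF start and end columns are 1-based and inclusive, and the strand column matters, because a feature on '-' is read from the reverse complement of the reference. A FASTA file is a '>' header line followed by sequence lines that may be wrapped. Below is a minimal sketch in Python (the logic should port directly to F#). It assumes the RNA type sits in the third GFF column and that the first GFF column matches the first word of a FASTA header; both are assumptions to check against the actual Coral annotation files. The function names and toy data are made up for illustration.

```python
import io

# Complement table for the reverse strand; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def read_fasta(handle):
    """Return {sequence_name: sequence} from a FASTA file handle."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Assumption: the sequence name is the first word of the header.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)  # sequence may be wrapped over many lines
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def read_gff(handle):
    """Yield (seqid, feature_type, start, end, strand) for each feature line."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        yield cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6]

def feature_sequence(seqs, seqid, start, end, strand):
    """Extract a feature; GFF coordinates are 1-based and inclusive."""
    dna = seqs[seqid][start - 1:end]
    if strand == "-":
        dna = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    return dna

# Toy data standing in for the real .fa and .gff files.
fasta = io.StringIO(">chr1 toy\nACGTAC\nGTTTGA\n")
gff = io.StringIO(
    "chr1\ttoy\tmiRNA\t2\t5\t.\t+\t.\tID=a\n"
    "chr1\ttoy\tsnRNA\t7\t12\t.\t-\t.\tID=b\n"
)

seqs = read_fasta(fasta)
results = [
    (ftype, feature_sequence(seqs, seqid, start, end, strand))
    for seqid, ftype, start, end, strand in read_gff(gff)
]
print(results)  # [('miRNA', 'CGTA'), ('snRNA', 'TCAAAC')]
```

The extracted strings are DNA; replace T with U afterwards if the downstream computation needs RNA letters.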


Hmmm, your Q said nothing about "reads"... Anyway, good luck with the F#. You might want to reword your Q to indicate this is required.


"I also have a file telling me which RNA types have higher priority (lincRNA is higher priority than miRNA, for example); this tells me which are more important and how I should decide between RNAs for overlapping reads in the gff"
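One possible reading of that priority rule, sketched in Python (again easy to port to F#): visit features from highest to lowest priority and keep a feature only if it does not overlap one already kept. This is an assumption about what "deciding between RNAs" should mean, not the method used by the Coral package. The sketch also ignores strand when testing overlap, and the dictionary layout and example coordinates are invented for illustration.

```python
def overlaps(a, b):
    """True when two features on the same sequence share at least one base.

    Coordinates are treated as 1-based and inclusive, as in GFF.
    """
    return (
        a["seqid"] == b["seqid"]
        and a["start"] <= b["end"]
        and b["start"] <= a["end"]
    )

def resolve_by_priority(features, priority):
    """Keep the highest-priority feature among any overlapping group.

    `priority` lists RNA types from most to least important. Types missing
    from the list are treated as lowest priority (an assumption).
    """
    rank = {rna_type: i for i, rna_type in enumerate(priority)}
    kept = []
    for feat in sorted(features, key=lambda f: rank.get(f["type"], len(rank))):
        # Quadratic scan: fine for a sketch, slow for a whole genome.
        if not any(overlaps(feat, k) for k in kept):
            kept.append(feat)
    return sorted(kept, key=lambda f: (f["seqid"], f["start"]))

priority = ["lincRNA", "miRNA", "snRNA"]
features = [
    {"seqid": "chr1", "type": "miRNA", "start": 450, "end": 520},
    {"seqid": "chr1", "type": "lincRNA", "start": 100, "end": 500},
    {"seqid": "chr1", "type": "snRNA", "start": 600, "end": 650},
    {"seqid": "chr2", "type": "miRNA", "start": 450, "end": 520},
]

kept = resolve_by_priority(features, priority)
for feat in kept:
    print(feat["seqid"], feat["type"], feat["start"], feat["end"])
```

Here the chr1 miRNA is dropped because it overlaps the higher-priority lincRNA, while the chr2 miRNA survives because it sits on a different sequence. For real annotation sizes, sorting by position and sweeping, or using an interval tree, would avoid the quadratic scan.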
