RNA secondary structure - Feature extraction
16 months ago
NIT Calicut

Hi all, for coding/non-coding RNA identification (using a machine learning classifier), I would like to add features extracted from RNA secondary structure. I used RNAfold to get the secondary structure from the primary sequence (in dot-bracket representation). Now I want to identify loops, stems, bulges, etc. in the structure and represent them as a numerical feature vector.

• Is there any tool for this purpose?
• How can I identify the structural elements from dot-bracket notation?
• Is there a better numerical/vector representation of RNA secondary structure for machine learning applications?

For your first point, I have code for identifying secondary structure elements from base-pair lists (https://github.com/cschu/biolib, mdg_dt.py, don't judge me on the code :P) and I am quite certain I also have code lying around somewhere to convert bracket notation into such a list (just need to find it)... Alternatively, if you can program, you could write it yourself:

One option would be to iterate through the string, pushing the position of each opening bracket onto a stack. When a closing bracket is encountered, you pop the top element from the stack and store it together with the position of the closing bracket as a pair (position_open, position_close) in a list. The last step is to sort the list by position_open and process it with my code.
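
A minimal sketch of the stack approach described above, assuming a pseudoknot-free string made only of '(', ')' and '.'. The function name dot_bracket_to_pairs is made up for illustration and is not part of mdg_dt.py:

```python
def dot_bracket_to_pairs(structure):
    """Return the base pairs of a dot-bracket string as sorted (open, close) tuples (0-based)."""
    stack = []   # positions of opening brackets that are still unmatched
    pairs = []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError("unmatched ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
        elif char != ".":
            raise ValueError("unexpected character %r at position %d" % (char, pos))
    if stack:
        raise ValueError("unmatched '(' at position %d" % stack[-1])
    # Sort by the position of the opening bracket, as described in the text.
    return sorted(pairs)


print(dot_bracket_to_pairs("((.((...)).((...)).))"))
# [(0, 20), (1, 19), (3, 9), (4, 8), (11, 17), (12, 16)]
```

Positions are 0-based here; add 1 to both members of each pair if 1-based coordinates are needed.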

For your last point: have a look at this: https://www.ncbi.nlm.nih.gov/pubmed/19339518 (they used graph properties derived from the secondary structure as input for support vector machines).

ADDENDUM (edit): The processing above assumes a pure secondary structure without crossing or touching edges (RNAfold produces such a structure, I just wanted to mention it.)


Yeah, of course. Could you post your secondary structure, please? This code is quite old, so I need to get back into it...


Ok, please pull the code from GitHub again. I have modified it so that you can run it on a bracket string.

python2.7 mdg_dt.py "((.((...)).((...)).))"
[(0, 20), (1, 19), (3, 9), (4, 8), (11, 17), (12, 16)] <- output base pairs
['Multiloop', (1, 3), (9, 11), (17, 19)] <- shows the loop segments for each secondary structure motif*
['Hairpin', (4, 8)]
['Hairpin', (12, 16)]

* Loop segments are defined by a) the closing base pair and the unpaired bases for hairpins, and b) the individual bases of the closing base pair plus the unpaired bases in between for internal loops, bulges, and multiloops. Ask away if you have any more questions.

Thank you for your effort. Let me ask some basic questions (I am sorry if they are too naive).

1. Do base pairs correspond to stems in the secondary structure?
2. In the above example, can I say the structure contains 3 multiloops and 2 hairpins?
3. ['Hairpin', (4, 8)] means nucleotides from position 4 to 8 are part of a hairpin, right?
4. Is there any significance in representing each structure as a vector containing the frequencies of [stem, hairpin, internal loop, bulge]?

Can you suggest some materials to understand both the biological and computational perspectives of RNA secondary structure?


You're welcome.

1. Stems (also known as helices) are stable runs of consecutive base pairs. In most cases one would state that a stem has to contain at least two base pairs, as a single base pair is thermodynamically too unstable to form a stem (if such a base pair occurs in a secondary structure prediction, you might want to treat the two bases as unpaired).
2. In the above example, the structure contains 2 hairpins and one 2-branched multiloop (there are 2 stems exiting the multiloop) with 3 unpaired regions. An n-branched multiloop can have up to n+1 unpaired regions, one between each pair of neighboring stems, but it can also happen that there are no unpaired regions at all, e.g. ((((...))((...))((...)))).
3. Yes, but remember that positions 4 and 8 belong to the closing base pair of the hairpin.
4. I think that might have some value, depending on what you are studying. I need to think on this, but you should look up "RNA structural ensemble".
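
Point 1 can be sketched in Python as follows. This is only an illustration, assuming a sorted base-pair list like the one printed by the script above; the function name find_stems and the min_length parameter are hypothetical and not part of mdg_dt.py:

```python
def find_stems(pairs, min_length=2):
    """Group base pairs into stems, i.e. runs (i, j), (i+1, j-1), (i+2, j-2), ...

    Returns a list of (outermost_pair, number_of_stacked_pairs) tuples.
    Runs shorter than min_length are dropped, following the rule of thumb
    that a single isolated base pair is too unstable to count as a stem.
    """
    pair_set = set(pairs)
    stems = []
    for i, j in sorted(pairs):
        if (i - 1, j + 1) in pair_set:
            continue  # (i, j) is stacked inside another pair, so it does not start a stem
        length = 1
        while (i + length, j - length) in pair_set:
            length += 1
        if length >= min_length:
            stems.append(((i, j), length))
    return stems


# Base pairs of "((.((...)).((...)).))" as listed earlier in the thread.
pairs = [(0, 20), (1, 19), (3, 9), (4, 8), (11, 17), (12, 16)]
print(find_stems(pairs))
# [((0, 20), 2), ((3, 9), 2), ((11, 17), 2)]
```

The number of stems and their lengths could then go into a feature vector directly, or be summarized (for example as mean and maximum stem length).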

Right now, I don't have a source at hand. You might find some insight in the introduction parts of my master's/bachelor's theses here: http://bioinf.darkjade.net/thesis/ (or in the cited references). If more comes to mind, I'll get back to you.


Thank you for the detailed reply.

I started reading your master's thesis, and it helped me get a clearer understanding of secondary structure. I am interested in more details on the definition/identification of secondary structure motifs, mentioned in the second chapter. If I can represent the secondary structure as fragments of structural motifs, this can be a feature vector for classification (my study is about identifying coding/non-coding RNA by machine learning techniques).

1. Do you have an algorithm/implementation for motif identification?
2. Can two similar motifs have different sequence lengths?
3. You mentioned "fold classes". Could you please explain them? What are the different fold classes?

Hi,

1. Do you mean beyond what's done in the code I posted earlier?
2. (and 3.) According to the fold class definition, they cannot. However, I no longer view my definition of a fold class as valid: it required the same secondary structure motif with all unpaired regions and all stems being of exactly the same length, which would mean that there would be too many fold classes to comfortably handle. In RNA it is more interesting whether an n-multiloop is present in a structure than to exactly match the sequence lengths of all parts. Look up "RNAshapes" and its follow-up studies (https://www.ncbi.nlm.nih.gov/pubmed/16357029).
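
To illustrate the idea of ignoring lengths and keeping only the branching pattern, here is a rough sketch loosely modeled on the most abstract kind of shape string. It is not the RNAshapes tool and makes no claim to reproduce its output in every case; the function name abstract_shape is made up, and the input is assumed to be a balanced, pseudoknot-free dot-bracket string:

```python
def abstract_shape(structure):
    """Collapse a dot-bracket string to its helix nesting pattern, e.g. '[[][]]'."""
    # Map each opening position to its closing partner.
    stack, partner = [], {}
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            partner[stack.pop()] = pos

    def children(start, end):
        """Base pairs found at the top nesting level within positions start..end-1."""
        found, k = [], start
        while k < end:
            if k in partner:
                found.append((k, partner[k]))
                k = partner[k] + 1   # jump over everything enclosed by this pair
            else:
                k += 1               # unpaired base, ignored
        return found

    def shape_of(i, j):
        inner = children(i + 1, j)
        # Exactly one enclosed pair: stacking, bulge or internal loop.
        # These are merged into the same helix, whatever their lengths.
        while len(inner) == 1:
            i, j = inner[0]
            inner = children(i + 1, j)
        return "[" + "".join(shape_of(p, q) for p, q in inner) + "]"

    return "".join(shape_of(p, q) for p, q in children(0, len(structure)))


print(abstract_shape("((.((...)).((...)).))"))      # [[][]]
print(abstract_shape("((((...))((...))((...))))"))  # [[][][]]
```

Two structures that differ only in stem or loop lengths map to the same string, which is the property the strict fold class definition lacked.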

Thanks a lot!! Your support and suggestions are very useful and motivational.

Do you mean beyond what's done in the code I posted earlier?

I mean, what is the algorithm for identifying structural fragments (as in Fig. 2.7 in your master's thesis)? What I understand is that a fragment may contain more than a single motif (stem, hairpin, multiloop, ...). Am I right?

n-multiloops also seem interesting, but I couldn't find any reference to understand them. Could you please give any suggestions?


Hi,

I don't have the algorithm written out formally, but you can find the idea of it in the assemble() and find_all_motifs() functions in mdg_dt.py. While the output of the latter is different from what you see in my master's thesis, the general concept still applies. The reason for the difference in output is that mdg_dt.py was developed during my PhD, where I focused on loop structures.

My definitions from the master's work are a bit weak.

In general, the idea is that each RNA structure can, on the secondary structure level, be broken into paired regions (stems) and unpaired regions (loops). The traditional RNA secondary structure motifs* are composed of stems and loops. These are the hairpin: 1 stem terminated by one loop; the internal loop: 2 stems separated by 1 or 2 unpaired regions (with only 1 unpaired region it is a bulge, which is a special case of an internal loop); and the n-branched multiloop (or n-multiloop): n + 1 stems (n ≥ 2) connected by between 0 and n+1 unpaired regions.

*) And here it gets a bit complicated, as - depending on the source you use - a stem is counted as a motif. In that case, a fragment could contain more than one motif. However, if you view a stem as something more basic, then a structural fragment will only contain one secondary structure motif.
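
The motif definitions above can be turned into a small classifier. The following is a sketch only, not the logic of assemble() or find_all_motifs() in mdg_dt.py: it looks at each base pair, counts the base pairs it directly encloses, and names the loop accordingly. The function name classify_loops and the label "stack" (two directly adjacent base pairs, i.e. the inside of a stem) are assumptions for illustration; unpaired bases outside all base pairs are not reported.

```python
from collections import Counter


def classify_loops(pairs):
    """Return (loop_type, closing_pair, n_enclosed_pairs, n_unpaired) for every base pair."""
    pairs = sorted(pairs)
    partner = {i: j for i, j in pairs}
    loops = []
    for i, j in pairs:
        enclosed, unpaired, k = [], 0, i + 1
        while k < j:
            if k in partner:                 # a pair directly inside (i, j)
                enclosed.append((k, partner[k]))
                k = partner[k] + 1           # skip its interior
            else:
                unpaired += 1
                k += 1
        if not enclosed:
            kind = "hairpin"
        elif len(enclosed) == 1:
            p, q = enclosed[0]
            left, right = p - i - 1, j - q - 1   # unpaired bases on each side
            if left == 0 and right == 0:
                kind = "stack"
            elif left == 0 or right == 0:
                kind = "bulge"
            else:
                kind = "internal loop"
        else:
            kind = "multiloop"               # len(enclosed) exiting stems
        loops.append((kind, (i, j), len(enclosed), unpaired))
    return loops


# Base pairs of "((.((...)).((...)).))" as listed earlier in the thread.
pairs = [(0, 20), (1, 19), (3, 9), (4, 8), (11, 17), (12, 16)]
counts = Counter(kind for kind, _, _, _ in classify_loops(pairs))
features = [counts[k] for k in ("hairpin", "bulge", "internal loop", "multiloop")]
print(features)  # [2, 0, 0, 1]
```

For the example this gives 2 hairpins and 1 multiloop with 2 enclosed stems and 3 unpaired bases, in line with the answer given earlier in the thread.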

Only when you move on to tertiary structure will you deal with so-called composite motifs, which are two (or more) secondary structure motifs connected by a set of tertiary interactions (base pairs or base-backbone interactions).

Hope this makes sense.


Thank you for your clarification. This discussion helped me gain more insight into my problem. My study is about whether there exists any distinctive structural motif pattern in protein-coding RNAs and non-coding RNAs, so that we can use this feature for their identification using machine learning. Based on our discussion, I think the properties of loops/stems (length, nucleotide composition, ...) can be used as such features. Thank you.
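
As a rough illustration of such composition features, the sketch below computes the paired fraction and the GC content of paired and unpaired positions from a sequence and its dot-bracket string. The function name composition_features, the choice of these three features, and the toy sequence are all assumptions made for the example, not something proposed in the thread:

```python
def composition_features(sequence, structure):
    """Return [paired_fraction, gc_of_paired_bases, gc_of_unpaired_bases]."""
    if not sequence or len(sequence) != len(structure):
        raise ValueError("sequence and structure must be non-empty and of equal length")
    sequence = sequence.upper()
    paired = [base for base, s in zip(sequence, structure) if s in "()"]
    unpaired = [base for base, s in zip(sequence, structure) if s == "."]

    def gc_fraction(bases):
        # Fraction of G or C among the given bases; 0.0 for an empty group.
        return sum(base in "GC" for base in bases) / float(len(bases)) if bases else 0.0

    return [len(paired) / float(len(sequence)), gc_fraction(paired), gc_fraction(unpaired)]


# Toy hairpin, made up for illustration: a GC stem closing an AAA loop.
print(composition_features("GGGAAACCC", "(((...)))"))
```

The same split could be done per element (per stem or per loop) instead of over the whole structure if finer-grained features are wanted.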