Associated Data Coloring

Introduction

Helices: In the ribosome, helices are structures formed by rRNA through complementary base pairing. They stabilize the RNA's secondary and tertiary structure.
Ancestral Expansion Segments (AESs): Segments within RNA sequences that have expanded over evolutionary time, likely through duplication or amplification events. These segments are added stepwise to the growing rRNA core.
Phase: Reading frame or coding frame of RNA sequences, influencing protein translation. Dictates how sequences are interpreted, impacting protein structure and function.
All of the above are structural elements of the ribosome and are referred to collectively as such below. The secondary structure of a selected organism is either fetched from the API or generated by R2DT, then mapped to the MSA. Structural boundaries are inferred by mapping the structure to a properly annotated anchor sequence (H. sapiens in this case). Many regions remain unresolved, however, because not every region of the ribosome is present in a much simpler organism (for instance, T. thermophilus). In the image, the uncolored (black) nucleotides are the regions whose helical definitions are unresolved in this structure. Some regions may also receive incorrect boundary definitions, as shown in the figure where the two nucleotides of a base pair are assigned to different helices.

Ribosome as Graph

We begin by reimagining the secondary (and 3D) structure as an undirected graph in which the nucleotides are the vertices and the bonds (hydrogen bonds and phosphodiester bonds) are the edges. Some hydrogen bonds are non-nested; this approach ignores them by computing the Euclidean distance between the bonded nucleotides and filtering out the longer ones.
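
The graph construction described above could be sketched as follows. This is only a minimal illustration: the function name build_graph, the coordinate format (one point per nucleotide, taken from the 2D layout or the 3D structure), and the max_pair_dist cutoff are assumptions made for the example, not names or values from the project.

```python
import math

def build_graph(n, base_pairs, coords, max_pair_dist):
    """Build an undirected graph as {vertex: set of neighbours}.

    n             -- number of nucleotides (vertices 0..n-1)
    base_pairs    -- iterable of (i, j) hydrogen-bonded positions
    coords        -- coords[i] is the position of nucleotide i (hypothetical input)
    max_pair_dist -- hypothetical cutoff; longer hydrogen bonds are dropped
    """
    adj = {i: set() for i in range(n)}
    # Phosphodiester (backbone) edges connect consecutive nucleotides.
    for i in range(n - 1):
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    # Hydrogen-bond edges are kept only when the two nucleotides are close.
    for i, j in base_pairs:
        if math.dist(coords[i], coords[j]) <= max_pair_dist:
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

The adjacency-set representation keeps degree lookups (len(adj[v])) and neighbour traversal cheap, which is what the later steps rely on.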

Nearest Neighbours

Input: Graph G representing RNA structure.
Initialize an empty list to store colored vertices.
Traverse the graph in two passes:
a. Traverse from the first colored vertex to the end in the forward direction.
b. Traverse from the first colored vertex to the beginning in the reverse direction.
For each uncolored vertex encountered:
a. Identify its neighbors with a shortest path of 4 edges.
b. Compile a list of colors from these neighbors.
c. Select the color with the highest frequency.
d. Color the vertex with this selected color.
Repeat the traversal and coloring steps until all vertices are colored.
Output: List of colored vertices representing resolved nucleotides.
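
The steps above can be sketched as a breadth-first search combined with a majority vote. This is a hedged illustration rather than the project's implementation: it assumes vertices are numbered 0..n-1 in sequence order, that uncolored vertices are stored as None, and that "a shortest path of 4 edges" means at most 4 edges away. The names neighbors_within and color_nearest are hypothetical.

```python
from collections import Counter, deque

def neighbors_within(adj, start, radius):
    """Return all vertices reachable from start in at most radius edges."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        v, d = queue.popleft()
        if d == radius:
            continue  # do not expand beyond the radius
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                found.append(w)
                queue.append((w, d + 1))
    return found

def color_nearest(adj, colors, radius=4):
    """Fill uncolored vertices with the most frequent colour nearby."""
    colors = dict(colors)  # work on a copy, leave the input untouched
    n = len(adj)
    colored = [v for v in range(n) if colors.get(v) is not None]
    if not colored:
        return colors
    first = colored[0]
    # Forward pass from the first coloured vertex, then the reverse pass.
    order = list(range(first, n)) + list(range(first - 1, -1, -1))
    changed = True
    while changed:  # repeat until a full sweep colours nothing new
        changed = False
        for v in order:
            if colors.get(v) is not None:
                continue
            votes = Counter(
                colors[w]
                for w in neighbors_within(adj, v, radius)
                if colors.get(w) is not None
            )
            if votes:
                colors[v] = votes.most_common(1)[0][0]
                changed = True
    return colors
```

Vertices that cannot reach any colored vertex stay uncolored, so the loop always terminates.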

This approach fixes most of the miscolored vertices, but, as the image shows, in some cases it colors base pairs incorrectly because the vote is dominated by the surrounding neighbours.

Graph Segmentation

Stacked Nucleotide Segments

Overview

Starting from the first vertex (nucleotide), the graph is traversed using only stacked edges (phosphodiester bonds).
A segment is a group of at least 4 consecutive vertices (nucleotides) that share the same degree.
The degree of a vertex is the number of edges connecting it to its direct neighbours.
The following image shows the result of segmenting the graph with this algorithm.

Algorithm

Input: Graph G representing RNA structure.
Initialize an empty list to store segments.
Traverse the graph starting from the first vertex.
For each vertex encountered:
a. Traverse only using stacked edges (Phosphodiester bonds).
b. Define a segment when the degree of a vertex remains the same for at least 4 vertices.
c. Store the segment in the list of segments.
Output: Segmented graph with stacked nucleotide segments.
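
A minimal sketch of this segmentation, assuming the vertex degrees have already been collected in sequence order (for example as len(adj[i]) for each position i). The function name stacked_segments and the (start, end) tuple output are choices made for this example only.

```python
def stacked_segments(degrees, min_len=4):
    """Split a backbone walk into runs of equal vertex degree.

    degrees -- degrees[i] is the degree of nucleotide i, in sequence order
    min_len -- minimum run length for a run to count as a segment
    Returns a list of (start, end) index pairs, end inclusive.
    """
    segments = []
    start = 0
    for i in range(1, len(degrees) + 1):
        # Close the current run at the end of the chain or when the degree changes.
        if i == len(degrees) or degrees[i] != degrees[start]:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = i
    return segments
```

Runs shorter than min_len are dropped rather than merged into a neighbour, so those positions remain unsegmented.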

This is an ideal approach for AESs and Phases, but it does not take base pairs into account, so it fails to segment helices.

Base-Pair Segments

Overview

We leverage dot-bracket notation, in which each base pair is represented by a matching ( and ) and the unpaired nucleotides in between are represented by dots (.).
While traversing this notation, we define a segment by the outermost base pairs of an innermost, non-nested base-paired region. For example, for (..((..))..) describing ACGCGAACGGCU, the two segments are (....) / [ACG GCU] and ((..)) / [CGAACG].
With nesting, as in (()..()), we consider 3 segments: each innermost () is one segment, and the outer brackets form another. The in-between region .. is left unresolved.
The following image shows the result of this segmentation; black vertices represent the unresolved regions.

Algorithm

Input: RNA sequence represented in Dot-Bracket notation.
Initialize an empty list to store segments.
Traverse the Dot-Bracket notation.
For each segment:
a. Identify the outermost base pairs of a non-nested innermost base-paired region.
b. Define a segment using these outermost base pairs.
c. Store the segment in the list of segments.
Output: Segmented regions representing base-paired segments.
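
One way to realise these steps is sketched below. It rests on an interpretation that is an assumption of this example: a segment is a run of directly stacked base pairs, plus any unpaired nucleotides that touch its innermost pair. That rule reproduces both examples from the overview, but the project's actual rule may differ. The function names parse_pairs and base_pair_segments are hypothetical, and pseudoknot brackets are not handled.

```python
def parse_pairs(db):
    """Map each opening position to its closing position."""
    stack, pairs = [], {}
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            pairs[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs

def base_pair_segments(db):
    """Return (segments, unresolved) as lists of nucleotide indices."""
    pairs = parse_pairs(db)
    segments, visited = [], set()
    for i in sorted(pairs):
        if i in visited:
            continue
        j = pairs[i]
        k, l = i, j
        visited.add(k)
        # Follow directly stacked pairs inward: (k, l) -> (k + 1, l - 1).
        while pairs.get(k + 1) == l - 1:
            k, l = k + 1, l - 1
            visited.add(k)
        members = set(range(i, k + 1)) | set(range(l, j + 1))
        # Assumption: unpaired runs touching the innermost pair join the segment.
        p = k + 1
        while p < l and db[p] == ".":
            members.add(p)
            p += 1
        q = l - 1
        while q > k and db[q] == ".":
            members.add(q)
            q -= 1
        segments.append(sorted(members))
    assigned = set().union(*segments) if segments else set()
    unresolved = [p for p in range(len(db)) if p not in assigned]
    return segments, unresolved
```

For (..((..))..) this yields the index sets for ACG GCU and CGAACG, and for (()..()) it yields three segments with the two dots left unresolved.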

This is an ideal approach for helices, as it takes the base pairs into account.

Final Approach


Figure panels: Default | Base-Paired Segments | Unresolved regions colored

For helices, we use the base-pair segments; for AESs and Phases, we use the stacked nucleotide segments. In both cases, the entire segment is filled with its most frequent available color. Some regions still remain uncolored; we fill those using the nearest neighbours approach.
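
The segment-filling step could look like the sketch below. It assumes segments are given as lists of nucleotide indices and that uncolored nucleotides are stored as None; the name fill_segments is hypothetical. Any positions still uncolored afterwards would then go through the nearest neighbours pass.

```python
from collections import Counter

def fill_segments(segments, colors):
    """Paint each segment with its most frequent existing colour.

    segments -- list of lists of nucleotide indices
    colors   -- {index: colour or None} from the initial mapping
    Returns a new dict; segments with no coloured member are left as they are.
    """
    filled = dict(colors)
    for seg in segments:
        # Vote using the original colours so the result does not depend on segment order.
        votes = Counter(colors[v] for v in seg if colors.get(v) is not None)
        if not votes:
            continue
        winner = votes.most_common(1)[0][0]
        for v in seg:
            filled[v] = winner
    return filled
```

Because the whole segment is overwritten, a minority color inside a segment (a likely boundary error from the anchor mapping) is replaced along with the gaps.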

This same approach is taken for the 3D viewer.