Design Document: Maintaining Cell Jurors in DynamoDB

1. Introduction

This document outlines the design proposal to address the issue of nodes becoming RECYCLED even when the IsolateNode workflow succeeds. The proposed solution involves maintaining a mapping of jurors for each cell in DynamoDB (DDB). This mapping will provide CDS with awareness of cells for which all nodes are unreachable.


2. Problem Statement

Nodes are becoming RECYCLED even after passing the isolation safety check. This is likely due to network partitions and CDS being unable to reach all nodes on some cells. The existing node scan threshold may not be effective in preventing this issue.


3. Proposed Solution

3.1 Option 2: Maintain the Jury Set in DynamoDB

3.1.1 Overview

Maintaining a mapping of jurors for each cell in DynamoDB will enable CDS to be aware of cells for which all nodes are unreachable.
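
As a rough illustration of how the mapping could be used, the sketch below finds cells whose recorded jurors are all unreachable. It assumes the mapping has already been read from DDB into a dictionary and that the caller knows which nodes answered the latest scan; the function name and both inputs are hypothetical and not part of CDS.

```python
from typing import Dict, Iterable, List, Set


def fully_unreachable_cells(
    cell_jurors: Dict[str, Set[str]],
    reachable_nodes: Iterable[str],
) -> List[str]:
    """Return IDs of cells whose recorded jurors are all unreachable.

    cell_jurors: the mapping as read from DDB (cell ID -> juror node IDs).
    reachable_nodes: node IDs that answered the latest scan (assumed input).
    """
    reachable = set(reachable_nodes)
    return sorted(
        cell_id
        for cell_id, jurors in cell_jurors.items()
        # A cell is flagged only if it has jurors and none of them answered.
        if jurors and jurors.isdisjoint(reachable)
    )


recorded = {
    "cell-a": {"node-1", "node-2", "node-3"},
    "cell-b": {"node-4", "node-5", "node-6"},
}
# Only node-1 answered the scan, so every juror of cell-b is unreachable.
print(fully_unreachable_cells(recorded, ["node-1"]))  # ['cell-b']
```

Without the stored mapping, cell-b would simply be absent from the scan results; with it, the cell is known to exist and can be flagged explicitly.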

3.1.2 Components

    DynamoDB Table: A table to store the mapping of jurors for each cell.
    HealthTracker (HT): Periodic job or additional workflow steps to update DDB records.

3.1.3 Option 2.1: Use Periodic Job to Update DDB

3.1.3.1 Workflow

    Periodic Job in HealthTracker:
        Fetches node and cell data.
        Updates the entries in DynamoDB.

3.1.3.2 Pros

    CDS will have a state record of the current cells in the colony.
    Allows flexibility to adjust the node scan threshold.

3.1.3.3 Cons

    Stale data is possible when a cell is created but data plane contact is lost before the periodic job next updates DDB.

3.1.4 Option 2.2: Add Steps to All Cell Mutating Workflows

3.1.4.1 Workflow

    CreateCell in Hive:
        Informs HT to update the DDB record.
    DeleteCell in Hive:
        Informs HT to update the DDB record.
    RepairCell in HT:
        Updates the DDB record with new jurors.
    IsolateNode in HT:
        Updates the DDB record with new jurors.
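
The four workflow steps above can be sketched as small handlers that write to the juror record. This is a minimal sketch under stated assumptions: the in-memory store stands in for the DDB table, the handler names are hypothetical, and IsolateNode is assumed to know the new juror set of every cell it touches.

```python
from typing import Dict, Iterable, Set


class InMemoryCellJurors:
    """Stand-in for the DDB table: cell ID -> set of juror node IDs."""

    def __init__(self) -> None:
        self.records: Dict[str, Set[str]] = {}

    def put(self, cell_id: str, jurors: Iterable[str]) -> None:
        self.records[cell_id] = set(jurors)

    def delete(self, cell_id: str) -> None:
        self.records.pop(cell_id, None)


def on_create_cell(store: InMemoryCellJurors, cell_id: str, jurors: Iterable[str]) -> None:
    store.put(cell_id, jurors)


def on_delete_cell(store: InMemoryCellJurors, cell_id: str) -> None:
    store.delete(cell_id)


def on_repair_cell(store: InMemoryCellJurors, cell_id: str, new_jurors: Iterable[str]) -> None:
    store.put(cell_id, new_jurors)


def on_isolate_node(
    store: InMemoryCellJurors,
    node_id: str,
    updated_cells: Dict[str, Iterable[str]],
) -> None:
    """updated_cells: new juror set for each cell the isolated node left (assumed input)."""
    for cell_id, new_jurors in updated_cells.items():
        jurors = set(new_jurors)
        # Assumption: an isolated node should no longer be listed as a juror.
        if node_id in jurors:
            raise ValueError(f"{node_id} is still listed as a juror of {cell_id}")
        store.put(cell_id, jurors)


store = InMemoryCellJurors()
on_create_cell(store, "cell-a", ["node-1", "node-2", "node-3"])
on_create_cell(store, "cell-b", ["node-3", "node-4", "node-5"])
on_repair_cell(store, "cell-a", ["node-1", "node-2", "node-6"])
on_isolate_node(store, "node-3", {"cell-b": ["node-4", "node-5", "node-7"]})
on_delete_cell(store, "cell-a")
print(sorted(store.records["cell-b"]))  # ['node-4', 'node-5', 'node-7']
```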

3.1.4.2 Pros

    Shares the benefits of Option 2.1 and is more robust: updating DDB inside each workflow reduces the risk of stale data.

3.1.4.3 Cons

    Requires more development time.
    Introduces complexity into repairs.
    Extends the repair workflow time and adds another potential failure point with DDB.
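
One possible way to limit the impact of the extra failure point is to bound the DDB update step with a small retry loop and let the workflow decide what to do if it still fails. The helper below is a sketch only; its name, defaults, and the broad exception handling are assumptions, and the document does not prescribe a retry policy.

```python
import time
from typing import Callable


def update_with_retry(
    update: Callable[[], None],
    attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `update`, retrying with exponential backoff. Return True on success."""
    for attempt in range(attempts):
        try:
            update()
            return True
        except Exception:
            # Real code would catch only the client's retryable errors.
            if attempt < attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    return False


calls = {"count": 0}
delays = []


def flaky_update() -> None:
    """Fails twice, then succeeds (simulates a transient DDB error)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")


# `delays.append` replaces time.sleep so the demo runs instantly.
ok = update_with_retry(flaky_update, sleep=delays.append)
print(ok, calls["count"], delays)  # True 3 [0.1, 0.2]
```

Returning a boolean rather than raising leaves the choice to the caller: the repair could fail outright, or complete and rely on a later reconciliation to fix the record.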

4. System Architecture

4.1 Overview

The proposed solution involves maintaining a mapping of jurors for each cell in DynamoDB (DDB). This section provides an overview of the key components and their interactions in the system architecture.

4.2 Components

4.2.1 DynamoDB

    Description: A NoSQL database used to store the mapping of jurors for each cell.
    Roles:
        Table: CellJurors
        Attributes:
            CellId (Primary Key): The unique identifier for each cell.
            Jurors: Set of node IDs representing jurors for the cell.

4.2.2 HealthTracker (HT)

    Description: The component responsible for orchestrating workflows and managing the health of nodes.
    Roles:
        Periodic Job:
            Fetches node and cell data.
            Updates the entries in the CellJurors DynamoDB table.

4.2.3 CreateCell Workflow (in Hive)

    Description: Workflow triggered when a new cell is created in Hive.
    Roles:
        Informs HT: Adds a step to inform HealthTracker to update the CellJurors DynamoDB record for the new cell.

4.2.4 DeleteCell Workflow (in Hive)

    Description: Workflow triggered when a cell is deleted in Hive.
    Roles:
        Informs HT: Adds a step to inform HealthTracker to update the CellJurors DynamoDB record for the deleted cell.

4.2.5 RepairCell Workflow (in HealthTracker)

    Description: Workflow responsible for repairing cells and updating the CellJurors DynamoDB records.
    Roles:
        Updates DDB Record: Adds a step to update the CellJurors DynamoDB record with new jurors for the repaired cell.

4.2.6 IsolateNode Workflow (in HealthTracker)

    Description: Workflow triggered when isolating a node in HealthTracker.
    Roles:
        Updates DDB Record: Adds a step to update the CellJurors DynamoDB record with new jurors for the cells affected by the isolated node.

4.3 Interactions

    Periodic Job Interaction:
        The periodic job in HealthTracker fetches node and cell data.
        Updates the entries in the CellJurors DynamoDB table.

    Workflow Interactions:
        CreateCell Workflow:
            Informs HealthTracker to update the CellJurors DynamoDB record for the new cell.
        DeleteCell Workflow:
            Informs HealthTracker to update the CellJurors DynamoDB record for the deleted cell.
        RepairCell Workflow:
            Updates the CellJurors DynamoDB record with new jurors for the repaired cell.
        IsolateNode Workflow:
            Updates the CellJurors DynamoDB record with new jurors for the cells affected by the isolated node.

4.4 Data Flow

    Periodic Job Flow:
        Fetches node and cell data.
        Updates the CellJurors DynamoDB table.

    Workflow Flows:
        Each workflow adds or updates the CellJurors DynamoDB record based on the specific context (create cell, delete cell, repair cell, isolate node).

4.5 Considerations

    The CellJurors DynamoDB table is a critical component for maintaining awareness of cell jurors.
    Workflows in Hive and HealthTracker are enhanced to include steps for updating CellJurors records.
    The data flow is intended to keep the CellJurors DynamoDB table consistent with the actual state of cells; with the periodic job alone, records can lag behind cell changes until the next run.


5. Implementation Details

5.1 DynamoDB Schema

5.1.1 Table Name

    Name: CellJurors

5.1.2 Attributes

    Primary Key:
        CellId (String): The unique identifier for each cell.
    Other Attributes:
        Jurors (Set of Strings): Represents the set of node IDs acting as jurors for the cell.
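
The schema above might translate into a table definition and item shape along the following lines. The dictionaries follow the parameter and attribute-value formats of DynamoDB's CreateTable and PutItem APIs, but no AWS call is made here; the billing mode and the helper names are assumptions, since the document does not specify them.

```python
from typing import Any, Dict, Iterable

TABLE_NAME = "CellJurors"


def table_definition() -> Dict[str, Any]:
    """Parameters in the shape accepted by DynamoDB's CreateTable API."""
    return {
        "TableName": TABLE_NAME,
        # Only key attributes are declared up front; Jurors needs no declaration.
        "AttributeDefinitions": [{"AttributeName": "CellId", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "CellId", "KeyType": "HASH"}],
        # Assumption: on-demand billing. The document does not state capacity needs.
        "BillingMode": "PAY_PER_REQUEST",
    }


def to_item(cell_id: str, jurors: Iterable[str]) -> Dict[str, Any]:
    """Encode one record in DynamoDB's low-level attribute-value format."""
    juror_list = sorted(set(jurors))
    if not juror_list:
        # DynamoDB does not allow empty sets, so a record needs at least one juror.
        raise ValueError("Jurors must contain at least one node ID")
    return {"CellId": {"S": cell_id}, "Jurors": {"SS": juror_list}}


print(to_item("cell-a", ["node-2", "node-1"]))
# {'CellId': {'S': 'cell-a'}, 'Jurors': {'SS': ['node-1', 'node-2']}}
```

Because a string set cannot be empty, a cell with no jurors would need either no record at all or a different representation; the document leaves that case open.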

5.2 HealthTracker Periodic Job

5.2.1 Purpose

    Fetches node and cell data periodically.
    Updates the CellJurors DynamoDB table based on the collected data.

5.2.2 Implementation Steps

    Fetch Node and Cell Data:
        Utilizes existing mechanisms to retrieve the latest node and cell information.

    Update CellJurors Table:
        Iterates through the fetched data and updates the CellJurors table accordingly.
        Compares the existing CellJurors entries with the fetched data to ensure consistency.
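
The compare-and-update step could look roughly like the sketch below, which computes what the periodic job would write. All names are hypothetical. Recorded cells that are missing from the fetched data are reported rather than deleted, on the assumption that a cell may be missing precisely because all of its nodes are unreachable; how to treat such cells is a decision this document does not make.

```python
from typing import Dict, List, NamedTuple, Set


class SyncPlan(NamedTuple):
    puts: Dict[str, Set[str]]  # cells to create or overwrite in CellJurors
    unobserved: List[str]      # recorded cells missing from the fetched data


def plan_sync(
    fetched: Dict[str, Set[str]],
    recorded: Dict[str, Set[str]],
) -> SyncPlan:
    """Compare fetched cell data with the existing CellJurors entries."""
    puts = {
        cell_id: set(jurors)
        for cell_id, jurors in fetched.items()
        # Write only when the record is missing or differs from what was fetched.
        if recorded.get(cell_id) != jurors
    }
    unobserved = sorted(set(recorded) - set(fetched))
    return SyncPlan(puts, unobserved)


fetched = {"cell-a": {"node-1", "node-2"}, "cell-c": {"node-7"}}
recorded = {"cell-a": {"node-1", "node-3"}, "cell-b": {"node-4"}}
plan = plan_sync(fetched, recorded)
print(sorted(plan.puts))  # ['cell-a', 'cell-c']
print(plan.unobserved)    # ['cell-b']
```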

5.3 Workflow Enhancements

5.3.1 CreateCell Workflow (in Hive)

    Purpose: To inform HealthTracker about the creation of a new cell.

    Implementation Steps:
        Adds a step to the workflow to trigger HealthTracker.
        Communicates the details of the new cell (e.g., CellId) to HealthTracker.

5.3.2 DeleteCell Workflow (in Hive)

    Purpose: To inform HealthTracker about the deletion of a cell.

    Implementation Steps:
        Adds a step to the workflow to trigger HealthTracker.
        Communicates the details of the deleted cell (e.g., CellId) to HealthTracker.

5.3.3 RepairCell Workflow (in HealthTracker)

    Purpose: To repair cells and update the CellJurors DynamoDB records.

    Implementation Steps:
        Performs cell repair operations as usual.
        Adds a step to update the CellJurors DynamoDB record with new jurors for the repaired cell.

5.3.4 IsolateNode Workflow (in HealthTracker)

    Purpose: To isolate a node and update the CellJurors DynamoDB records.

    Implementation Steps:
        Performs node isolation operations as usual.
        Adds a step to update the CellJurors DynamoDB record with new jurors for the cells affected by the isolated node.

5.4 Data Flow

    Data Flow from Periodic Job:
        The periodic job fetches node and cell data.
        Updates the CellJurors DynamoDB table based on the fetched data.

    Data Flow from Workflows:
        Each relevant workflow adds or updates the CellJurors DynamoDB record based on the specific context (create cell, delete cell, repair cell, isolate node).

5.5 Testing Considerations

    Unit Testing:
        Test each workflow enhancement in isolation.
        Ensure that the CellJurors DynamoDB table is updated as expected.

    Integration Testing:
        Simulate various scenarios (e.g., cell creation, deletion, repair, node isolation) and validate the interactions between workflows and the CellJurors DynamoDB table.
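
A unit test for one workflow enhancement might be shaped like this sketch, which exercises a hypothetical RepairCell update step against an in-memory fake instead of a real table. The fake, the step function, and the test names are all illustrative assumptions.

```python
import unittest
from typing import Dict, Iterable, Set


class FakeCellJurorsTable:
    """In-memory test double for the CellJurors table (not a real client)."""

    def __init__(self) -> None:
        self.items: Dict[str, Set[str]] = {}

    def put(self, cell_id: str, jurors: Iterable[str]) -> None:
        self.items[cell_id] = set(jurors)


def repair_cell_update_step(
    table: FakeCellJurorsTable, cell_id: str, new_jurors: Iterable[str]
) -> None:
    """Hypothetical final step of RepairCell: record the post-repair jurors."""
    table.put(cell_id, new_jurors)


class RepairCellUpdateStepTest(unittest.TestCase):
    def test_record_reflects_new_jurors(self) -> None:
        table = FakeCellJurorsTable()
        table.put("cell-a", ["node-1", "node-2", "node-3"])
        repair_cell_update_step(table, "cell-a", ["node-1", "node-2", "node-4"])
        self.assertEqual(table.items["cell-a"], {"node-1", "node-2", "node-4"})


# Run the case programmatically so the sketch works outside a test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RepairCellUpdateStepTest)
result = unittest.TextTestRunner().run(suite)
```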

5.6 Deployment Considerations

    Gradual Rollout:
        Deploy the enhanced workflows to a subset of the environment first to ensure stability.
        Gradually roll out the changes to the entire environment.

    Monitoring:
        Implement monitoring for the CellJurors DynamoDB table updates.
        Set up alerts for any anomalies or failures in the data flow.




6. Conclusion

6.1 Chosen Solution

After a careful analysis of the identified issues and proposed solutions, the team has decided to implement Option 2: Maintain the Jury Set in DynamoDB. This solution addresses the root cause of the problem: nodes being marked as isolation safe while they are still part of active cells.

6.2 Pros and Cons

6.2.1 Pros

6.2.1.1 Improved Isolation Safety Check

    Pro: The proposed solution enhances the isolation safety check by maintaining an independent record of cell jurors, reducing reliance on real-time communication with PhysaliaServer.
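
    A check against the independent record could be as simple as the following sketch. It assumes that "isolation safe" means no recorded cell still lists the node as a juror; the function names are hypothetical, and the real check in CDS may apply further conditions.

```python
from typing import Dict, List, Set


def cells_blocking_isolation(
    node_id: str, cell_jurors: Dict[str, Set[str]]
) -> List[str]:
    """Return recorded cells that still list node_id as a juror."""
    return sorted(
        cell_id for cell_id, jurors in cell_jurors.items() if node_id in jurors
    )


def is_isolation_safe(node_id: str, cell_jurors: Dict[str, Set[str]]) -> bool:
    """Assumed rule: safe only if no recorded cell depends on the node."""
    return not cells_blocking_isolation(node_id, cell_jurors)


recorded = {
    "cell-a": {"node-1", "node-2", "node-3"},
    "cell-b": {"node-3", "node-4", "node-5"},
}
print(cells_blocking_isolation("node-3", recorded))  # ['cell-a', 'cell-b']
print(is_isolation_safe("node-6", recorded))         # True
```

    Because the lookup reads only the stored mapping, it gives the same answer whether or not the cell's nodes are reachable at the time of the check.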

6.2.1.2 Increased Robustness

    Pro: The addition of workflow steps in relevant cell and node operations improves the robustness of the system by ensuring timely updates to the CellJurors DynamoDB table.

6.2.1.3 Long-Term Viability

    Pro: This solution lays the foundation for a more sustainable and scalable approach to handling cell information, providing flexibility for future enhancements.

6.2.2 Cons

6.2.2.1 Development Effort

    Con: The implementation of workflow enhancements, especially the addition of steps in repair and isolation workflows, requires significant development effort and introduces complexity.

6.2.2.2 Workflow Time Extension

    Con: The additional workflow steps lengthen cell repair and node isolation operations, which may slow repairs.

6.2.2.3 Potential Stale Data

    Con: The periodic job updates the CellJurors table only at intervals, so records can be stale between runs.

6.3 Decision Rationale

Despite the identified challenges, the chosen solution aligns with the long-term goals of the system, offering a more comprehensive and scalable approach to managing cell information. The development effort required is deemed justifiable given the expected benefits in terms of improved isolation safety and system robustness.

The team will closely monitor the implementation, conduct thorough testing, and address any unforeseen challenges during the deployment phase. Regular feedback loops will be established to iteratively improve the solution based on real-world usage.
