Revisiting Isolation Safety Logic

Background

PhysaliaColonyDataService (CDS) is a color-based sweeper service that scans the colony and runs health checks to schedule repairs on cells and nodes.

When a server isolation workflow is triggered, one of its requirements is to first Recycle the PhysaliaServer node. Nodes can also be recycled manually by Physalia operators in certain situations (e.g., to increase DG diversity of a small outpost).

To completely Recycle a node, the node must not be part of any Cell. We therefore perform an “Isolation Safety check” before allowing the IsolateNode workflow to continue.

If the node does belong to any cells, we schedule RepairCell workflows on all of the cells that this node belongs to, and replace the recycling node with a new node on each Cell. Once the recycling node has been removed from all of its cells, it will pass the isolation safety check, and the IsolateNode workflow can succeed.
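The flow above can be sketched roughly as follows. This is a minimal illustration and not the CDS implementation: the class name, both helper methods, and the map of cell IDs to member node IDs are hypothetical stand-ins for whatever view of the colony CDS actually holds.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names come from the CDS code base.
final class IsolationSafetySketch {
    private IsolationSafetySketch() {}

    /** Returns the IDs of every known cell whose member set contains the node. */
    static Set<String> cellsContaining(String nodeId, Map<String, Set<String>> membersByCell) {
        Set<String> cells = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : membersByCell.entrySet()) {
            if (entry.getValue().contains(nodeId)) {
                cells.add(entry.getKey());
            }
        }
        return cells;
    }

    /** A node passes the check only when no known cell still lists it as a member. */
    static boolean isIsolationSafe(String nodeId, Map<String, Set<String>> membersByCell) {
        return cellsContaining(nodeId, membersByCell).isEmpty();
    }
}
```

In this sketch, the cells returned by `cellsContaining` are the ones that would each need a RepairCell workflow before the node can pass the check. Note that the result is only as good as the cell map it is given.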

CDS also has a node scan threshold: if more than a set percentage of nodes (currently 10%) fail to be scanned, it will not process any node as isolation safe.

Problem

We have noticed increased occurrences of cells with RECYCLED nodes on them. The only way a node can ever become RECYCLED is for the IsolateNode workflow to succeed. A scenario in which an active Cell has RECYCLED nodes should never happen if our isolation safety check is doing its job. This can lead to cells becoming minority, as well as other bad states, e.g. two nodes with different nodeIds but the same IP.

Until recently, we had only seen the isolation safety check fail in outposts. Because CDS scans all outposts in a zone together, it is possible for an entire outpost in a zone to be disconnected while the scan still passes the node scan threshold.
However, we recently saw this same phenomenon occur in DXB Substrate. We believe a network partition occurred, resulting in CDS being unable to reach all of the nodes on some Cells, yet still passing the node scan threshold check.

While testing in IDM, I was able to reproduce this issue by bypassing the node scan threshold and deactivating PhysaliaServer on all 7 replicas. Manually recycling the node succeeded because CDS marked the node as isolation safe.

This happens because CDS learns about cells by sending a GetItemsOnCellsRequest to PhysaliaServer on each node. If no node of a given Cell can answer this request, the Cell is completely unknown to CDS.
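The blind spot can be illustrated with a small model. This is a sketch under stated assumptions, not the actual CDS scan: `cellsByNode` stands in for what each node would answer to GetItemsOnCellsRequest, `reachableNodes` stands in for the nodes that actually responded, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a cell view assembled only from nodes that answered.
final class DataPlaneViewSketch {
    private DataPlaneViewSketch() {}

    /** Builds cellId -> reporting nodes, using answers from reachable nodes only. */
    static Map<String, Set<String>> buildCellView(
            Map<String, Set<String>> cellsByNode, Set<String> reachableNodes) {
        Map<String, Set<String>> view = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : cellsByNode.entrySet()) {
            String nodeId = entry.getKey();
            if (!reachableNodes.contains(nodeId)) {
                continue; // unreachable node: its request is never answered
            }
            for (String cellId : entry.getValue()) {
                view.computeIfAbsent(cellId, k -> new HashSet<>()).add(nodeId);
            }
        }
        return view;
    }
}
```

If every member of a cell is missing from `reachableNodes`, that cell never appears as a key in the resulting view, so any check that consults only this view sees the cell's nodes as belonging to no cells.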

Solution

Option 1: Change the Node Scan Threshold to be >= Replication Factor (Recommended as temporary solution)

Today the node scan threshold is 10%, meaning that in any color, in any zone, we declare a scan as successful if no more than 10% of the nodes were unresponsive.

This one-size-fits-all approach to the problem means that we are potentially allowing any number of cells to be excluded from CDS’s knowledge, depending on the size of the zone.

The proposal is to change from a percentage-based threshold to a strict numerical threshold, where we fail the scan if enough nodes were unresponsive that an entire Cell could have been left out. E.g., in an outpost, if 3 nodes fail then we do not post-process.
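The difference between the two threshold styles can be sketched as below. This is an illustration only: the class and method names are hypothetical, the treatment of an empty scan is an assumption, and the strict rule shown (fail once the number of unresponsive nodes reaches the replication factor) is one reading of the proposal.

```java
// Hypothetical sketch contrasting the two threshold styles; names are illustrative.
final class ScanThresholdSketch {
    private ScanThresholdSketch() {}

    /** Current style: pass when at most maxFailedPercent of nodes were unresponsive. */
    static boolean passesPercentThreshold(int totalNodes, int failedNodes, double maxFailedPercent) {
        if (totalNodes <= 0) {
            return false; // assumption: an empty scan is treated as failed
        }
        return 100.0 * failedNodes / totalNodes <= maxFailedPercent;
    }

    /** Proposed style: fail once enough nodes are missing to hide a whole cell. */
    static boolean passesStrictThreshold(int failedNodes, int replicationFactor) {
        return failedNodes < replicationFactor;
    }
}
```

For example, in an illustrative 200-node zone with 7 unresponsive nodes (3.5%), `passesPercentThreshold(200, 7, 10.0)` passes, while `passesStrictThreshold(7, 7)` fails because those 7 nodes could be all replicas of one cell.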

Pros:

* Simple to implement; can be deployed quickly to prevent further impact
* Removes the one-size-fits-all threshold

Cons:

* Scans will be more likely to fail, resulting in repairs taking longer to be scheduled
* Likely to still be susceptible to network partitions

Option 2: Maintain the Jury set in DynamoDB (Recommended as long-term solution)

Because we rely on PhysaliaServer to tell us about the cells for each server, we currently don’t have any way for CDS to be aware of cells for which all of the nodes are unreachable.

If we maintain a mapping in DDB of the jurors of each cell, then we can use this data to better inform our repairs.
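One conservative way such a mapping could feed the isolation safety check is sketched below. Everything here is an assumption for illustration: the class and method names are hypothetical, plain in-memory maps stand in for both the live data plane view and the DDB records, and the rule of consulting both sources is a design choice of this sketch (Option 2.1 instead has CDS use the DDB data in place of GetItemsOnCellsRequest).

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: consult a persisted juror record as well as the live view.
final class PersistedJuryCheckSketch {
    private PersistedJuryCheckSketch() {}

    /** True when any cell in the given mapping lists the node as a juror. */
    static boolean isListedAsJuror(String nodeId, Map<String, Set<String>> jurorsByCell) {
        for (Set<String> jurors : jurorsByCell.values()) {
            if (jurors.contains(nodeId)) {
                return true;
            }
        }
        return false;
    }

    /** Safe only when neither the live view nor the persisted record lists the node. */
    static boolean isIsolationSafe(
            String nodeId,
            Map<String, Set<String>> liveView,
            Map<String, Set<String>> persistedJurors) {
        return !isListedAsJuror(nodeId, liveView) && !isListedAsJuror(nodeId, persistedJurors);
    }
}
```

With an empty live view (for instance, when every node of a cell is unreachable), the persisted record still blocks isolation as long as it lists the node as a juror.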

Option 2.1 Use periodic Job to Update DDB

We can add a periodic job in HealthTracker that fetches node and cell data and updates the entries in DDB. CDS will then use this data instead of calling GetItemsOnCellsRequest.

Pros:

* CDS will have a state record of the current Cells in the colony
* Allows flexibility to increase the node scan threshold, since we will still have a snapshot of the cells' recent states, or to do away with the threshold entirely

Cons:

* Could potentially have stale data.
* E.g., a cell is created, then we lose contact with the data plane before any write to DDB occurs. If a node on that cell is recycled while still unreachable, we could encounter the same scenario.

Option 2.2 Add steps to all Cell mutating workflows (Recommended)

Add a workflow step to each of the following, to update the DDB record with the new jurors for the cell:

1. CreateCell in Hive, to tell HT to update the DDB record
2. DeleteCell in Hive, to tell HT to update the DDB record
3. RepairCell in HT
4. IsolateNode in HT
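The kind of record these steps would maintain can be sketched as follows. This is a hypothetical in-memory stand-in for the DDB table, not a proposed schema: the class, its method names, and the shape of each update are assumptions. What the IsolateNode step would write is not spelled out here, so the sketch only offers a lookup that such a step could consult.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the DDB juror records.
final class JurorRecordSketch {
    private final Map<String, Set<String>> jurorsByCell = new HashMap<>();

    /** CreateCell step: record the initial jurors of a new cell. */
    void onCreateCell(String cellId, Set<String> jurors) {
        jurorsByCell.put(cellId, new HashSet<>(jurors));
    }

    /** DeleteCell step: drop the record of a deleted cell. */
    void onDeleteCell(String cellId) {
        jurorsByCell.remove(cellId);
    }

    /** RepairCell step: swap the removed node for its replacement. */
    void onRepairCell(String cellId, String removedNode, String addedNode) {
        Set<String> jurors = jurorsByCell.get(cellId);
        if (jurors == null) {
            throw new IllegalStateException("No juror record for cell " + cellId);
        }
        jurors.remove(removedNode);
        jurors.add(addedNode);
    }

    /** Lookup: cells that still record the node as a juror. */
    Set<String> cellsFor(String nodeId) {
        Set<String> cells = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : jurorsByCell.entrySet()) {
            if (entry.getValue().contains(nodeId)) {
                cells.add(entry.getKey());
            }
        }
        return cells;
    }
}
```

Because each mutation is written as part of the workflow that causes it, the record in this sketch never depends on a node being reachable at scan time.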

Pros:

* Same as Option 2.1, but more robust because the data does not go stale

Cons:

* Requires significantly more development time
* Introduces more complexity into repairs
* Extends the repair workflow time and adds another potential failure point with DDB

Appendix

Manually recycling the node succeeded with the following result in CDS for the isolation safety check:

18 Oct 2022 23:22:06,907 [INFO] NodeSweeperService-DEFAULT_ONEBOX_COLOR-52471aab-f64d-4fc0-a719-dfe689380861 (post-processing-main-executor-9) com.amazon.physalia.processingservice.processors.IsNodeIsolationSafeProcessor: IsSafeToIsolate result for node 409aaf06-262e-490c-93fa-4bf85619afa7 safeToIsolate true. isDPViewEmpty true isLocalViewEmpty true isReachable false doesNodeHaveNoCellsOrOrphanCellsOnly false

The safety check passes because of the way we calculate isDPViewEmpty. If an entire cell is unreachable, it does not show up in the cellInfoMap:
https://code.amazon.com/packages/PhysaliaColonyDataService/blobs/eca51281b1cf6556f59[…]rocessingservice/processors/IsNodeIsolationSafeProcessor.java

https://code.amazon.com/packages/PhysaliaColonyDataService/blobs/mainline/--/src/com/amazon/physalia/colonydataservice/colonydata/nodesweep/NodeScanItemTaskImpl.java#L105

IDM testing methods: https://w.amazon.com/bin/view/Users/tangelop/IsolationSafetyRulesTestIDM/
