Sunnyside Labs has been working on integrating Chaos Mesh into the existing devnet infrastructure to make chaos testing easier on running devnets. Chaos Mesh is a tool that lets you inject failures such as network partitions, pod crashes, or resource stress, so you can observe how the system recovers and improve its resilience.

In short, by installing chaosd on each individual node and orchestrating the nodes through the Chaos Mesh dashboard, we can easily trigger faults on any node directly from the dashboard.
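For illustration, the snippet below is a minimal sketch of how a network-partition experiment can be declared against a Chaos Mesh control plane through the Kubernetes API. Our devnet drives chaosd agents on individual hosts from the dashboard instead, so the resource kind, namespace, and `devnet/group` label selectors here are assumptions, not our exact configuration.

```python
# Minimal sketch (assumption, not our exact setup): declaring a Chaos Mesh
# network-partition experiment through the Kubernetes API. The namespace and
# the "devnet/group" labels are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster

partition_experiment = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "NetworkChaos",
    "metadata": {"name": "devnet-50-50-split", "namespace": "chaos-testing"},
    "spec": {
        "action": "partition",          # full partition between the two groups
        "mode": "all",
        "selector": {"labelSelectors": {"devnet/group": "a"}},
        "direction": "both",
        "target": {
            "mode": "all",
            "selector": {"labelSelectors": {"devnet/group": "b"}},
        },
        "duration": "1h",               # heal automatically after one hour
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="chaos-mesh.org",
    version="v1alpha1",
    namespace="chaos-testing",
    plural="networkchaos",
    body=partition_experiment,
)
```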

Introduction

With the chaos testing infrastructure in place, we conducted our first tests by splitting the network in half to trigger a 50:50 node split. We let the split last for 1, 2, and 6 hours, then monitored the network’s recovery.

This creates a complete partition in the network by blocking a list of node IPs via iptables rules (a sketch of the mechanism is shown below). In addition to non-finality tests such as those triggered by introducing bugs in opcodes, we believe that network-level partitioning can provide insight into more adversarial scenarios.
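At the host level, the partition boils down to dropping traffic to and from the peers on the other side of the split. The sketch below is a rough, assumed equivalent of the rules chaosd manages for us; the exact chains and flags may differ, and the peer list is hypothetical.

```python
# Rough equivalent (assumption) of the host-level rules behind a full partition:
# drop all traffic to and from the peer IPs on the other side of the split.
# chaosd manages rules like these for us; the peer list here is hypothetical.
import subprocess

OTHER_SIDE = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]  # hypothetical peer IPs

def set_partition(enabled: bool) -> None:
    """Append DROP rules to start the split, delete them to heal it."""
    action = "-A" if enabled else "-D"
    for ip in OTHER_SIDE:
        subprocess.run(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"], check=True)
        subprocess.run(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"], check=True)

set_partition(True)   # start the network split
# ... keep the split for 1, 2, or 6 hours ...
set_partition(False)  # heal the split and watch recovery
```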

*Figure: network split shown in Dora*

We also tried re-syncing half of the nodes at the same time to see how much the remaining half suffered while serving the data needed to bring the re-syncing peers back to head.
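To measure recovery times like these, it is enough to poll each beacon node's standard syncing endpoint until the whole set reports it is back at head. The probe below is a minimal sketch under that assumption; the node URLs are hypothetical and this is not our exact tooling.

```python
# Minimal recovery probe (an assumed approach, not our exact tooling): poll the
# standard beacon API /eth/v1/node/syncing on every node until none of them
# report syncing, and record how long that took. Node URLs are hypothetical.
import time
import requests

BEACON_NODES = ["http://node-1:5052", "http://node-2:5052"]  # hypothetical endpoints

def all_back_at_head() -> bool:
    for url in BEACON_NODES:
        data = requests.get(f"{url}/eth/v1/node/syncing", timeout=5).json()["data"]
        if data["is_syncing"]:
            return False
    return True

start = time.time()
while not all_back_at_head():
    time.sleep(12)  # one slot
print(f"All nodes back at head after {(time.time() - start) / 60:.1f} min")
```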

Devnet Information

Setup & Configurations

| Setting | Value |
| --- | --- |
| Node combination | 60 nodes: 5 x Grandine/Lighthouse/Prysm/Teku x Geth/Nethermind/Reth |
| Hardware | 8 vCPUs / 16 GB RAM / NVMe SSDs |
| Validator distribution | 8 validators per client |
| Supernode distribution | 10% supernode / 90% fullnode |
| Bandwidth limitation | None |
| Client image versions | |
| Network config | https://github.com/testinprod-io/fusaka-devnets/tree/main/network-configs/devnet-ssl-chaos-3 |

Test Results

Here are the recovery times for each network split duration, along with the dashboards for viewing the status.

| Network split | Recovery time | Recovery result | Peak bandwidth (in+out) usage | Dashboards |
| --- | --- | --- | --- | --- |
| 1 hour | 14 min (slot 640 to 709) | 14 min to reach 88% sync aggregate | Supernode: 40 Mbps<br>Fullnode: 20 Mbps | Grafana, Dora |
| 2 hours | 10 min (slot 2692 to 2741) | 10 min to reach 73% sync aggregate<br>7 nodes stuck, restarted to fix | Supernode: 40 Mbps<br>Fullnode: 10 Mbps | Grafana, Dora |
| 6 hours | 28 min (slot 2677 to 2815) | 28 min to reach 78% sync aggregate<br>9 nodes stuck, restarted to fix | Supernode: 125 Mbps<br>Fullnode: 42 Mbps | Grafana, Dora |
| 50% resync | 6 min | 6 min for all nodes to resync successfully | Supernode: 65 Mbps<br>Fullnode: 13 Mbps | Grafana |

Results

With the latest client images (and after fixes from devnet-3), all nodes synced back after hours of network split without major issues.

All of these cases finalized within 30 minutes of the split healing. As we scaled the test from 2 hours to 6, both the recovery time and the bandwidth usage grew roughly proportionally, and bandwidth usage mostly stayed below the guidance in EIP-7870.

With longer network splits, nodes had a higher chance of failing to merge back onto the correct chain; node restarts were required to bootstrap them back onto it.