Sunnyside Labs has been working on integrating Chaos Mesh into the existing devnet infrastructure to make chaos testing on running devnets easier. Chaos Mesh is a tool that lets you inject failures such as network partitions, pod crashes, or resource stress, so you can observe how the system recovers and improve its resilience.
In short, by deploying chaosd on each node and orchestrating the nodes through the Chaos Mesh dashboard, we can easily trigger faults on individual nodes from the dashboard.
With this chaos testing infrastructure in place, we ran our first tests by splitting the network in half, producing a 50:50 node split. We let the chain continue for 1, 2, and 6 hours, then monitored the network's recovery.
This creates a complete partition in the network by blocking a list of node IPs via iptables rules. In addition to non-finality tests such as those triggered by introducing opcode bugs, we believe that network-level partitioning can also provide insight into more adversarial scenarios.
*Network split shown in Dora*
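As a rough illustration of the mechanism (not the exact chaosd/Chaos Mesh configuration we use), here is a minimal sketch of applying such a partition on a single node; the peer IPs and the standalone script are hypothetical.

```python
import subprocess

# Hypothetical IPs of the nodes in the "other half" of the network.
OTHER_HALF_IPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def set_partition(block: bool) -> None:
    """Add (block=True) or remove (block=False) iptables DROP rules
    for every peer in the other half, cutting traffic in both directions."""
    action = "-A" if block else "-D"
    for ip in OTHER_HALF_IPS:
        subprocess.run(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"], check=True)
        subprocess.run(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"], check=True)

if __name__ == "__main__":
    set_partition(block=True)   # start the network split
    # ...let the chain run partitioned for 1, 2, or 6 hours...
    # set_partition(block=False)  # heal the partition afterwards
```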
We also tried re-syncing half of the nodes at the same time to see how much the other half suffered while serving data to sync the new peers to head.
| Node combination | 60 nodes (5 × Grandine/Lighthouse/Prysm/Teku × Geth/Nethermind/Reth) |
| --- | --- |
| Hardware | 8 vCPUs / 16 GB RAM / NVMe SSDs |
| Validator distribution | 8 validators per client |
| Supernode distribution | 10% supernode / 90% fullnode |
| Bandwidth limitation | None |
| Client image versions | ‣ |
| Network config | https://github.com/testinprod-io/fusaka-devnets/tree/main/network-configs/devnet-ssl-chaos-3 |
Here are the times to recovery for each network-split duration, along with the dashboards showing the status.
| Network split | Recovery | Recovery result | Peak Bandwidth (in+out) Usage | Dashboards |
| --- | --- | --- | --- | --- |
| 1 hour | 14 min (slot 640 to 709) | 14 min to sync agg 88% | supernode: 40 Mbps / fullnode: 20 Mbps | Grafana, Dora |
| 2 hours | 10 min (slot 2692 to 2741) | 10 min to sync agg 73%; 7 nodes stuck, restarted to fix | supernode: 40 Mbps / fullnode: 10 Mbps | Grafana, Dora |
| 6 hours | 28 min (slot 2677 to 2815) | 28 min to reach sync agg 78%; 9 nodes stuck, restarted to fix | supernode: 125 Mbps / fullnode: 42 Mbps | Grafana, Dora |
| 50% resync | 6 min | 6 min for all nodes to resync successfully | supernode: 65 Mbps / fullnode: 13 Mbps | Grafana |
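For reference, the recovery times in the table follow directly from the slot ranges, assuming the standard 12-second slot time:

```python
SECONDS_PER_SLOT = 12  # assumption: standard 12-second slots

def recovery_minutes(start_slot: int, end_slot: int) -> float:
    """Convert a slot range into wall-clock minutes."""
    return (end_slot - start_slot) * SECONDS_PER_SLOT / 60

print(recovery_minutes(640, 709))    # 1-hour split: ~13.8 min, reported as 14 min
print(recovery_minutes(2692, 2741))  # 2-hour split: ~9.8 min, reported as 10 min
print(recovery_minutes(2677, 2815))  # 6-hour split: ~27.6 min, reported as 28 min
```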
With the latest client images (and after fixes from devnet-3), all nodes synced back after hours of network split without major issues.
All of these cases finalized within 30 minutes. As we scaled the test from 2 hours to 6, both the recovery time and the bandwidth usage grew roughly proportionally, and bandwidth usage mostly stayed below the guideline provided in EIP-7870.
With longer network splits, nodes had a higher chance of not merging back onto the correct chain; node restarts were required to bootstrap them back onto it.