Testnet Stall 7/8/2022 Post Mortem

Algorand Testnet Outage

On July 8, 2022, Algorand’s TestNet stalled at approximately 12:15 PM ET and resumed normal operation at approximately 5:15 PM ET.

Impact

No tokens, records, etc. were lost, and the protocol behaved as expected.

Root Cause

A bug in the code that populates an in-process cache used during block validation caused voters to treat valid transactions as invalid. To be clear, the bug was not in the consensus code, and a stall is the expected behavior of the protocol when presented with this scenario.

Stepping back, nodes validate transactions in two major places:

  • When building a block to be proposed
  • When validating a block received from a proposer, prior to voting to certify it

To ensure consensus is reached, the same block validation rules must be applied at proposal time and at voting time. In this case, blocks were successfully validated prior to proposal, but an unexpected exception in an optimization thread used by voters to validate blocks meant the voters rejected those blocks. This cycle continued as the protocol selected new proposers, each of which proposed a block that the voters could not validate.
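The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not code from go-algorand: every function name, the `kind` field, and the way the exception is surfaced are hypothetical. It shows a proposer and a voter applying the same rule, with the voter going through a cache-populating step that fails on certain valid transactions, so no round ever certifies a block.

```python
# Toy model only: all names and logic here are hypothetical illustrations,
# not taken from the Algorand node software.

def validate_txn(txn, balances):
    """Shared validation rule: the sender must be able to cover the amount."""
    return balances.get(txn["sender"], 0) >= txn["amount"]


def proposer_validates(block, balances):
    # Proposal path: applies the rule directly against ledger state.
    return all(validate_txn(t, balances) for t in block)


def populate_cache(block, balances, buggy):
    """Voter-side optimization: prefetch the balances a block will need."""
    cache = {}
    for t in block:
        # Assumed bug: a particular (valid) kind of transaction makes the
        # cache-populating code raise an unexpected exception.
        if buggy and t.get("kind") == "special":
            raise RuntimeError("unexpected failure while populating cache")
        cache[t["sender"]] = balances[t["sender"]]
    return cache


def voter_validates(block, balances, buggy):
    # Voting path: same rule, but evaluated through the prefetched cache.
    try:
        cache = populate_cache(block, balances, buggy)
    except RuntimeError:
        # Assumption: a failed prefetch is treated as a failed validation.
        return False
    return all(validate_txn(t, cache) for t in block)


def run_rounds(block, balances, buggy, max_rounds=5):
    """Return the round in which a block is certified, or None on a stall."""
    for round_no in range(1, max_rounds + 1):
        # Each round a new proposer builds and validates the same kind of block.
        if not proposer_validates(block, balances):
            continue
        if voter_validates(block, balances, buggy):
            return round_no
    return None


balances = {"alice": 10}
block = [{"sender": "alice", "amount": 5, "kind": "special"}]
stalled = run_rounds(block, balances, buggy=True)
healthy = run_rounds(block, balances, buggy=False)
```

With the bug enabled, `stalled` is `None` because every proposal passes the proposer's check but fails the voters' check. With the bug removed, `healthy` is `1`: the same block is certified in the first round without any change to the validation rule itself.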

How We Responded

Our internal alerting picked up the stall in under 60 seconds and the Algorand development team was digging into the issue shortly after.

We spent roughly an hour identifying and verifying the issue and another 30 minutes preparing a patch. By approximately 3 PM ET, we were running our build/testing infrastructure to prepare a security release that would resolve the cache issue and allow TestNet nodes to reach consensus. Importantly, this patch (v3.8.1) did not require a consensus upgrade. TestNet node runners received the patch around 5 PM ET, and roughly 15 minutes later TestNet was back up and running.

The patch release was also made available for any MainNet node runners. The community of node runners reacted swiftly and MainNet was resilient to the issue by roughly 11 PM ET.

Thank you to our ecosystem for their continued dedication to the security and progress of Algorand.
