Incident Report: Akash Validator Downtime Slash
Apr 11, 2023
Between April 9th and 10th, the Strangelove Akash validator stopped signing blocks and was subsequently slashed for downtime. Here's what happened, what we learned, and how we're addressing the issue.
Friday April 7th, 12:30 AM PT: First out-of-disk-space alert from a sentry.
Sunday April 9th, 5:15 PM PT: All sentries out of disk space. First missing blocks alert.
Monday April 10th, 9:06 AM PT: Validator is slashed.
Monday April 10th, 3:27 PM PT: Team becomes aware of issue and begins response.
Monday April 10th, 4:27 PM PT: Validator back online.
Monday April 10th, 5:25 PM PT: Downtime slash refund transaction sent to all delegators.
What Went Wrong
We identified three main factors that contributed to the incident:
- Monitoring and alerting issues: We use half-life, a monitoring tool we built in early 2022 to provide visibility into our deployments. It served us well at our smaller scale then, but it has become less helpful as our operations have grown. We had also misconfigured the alert outputs: testnet validators were alerting in the same channel as our mainnet validators, which made critical alerts easy to miss. Finally, half-life publishes alerts to Discord, while Strangelove migrated to Slack last year for better B2B collaboration, so alerts were arriving in a tool the team no longer used day to day.
- Legacy deployment method: Our Akash validator was still running on our legacy deployment system, rather than our newer Cosmos Operator system.
- Lack of attention to monitoring: We have been focusing on refactoring our internal monitoring systems, causing us to give less attention to half-life. As a result, we missed the alerts from April 7th through the weekend.
The incident highlights the need to hasten the transition to our new monitoring and alerting pipeline and to bring all our deployments up to date with the Cosmos Operator. To address this issue, we're prioritizing the following:
- Migrating our alerting pipeline off half-life to our new stack, which uses Prometheus, Grafana, and Opsgenie.
- Improving our operator on-call rotation process.
- Moving all our deployments to the Cosmos Operator to take advantage of scaling enhancements, such as disk auto-scaling.
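To make the two failure modes above concrete (disks filling up unnoticed, and mainnet alerts buried among testnet ones), here is a minimal sketch in plain Python. It is not our Prometheus, Grafana, or Opsgenie configuration; the thresholds, channel names, and node names are hypothetical and chosen only for illustration.

```python
import shutil
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds and channel names, for illustration only.
WARN_FRACTION = 0.80
CRIT_FRACTION = 0.90
CHANNELS = {"mainnet": "#alerts-mainnet", "testnet": "#alerts-testnet"}


@dataclass
class Alert:
    channel: str
    severity: str
    message: str


def disk_used_fraction(path: str) -> float:
    """Return the fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def build_disk_alert(node: str, network: str, used_fraction: float) -> Optional[Alert]:
    """Build an alert for a node, routed to a channel specific to its network.

    Returns None when disk usage is below the warning threshold.
    """
    if used_fraction >= CRIT_FRACTION:
        severity = "critical"
    elif used_fraction >= WARN_FRACTION:
        severity = "warning"
    else:
        return None
    # Mainnet and testnet never share a channel in this sketch.
    channel = CHANNELS[network]
    message = f"{node} ({network}) disk usage at {used_fraction:.0%}"
    return Alert(channel=channel, severity=severity, message=message)


if __name__ == "__main__":
    # Check the current working directory's filesystem as a stand-in for a sentry's data volume.
    alert = build_disk_alert("sentry-0", "mainnet", disk_used_fraction("."))
    print(alert if alert else "disk usage below warning threshold")
```

Keeping the routing decision keyed on the network means a noisy testnet cannot push a mainnet warning out of view, which is the situation we were in over the weekend of the incident.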
Making It Right
We refunded the downtime slash amount to our delegators using Lavender Five's slash_refunds_tendermint. The transaction can be found on the Akash chain here. We're grateful to Lavender Five for their tool, which helped us construct the unsigned transaction with all the refund amounts so we could sign and broadcast it to the chain.
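For readers curious about the arithmetic behind a refund like this, the sketch below shows one way per-delegator amounts could be computed. It does not describe how slash_refunds_tendermint works internally. It assumes each delegator loses the same fraction of their pre-slash stake; the slash fraction, the addresses, and the function name `refund_amounts` are placeholders, not Akash's actual parameters or real accounts.

```python
from decimal import Decimal, ROUND_CEILING
from typing import Dict


def refund_amounts(pre_slash_stakes: Dict[str, int], slash_fraction: Decimal) -> Dict[str, int]:
    """Compute a refund per delegator, in the chain's base denomination.

    Assumes every delegator lost `slash_fraction` of their pre-slash stake.
    Amounts are rounded up to a whole unit so no delegator is under-refunded.
    """
    refunds: Dict[str, int] = {}
    for address, stake in pre_slash_stakes.items():
        lost = Decimal(stake) * slash_fraction
        amount = int(lost.to_integral_value(rounding=ROUND_CEILING))
        if amount > 0:
            refunds[address] = amount
    return refunds


if __name__ == "__main__":
    # Placeholder addresses and stakes; the fraction is an assumed value.
    stakes = {"akash1aaa": 1_000_000, "akash1bbb": 250_000_000, "akash1ccc": 0}
    for address, amount in refund_amounts(stakes, Decimal("0.0001")).items():
        print(address, amount)
```

Using `Decimal` rather than floats avoids rounding surprises on large stakes, and dropping zero-amount entries keeps the resulting transaction free of empty sends.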
In conclusion, this incident served as a wake-up call for us to improve our monitoring, alerting, and deployment processes. We're committed to taking the necessary steps to ensure that our validators remain reliable and secure.