Post-Mortem: Strangelove Validator Jailing on dYdX Chain
Nov 29, 2023
Dan Bryan
Incident Overview
On November 10, 2023, Strangelove's validator was unexpectedly jailed on the dYdX chain. The jailing resulted from a combination of a cosigner issue and inter-datacenter network latency, which together temporarily degraded our block signing performance.
What Happened?
The incident originated from an issue with one of our cosigners. During the night, our Horcrux system correctly initiated a leader election, but the newly elected leader cosigner was located in the datacenter with the longest round-trip time (RTT) to the others. This cosigner had an RTT of 35ms, while the others were at 5ms. Although the difference may seem negligible, it was enough to push our block sign times just past the tight signing deadline that dYdX's fast block times impose. As a consequence, our signing rate intermittently fell below 20%. Over the approximately 3-hour slashing window defined by dYdX, our uptime declined to 20%, resulting in our validator being jailed. This event highlighted the stringent requirements of the dYdX chain, where a 3-hour slashing window gives validators limited time to respond to such incidents.
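To make the jailing mechanism concrete, here is a minimal sketch of sliding-window liveness tracking in the style used by Cosmos SDK chains. It is a simplified model, not dYdX's actual implementation: the function name, the window size, and the minimum signed fraction are all hypothetical values chosen for illustration.

```python
from collections import deque
from typing import Iterable, Optional


def first_jail_height(
    signed: Iterable[bool],
    window: int,
    min_signed_fraction: float,
) -> Optional[int]:
    """Return the 1-based block index at which the validator would be jailed.

    Simplified model: the validator is jailed once the number of missed
    blocks inside the most recent `window` blocks exceeds the allowed maximum.
    Returns None if that never happens.
    """
    # Assumption: the allowed misses are the window minus the required signed count.
    max_missed = window - int(window * min_signed_fraction)
    recent = deque(maxlen=window)  # signing history for the current window
    missed = 0

    for height, did_sign in enumerate(signed, start=1):
        # When the window is full, the oldest entry is about to be evicted.
        if len(recent) == window and not recent[0]:
            missed -= 1
        recent.append(did_sign)
        if not did_sign:
            missed += 1
        # Only evaluate once a full window of blocks has been observed.
        if height >= window and missed > max_missed:
            return height
    return None


# Hypothetical parameters: a 10-block window requiring at least 20% signed.
one_in_ten = [i % 10 == 0 for i in range(30)]    # roughly 10% uptime
three_in_ten = [i % 10 < 3 for i in range(30)]   # roughly 30% uptime
print(first_jail_height(one_in_ten, window=10, min_signed_fraction=0.2))
print(first_jail_height(three_in_ten, window=10, min_signed_fraction=0.2))
```

In this model a validator that signs only intermittently can stay unjailed for a while, but once its signed share inside the window drops under the minimum, jailing follows as soon as the window fills with misses.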
Immediate Response and Fixes
We added a temporary Horcrux signer in Tokyo on Google Cloud, which reduced our signing latency. This was a quick stopgap rather than a permanent fix.
We released an update to Horcrux (v3.2.0) that makes our remote signer more efficient. The update included pre-signing the nonce and using gRPC multiplexing, reducing signing time and network traffic from approximately 5ms plus network latency for four round trips to about 2ms plus network latency for a single round trip.
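The effect of the Horcrux update can be illustrated with some rough arithmetic. The sketch below assumes that "network latency for N round trips" means N times a single RTT, and it ignores threshold-signing details such as which cosigners respond first. The function names are hypothetical, and the RTT values are the 5ms and 35ms figures quoted above.

```python
def sign_latency_ms(processing_ms: float, round_trips: int, rtt_ms: float) -> float:
    """Estimate one signing operation: local processing plus network round trips."""
    return processing_ms + round_trips * rtt_ms


def before_update(rtt_ms: float) -> float:
    # Approximately 5ms of processing plus four network round trips.
    return sign_latency_ms(5.0, 4, rtt_ms)


def after_update(rtt_ms: float) -> float:
    # About 2ms of processing plus a single network round trip.
    return sign_latency_ms(2.0, 1, rtt_ms)


for rtt in (5.0, 35.0):
    print(
        f"RTT {rtt:.0f} ms: before ~{before_update(rtt):.0f} ms, "
        f"after ~{after_update(rtt):.0f} ms"
    )
```

Under these assumptions, a 35ms RTT costs roughly 145ms per signature before the update and roughly 37ms after it, compared with 25ms and 7ms at a 5ms RTT. Because the round-trip count multiplies the RTT, cutting four round trips to one matters most for the most distant cosigner.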
Future Mitigations
We are considering proposing a governance change to extend the slashing window on the dYdX chain. This change aims to align with other Cosmos chains, offering a more reasonable window of around 15 hours for validators to address incidents.
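As a rough illustration of why the window length matters, the sketch below estimates how long a validator with a clean record could be completely offline before falling under the minimum signed share. The 20% minimum is an assumption made for this example, not a confirmed chain parameter, and the function name is hypothetical.

```python
def outage_budget_hours(window_hours: float, min_signed_fraction: float) -> float:
    """Hours of total downtime that fit in one window before uptime
    drops below the required minimum, starting from a clean record."""
    return window_hours * (1.0 - min_signed_fraction)


# Assumed minimum signed share of 20% for both scenarios.
for window in (3.0, 15.0):
    budget = outage_budget_hours(window, min_signed_fraction=0.2)
    print(f"{window:.0f}-hour window: about {budget:.1f} hours to respond")
```

With these assumed values, the response budget grows from about 2.4 hours to about 12 hours. A partial degradation, like the one in this incident, consumes the budget more slowly than a full outage does.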
Impact on Users and Reputation
Fortunately, in this case, dYdX does not implement a slashing penalty, which means no user funds were lost or needed to be reimbursed. However, being jailed on-chain can affect a validator's reputation, as it is often perceived as a sign of negligence. We want to assure our delegators and the community that this was a unique technical and geographical challenge and not negligence on our part.
Lessons Learned and Moving Forward
This incident has been an important learning experience for us at Strangelove. It underscores the necessity of continuously evolving and adapting our infrastructure and strategies to meet the diverse requirements of different blockchain networks. We remain committed to providing the highest standard of service to our delegators and to the wider Cosmos community.