Horcrux v2.0 Has Been Released!
Introducing the Horcrux Threshold Signer
Horcrux is the premier high-availability key management infrastructure for Cosmos validator operations. It brings two significant benefits: additional security and better uptime. This post by @larry0x goes into detail on the different architectures that are available to validators in Cosmos and the benefits that Horcrux brings.
History of Cosmos Threshold Signers
The first threshold signer for Cosmos was written by Roman Shtylman while he was at Polychain Labs. This initial implementation used a very simple custom p2p protocol to connect the nodes and share signatures. While it used local and cluster-wide high watermark files to keep track of the latest block signed and prevent double signing, several feature areas like validator proposals and operator ease of use remained unaddressed. Strangelove first started using it in production during the cosmoshub-3 -> cosmoshub-4 upgrade. The experience of manually modifying the priv_validator_state.json files on each node and then waiting 15 minutes for sentries to come online while hoping to avoid a double sign was nerve-wracking. It convinced us that this product needed better testing and tooling to make it easier to use. This core insight led to the Strangelove fork that has become Horcrux.
The first challenge to overcome was to build a testing framework that would validate the behavior of the cluster and ensure that it doesn’t double sign under given use cases. This same test framework would also verify that the refactor kept these properties and added new ones. We made the decision to do this testing in a black-box environment that spun up the full system from the built binaries and allowed for easy testing of different network topologies and failure scenarios. The README on the Osmosis testing framework (based on the Horcrux framework) shares some reasons for this choice. For examples, check out the Horcrux tests. One thing the test framework showed was that the system failed to sign proposals. After exploring some options to fix this, it became clear that a leader election system (i.e. a consensus mechanism) was required to solve this problem.
High Availability and Fault Tolerance
Prior to the Horcrux v2 beta, Horcrux attempted to reach all other nodes for each block sign process and would error if any of them were unreachable, causing a failure to sign the block. With the v2 refactors, including the Raft Implementation and GRPC Horcrux Node Communication mentioned below, the threshold validation process now takes full advantage of what a threshold key allows. For every block sign request, only the threshold number of nodes (the same threshold used when sharding the key) is needed to successfully sign the block and return the signature to the requesting chain nodes. This improves the overall reliability of the Horcrux cluster: block signing can continue through downed nodes and network issues such as temporary disconnections and high latency.
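To make the threshold behavior concrete, here is a minimal sketch in Go. It is not Horcrux's actual code: shareFunc, collectShares, and unreachable are hypothetical names, and each cosigner is modeled as a plain function instead of a network call. The sketch only shows how a request can succeed as soon as a threshold of cosigners has responded, even when others are down.

```go
package main

import (
	"errors"
	"fmt"
)

// shareFunc is a hypothetical stand-in for asking one cosigner for its
// partial signature; a real cluster would make a network call here.
type shareFunc func() (string, error)

// unreachable simulates a cosigner that is down or partitioned away.
func unreachable() (string, error) {
	return "", errors.New("cosigner unreachable")
}

// collectShares queries every cosigner concurrently and returns as soon as
// `threshold` of them have answered successfully. Failed cosigners are
// skipped instead of failing the whole request.
func collectShares(cosigners []shareFunc, threshold int) ([]string, error) {
	type result struct {
		share string
		err   error
	}
	// Buffered so slow cosigners can still send after we have returned.
	results := make(chan result, len(cosigners))
	for _, c := range cosigners {
		go func(f shareFunc) {
			s, err := f()
			results <- result{s, err}
		}(c)
	}

	shares := make([]string, 0, threshold)
	for i := 0; i < len(cosigners); i++ {
		r := <-results
		if r.err != nil {
			continue // one unreachable node does not block signing
		}
		shares = append(shares, r.share)
		if len(shares) == threshold {
			return shares, nil
		}
	}
	return nil, fmt.Errorf("got %d of %d required shares", len(shares), threshold)
}
```

In a 2-of-3 setup, calling collectShares with one unreachable cosigner still returns two shares, while two unreachable cosigners produce an error.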
The Raft consensus algorithm allows for reliable leader election in an arbitrary group of nodes. This enabled reliable proposal signing and improved reliability over previous versions of the threshold signer. With the addition of leader election, every block sign request is managed by the leader node. The leader requests and shares all necessary information between the other nodes to assemble a full signature for each block. If a block sign request reaches a Horcrux node that is not the leader, that node proxies the request to the leader. The leader can handle concurrent sign requests for the same block and will only sign each block once. All subsequent requests for the same block return the same signature as initially signed. Horcrux additionally uses Raft consensus for the high watermark, which tracks the highest block that has been signed by the cluster. This provides additional slashing protection against double signing.
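The sign-once and high watermark behavior described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the Horcrux implementation: the signer type, its fields, and signFn are hypothetical, the watermark is kept in local memory instead of being replicated through Raft, and requests are keyed by block height only.

```go
package main

import (
	"errors"
	"sync"
)

// errWatermark is returned when a request would sign at or below the
// high watermark with different content.
var errWatermark = errors.New("refusing to sign at or below high watermark")

// signer is a hypothetical leader-side guard. It remembers the highest
// height signed so far and the signature produced at that height.
type signer struct {
	mu        sync.Mutex
	height    int64                       // high watermark
	payload   string                      // what was signed at that height
	signature string                      // signature returned for it
	signFn    func(payload string) string // hypothetical signing backend
}

// Sign produces at most one signature per height. A repeated request for
// the same block gets the cached signature; anything else at or below the
// watermark is rejected to avoid a double sign.
func (s *signer) Sign(height int64, payload string) (string, error) {
	s.mu.Lock() // serializes concurrent requests for the same block
	defer s.mu.Unlock()

	switch {
	case height == s.height && payload == s.payload:
		return s.signature, nil // same block: return the original signature
	case height <= s.height:
		return "", errWatermark // older height or conflicting block
	}

	// Advance the watermark before handing out the new signature.
	s.height, s.payload = height, payload
	s.signature = s.signFn(payload)
	return s.signature, nil
}
```

The mutex is what makes concurrent requests for one block safe here: the first caller signs, and later callers for the same block read the cached signature.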
GRPC Horcrux Node Communication
Prior to the Raft integration, Horcrux used RPC for sharing data between Horcrux nodes. The introduction of Raft meant that Horcrux had two services that required communication amongst nodes. Initially, Raft was added as a separate service on a separate port, which added operational overhead: more shared ports to firewall between nodes. With the addition of GRPC to replace the RPC server, and the use of Jille/raft-grpc-transport, the node communication was consolidated to a single port.
Improved Block Sign Performance
The optimizations in v2 brought in highly reliable communication between nodes and more advanced p2p features that significantly lowered latency. The completed code was not only more reliable, passing tests that the prior implementation never could, but also much faster. The average signature generation time dropped from ~60ms to ~25ms, and the maximum signature time decreased from ~100ms to ~30ms.
After the completion of the Raft implementation, Strangelove began to use Horcrux in production. With the prevalence of bridge hacks, Cosmos validators are now facing intrusions from nation-state level actors (e.g. North Korea) looking to compromise deployments. Horcrux is a key weapon in the fight to prevent these types of attacks and to mitigate the impact of any downtime they may cause. We also began working with other validators who were interested in using Horcrux to improve their node operations. By a conservative estimate, around 50 validators run the new version of Horcrux in production, and it is quickly becoming a standard for high-quality validator operators in the Cosmos ecosystem. Agoric’s security team recommends its usage for the increased security and reliability.
Prometheus metrics for the cluster
HSM Integration for horcrux
Distributed Key generation version of horcrux