Be Kind to your Validators

Mar 07, 2023

David Nix

Operating blockchain nodes is a challenging endeavor and Cosmos nodes are no exception. From how to secure your validator secrets to the many nuanced config settings, running a node requires great skill from the node operators.

It’s no secret validators are crucial to your network. The following are some Dos and Don’ts to make a validator’s life easier.

Do use semantic versioning 2.0 for upgrades

You don’t want node operators to upgrade to the wrong version. You don’t want operators to ignore an upgrade. You don’t want validators failing to achieve consensus and halting the chain because they picked the wrong upgrade.

Node operators often run infrastructure for multiple chains, not just yours. The more unique and bespoke your naming and upgrade patterns, the more likely it is for operators to miss or misinterpret an upgrade signal. This makes it important to use consistent upgrade versions that repeat patterns from other chains.

We’re not going to remember or immediately recognize your internal code name for upgrades. Don’t get clever or cutesy. Use Semantic Versioning 2.0. Semver makes it obvious whether an upgrade is major, minor, or patch, and it is a ubiquitous pattern in the industry. Conventions are good for maintainability.
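
To illustrate why semver helps operators, here is a minimal Python sketch that classifies an upgrade from two version tags. The function names are hypothetical, the optional leading "v" is an assumption about how release tags are written, and the pattern is a simplification rather than the full SemVer 2.0 grammar (pre-release ordering is ignored).

```python
import re
from typing import NamedTuple

# Simplified pattern: MAJOR.MINOR.PATCH with an optional leading "v" and
# optional pre-release / build metadata. Not the full SemVer 2.0 grammar.
_SEMVER = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+([0-9A-Za-z.-]+))?$"
)


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(tag: str) -> Version:
    """Parse a tag such as 'v10.1.0'; raise ValueError for anything else."""
    match = _SEMVER.match(tag)
    if match is None:
        raise ValueError(f"not a semantic version: {tag!r}")
    major, minor, patch = (int(part) for part in match.group(1, 2, 3))
    return Version(major, minor, patch)


def upgrade_kind(current: str, target: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' (same or older version)."""
    old, new = parse_version(current), parse_version(target)
    if new <= old:  # tuples compare numerically, so 10.0.0 > 9.0.0
        return "none"
    if new.major != old.major:
        return "major"
    if new.minor != old.minor:
        return "minor"
    return "patch"
```

A code name such as "v2-fluffy" is rejected by parse_version, which is the point: an operator’s tooling can classify a semver tag mechanically, but it cannot do anything useful with a bespoke name.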

Don’t assume how someone will host a node

Provide high-level instructions so operators understand how your nodes can be deployed, but don’t provide instructions or scripts for a specific hosting environment, except perhaps as implementation examples.

In other words, tell your validators what they need to achieve, not how to achieve it.

A node operator may choose bare metal servers, VMs, containers, or some other configuration or combination of implementations. Don’t assume anything about their approach, and certainly don’t assume everyone is running bare metal. Bare metal may be common for early implementations or for single-chain validators, but as operators grow and mature, and as they support more chains, they will likely move toward infrastructure-as-code deployments.

Strangelove bucks the bare metal trend and runs everything in Kubernetes. We do this to leverage powerful abstractions provided by Kubernetes that simplify our complex deployment environment. Monitoring and observability are easier in more instrumented environments, and as the scale of deployments grows, abstractions like Kubernetes allow a small team of DevOps professionals to maintain a growing fleet of validators and other Cosmos infrastructure. Strangelove wrote the Cosmos Operator to address these very needs.

Do add your info to the Chain Registry

Make an entry in the Chain Registry and keep it updated. We and other validators rely on the registry staying current to help us debug issues such as peering. We may need to switch persistent peers or use different seed nodes, for example, and if the list is out of date it can be difficult to identify the source of the problem.

This is one place where a central resource beats decentralization. The Chain Registry has its flaws, but it is a single source of truth with structured data, so we can build automation around it.
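
As a sketch of the kind of automation structured data allows, the Python below turns a registry-style entry into the comma-separated id@host:port string that a node’s persistent_peers and seeds settings expect. The sample JSON is made up for illustration, and the field names (peers, seeds, persistent_peers, id, address) are an assumption about the registry’s chain.json layout, so check them against the current schema before relying on this.

```python
import json

# Hypothetical registry entry; the chain name, node IDs, and hosts are placeholders.
SAMPLE_CHAIN_JSON = """
{
  "chain_name": "examplechain",
  "peers": {
    "seeds": [
      {"id": "aaaa1111", "address": "seed.example.com:26656"}
    ],
    "persistent_peers": [
      {"id": "bbbb2222", "address": "peer-1.example.com:26656"},
      {"id": "cccc3333", "address": "peer-2.example.com:26656"}
    ]
  }
}
"""


def peer_string(chain: dict, kind: str) -> str:
    """Join the entries under peers[kind] as 'id@address,id@address'."""
    entries = chain.get("peers", {}).get(kind, [])
    return ",".join(f"{entry['id']}@{entry['address']}" for entry in entries)


chain = json.loads(SAMPLE_CHAIN_JSON)
print(peer_string(chain, "persistent_peers"))
print(peer_string(chain, "seeds"))
```

If the registry entry is stale, the generated peer list is stale too, which is why keeping it updated matters to everyone who automates against it.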

Do add your testnets to the Chain Registry

Although it is not obvious, the Chain Registry supports testnets in the testnets subdirectory. Testnets are important for proving out integrations and for troubleshooting issues, and a validator’s job is much simpler if the testnets are easily discoverable.

Don’t have a single point of failure

If your validator requires processes outside the standard Cosmos SDK, make sure they can be run in a highly available configuration. Forcing your validator operators to run a single point of failure is guaranteed to cause frustration and more frequent downtime.

We run every validator with redundancy, thanks to Tendermint’s remote signing and Horcrux’s key sharding. This gives our SRE team the peace of mind that we can tolerate individual instance failures so that every outage doesn’t become a fire drill. The deployment as a whole continues to operate while we bring the failed instance back up.

If your architecture doesn’t allow us to run multiple redundant copies of core processes, then each failure becomes a stressful and urgent recovery operation, and that is unkind to your operators and detrimental to the stability of the chain.

Do adopt communication conventions

Validators often support many chains. The more chains use similar communication patterns, the easier it is for validators to stay on top of any changes. The current convention for announcing chain upgrades and other events in Cosmos is Discord. There are likely better tools for this, but Discord is by and large where chains make announcements. You can make announcements any way you prefer, but please also follow the Discord conventions to make things more discoverable and monitorable for everyone.

At a minimum, maintain an #announcements channel and notify @everyone for all posts about upgrades. A nice improvement is to let validators opt in to a validator role and announce via @role.
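
For teams that post announcements through a Discord webhook, the two mention styles might be built as follows. This is a sketch under assumptions: the function name and role ID are hypothetical, and the content and allowed_mentions fields reflect Discord’s message format as we understand it, so verify them against Discord’s current documentation. Nothing is sent over the network here; the code only builds the JSON body.

```python
import json
from typing import Optional


def announcement_payload(message: str, role_id: Optional[str] = None) -> dict:
    """Build a webhook message body that pings @everyone or one opt-in role."""
    if role_id is None:
        # Minimum convention: notify @everyone about the upgrade.
        return {
            "content": f"@everyone {message}",
            "allowed_mentions": {"parse": ["everyone"]},
        }
    # Improvement: ping only the validator role that operators opted in to.
    return {
        "content": f"<@&{role_id}> {message}",
        "allowed_mentions": {"roles": [role_id]},
    }


# "123456789012345678" is a placeholder, not a real role ID.
body = announcement_payload(
    "Upgrade v2.0.0 is scheduled. See the pinned upgrade instructions.",
    role_id="123456789012345678",
)
print(json.dumps(body, indent=2))
```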

Osmosis provides a good example of how to run a Discord server and follow conventions.

Don’t have unpredictable upgrade schedules

Upgrade schedules should be as boring as their naming convention. Schedule your upgrades for times when the majority of operators will be focused on work and easily reachable. Find a time mid-week and mid-day for the bulk of your validator operators and stick to upgrades during that window.

Don’t plan upgrades when people are unavailable to handle issues, like on weekends or around major world holidays such as Christmas or New Year’s. One chain had an upgrade four days before Christmas 2022. This is dangerous because many people are on vacation around this time and many others are distracted or only partially working. Luckily, the upgrade went smoothly, but had there been an urgent issue it would have been difficult to get in touch with operators to fix it.

For similar reasons, avoid upgrades on the day before a weekend. Node operators will be quicker to respond during the work week, and sometimes major bugs don’t surface immediately.
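
The scheduling guidance above can be turned into a simple pre-announcement check. The sketch below is one possible policy, not a standard: the allowed weekdays, the UTC hour window, the one-week holiday buffer, and the holiday list are all assumptions to adjust for where your validator operators actually are.

```python
from datetime import date, datetime, timedelta

# Hypothetical policy values; tune them for your own validator set.
ALLOWED_WEEKDAYS = {1, 2, 3}        # Tuesday to Thursday (Monday is 0)
ALLOWED_HOURS = range(14, 18)       # 14:00 to 17:59, treating times as UTC
HOLIDAY_BUFFER = timedelta(days=7)  # stay a week clear of major holidays


def holidays_for(year: int) -> list[date]:
    """Major holidays near a given year; extend this list as needed."""
    return [date(year, 1, 1), date(year, 12, 25), date(year + 1, 1, 1)]


def upgrade_time_problems(when: datetime) -> list[str]:
    """Return the reasons a proposed upgrade time is a poor choice, if any."""
    problems = []
    if when.weekday() not in ALLOWED_WEEKDAYS:
        problems.append("not mid-week")
    if when.hour not in ALLOWED_HOURS:
        problems.append("outside the mid-day window")
    for holiday in holidays_for(when.year):
        if abs(when.date() - holiday) <= HOLIDAY_BUFFER:
            problems.append(f"within a week of {holiday.isoformat()}")
    return problems


# An upgrade four days before Christmas 2022 is flagged by the holiday rule.
print(upgrade_time_problems(datetime(2022, 12, 21, 15, 0)))
```

An empty list means the proposed time passes every rule. Fridays and weekends are rejected by the mid-week rule, which covers the advice about avoiding the day before a weekend.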

Do respond to bugs and create workarounds quickly

If a chain upgrade goes wrong, operators want to get everything back up and running as quickly as possible. Be sure to broadcast the latest findings, recommendations, and timelines in a public channel so all validators can stay up to date. A collaborative approach also lets validators assist each other, lessening the communication load on your team.

Don’t use private repos

Chain binaries should be public whenever possible. Private code repositories come with headaches like the need to grant access. This means validators can’t easily bring in new folks to help tackle issues, and fixes might get stuck because of a lack of access to your code rather than a lack of ability or desire to help.

Public repos also encourage better responsiveness and security practices, generally speaking.

Do have official upgrade instructions

Upgrade instructions should be easy to find and read. They should be in a consistent format every time, and they should describe what to do if any of the upgrade steps fail. Have a website that links to upgrade instructions and make them easy to find from your chain binary repo.

Do test your upgrades

You don’t need to test every environment or architecture, but you should at least validate that your chain upgrade works in a simple happy-path scenario. Strangelove publishes interchaintest, a testing framework for IBC chains, that can validate upgrade scenarios as part of your CI/CD practice. For consultation on more complex testing and validation practices, please contact us at hello@strange.love.
