docs/blueprint: partial rollout automation

Change-Id: I355d7071da7e501948665ee153ed641722a3d264
This commit is contained in:
JT Olio 2023-04-17 17:09:10 -04:00 committed by JT Olio
parent 2592aaef9c
commit e90612c285

View File

@ -0,0 +1,124 @@
# Partial rollout automation
## Abstract
This is a proposal to partially automate some of the version rollout process.
The version rollout process right now, for better or worse, is taking manual
action every 5%. This is bad so we want to fix it.
## Background/context
We have a bunch of different problems around safely keeping storage nodes
up to date in a safe way. We want to:
* Make sure that storage nodes are not running stale or old code.
* Make sure that new code doesn't break many nodes at once.
These two goals are at a bit of a tension - if we automatically update all nodes
immediately, then we risk an update breaking a significant portion of the
network.
We landed on this design:
https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md
This design allows us to control what percent of the network is eligible for an
update. The design of this system intended to have an exponentially increasing
rollout scheme, where storage node updates would upgrade a small percentage
of the network, maybe 5%, and then see what happens. If that 5% looked good,
then maybe we'd double it, and so on, until the network was upgraded.
The upside to this exponential scheme is that humans are only involved at growth
points just to double check that the network hasn't fallen over. This is an
important feature and something we want to preserve.
The downside of this scheme is that the later stages of the rollout process
potentially involve lots of nodes upgrading at the same moment. Because storage
node operators are often eager to get the latest update when it is available
to them, we potentially risk having half or more of the network down for an
otherwise safe upgrade at a time.
The data science team suggested we don't do an upgrade increment larger than 5%
every 6 hours.
In practice this means that we are now pushing the rollout along at 5%
increments every rollout, and it's tiring.
## Design and implementation
The intention of the below design is to resolve the data science team's
concern, while still allowing us to do exponentially increasing rollouts.
This design still intentionally requires multiple PRs per rollout, but
instead of one every 5% (20 PRs), we would have just 5 (or maybe
slightly less, but not 1):
* a PR to start a 6.25% rollout,
* then a PR for a 12.5% rollout,
* then a PR for a 25% rollout,
* then a PR for a 50% rollout,
* then a PR for a 100% rollout.
The reason for continuing to have more than 1 PR is because we get
valuable feedback both from dashboards about the network and from
the community about how the rollout is doing. We have often stopped
rollouts due to issues discovered by the community or by degraded network
behavior. We would like the default to be that we check first before
we continue the rollout, and not blindly roll it through.
Here is how we will make this work:
Currently, version.storj.io's service takes a configuration for each process
type under management. For each process, the configuration needed is:
* The minimum required version
* The suggested version (for the rollout)
* A rollout seed (see [this design doc](https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md) for details)
* and a target percentage for the rollout
We will be adding two new fields:
* a global "safe rate" value, perhaps the 5% every 6 hours thing.
* the prior percentage for the rollout
When the process serving version.storj.io starts, it will look at its configuration
and keep track of the time since the process started. Whenever a request comes in,
it will calculate the current percent using linear interpolation on the prior
percentage, the target percentage, the time since process start, and the rate.
It will then use that to calculate the rollout cursor.
With the above change, the first rollout would be 6.25%, but
then we would only need to push updates every doubling, while not running afoul
of bumping the cursor too much every 6 hours.
## Other options
One downside with the above approach is it is fairly dependent on the process
runtime. Process restarts will restart that phase of the rollout. However,
the benefit we get with the above design is that the version servers remain
stateless. When a process restarts, what will happen is that that phase of the
configured rollout will start over, on account of the time since process start
clock effectively starting over. If the process restarted and the rollout was
halfway through the 25% to the 50% phase, it will start back at 25% and continue
to 50%. While this isn't a "feature" we would choose, this allows us to remain
purely stateless, and only require the configuration file at the time of
redeployment. We do not need databases of any kind for this design to work.
This is a massive benefit in terms of operational simplicity and potential
failure modes.
We could add a start timestamp to the above config, but then merging
configuration updates require getting reviews and merges before time deadlines
which sounds pretty annoying.
If we are okay adding state to the version server (such as a small database)
then many other options are on the table.
We could have the version server keep track of the current rollout so that
process restarts don't defeat it, but even better, we could get rid of needing
Git commits entirely. If the version server kept state or had a database it
could write to, then we could have an admin interface and manage rollout
status entirely through the version server itself and skip pull requests
entirely.
## Wrapup
## Related work
https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md