docs/blueprint: partial rollout automation
Change-Id: I355d7071da7e501948665ee153ed641722a3d264
parent 2592aaef9c
commit e90612c285
docs/blueprints/rollout-automation.md (normal file, 124 lines added)
@@ -0,0 +1,124 @@
# Partial rollout automation

## Abstract

This is a proposal to partially automate the version rollout process.
Right now, for better or worse, the rollout requires manual action at every
5% increment. This is tedious, so we want to fix it.

## Background/context

We have several related problems around keeping storage nodes up to date
safely. We want to:

* Make sure that storage nodes are not running stale or old code.
* Make sure that new code doesn't break many nodes at once.

These two goals are in tension: if we automatically update all nodes
immediately, then we risk an update breaking a significant portion of the
network.

We landed on this design:
https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md
This design allows us to control what percent of the network is eligible for an
update. The system was intended to have an exponentially increasing rollout
scheme, where a storage node update would first go to a small percentage of
the network, maybe 5%, and then we would see what happens. If that 5% looked
good, then maybe we'd double it, and so on, until the whole network was upgraded.

The upside of this exponential scheme is that humans are only involved at growth
points, just to double-check that the network hasn't fallen over. This is an
important feature and something we want to preserve.

The downside of this scheme is that the later stages of the rollout process
potentially involve lots of nodes upgrading at the same moment. Because storage
node operators are often eager to get the latest update as soon as it is
available to them, we risk having half or more of the network down at the same
time for an otherwise safe upgrade.

The data science team suggested that we not do an upgrade increment larger
than 5% every 6 hours.

In practice this means that we are now manually pushing the rollout along in 5%
increments for every release, and it's tiring.

## Design and implementation

The intention of the design below is to resolve the data science team's
concern while still allowing us to do exponentially increasing rollouts.
This design still intentionally requires multiple PRs per rollout, but
instead of one every 5% (20 PRs), we would have just 5 (or maybe
slightly fewer, but not 1):

* a PR to start a 6.25% rollout,
* then a PR for a 12.5% rollout,
* then a PR for a 25% rollout,
* then a PR for a 50% rollout,
* then a PR for a 100% rollout.

The reason for continuing to have more than 1 PR is that we get
valuable feedback both from dashboards about the network and from
the community about how the rollout is doing. We have often stopped
rollouts due to issues discovered by the community or by degraded network
behavior. We would like the default to be that we check first before
we continue the rollout, and not blindly roll it through.

Here is how we will make this work:

Currently, the service behind version.storj.io takes a configuration for each
process type under management. For each process, the configuration needed is:

* the minimum required version
* the suggested version (for the rollout)
* a rollout seed (see [this design doc](https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md) for details)
* a target percentage for the rollout

We will be adding two new fields:

* a global "safe rate" value, perhaps the suggested 5% every 6 hours
* the prior percentage for the rollout

When the process serving version.storj.io starts, it will read its configuration
and keep track of the time since the process started. Whenever a request comes in,
it will calculate the current percentage by linear interpolation: starting at the
prior percentage, advancing at the safe rate for the time elapsed since process
start, and stopping at the target percentage. It will then use that percentage
to calculate the rollout cursor.

With the above change, the first rollout would be to 6.25%, and after that
we would only need to push a configuration update at every doubling, without
ever advancing the cursor faster than the safe rate.

## Other options

One downside of the above approach is that it depends on the process runtime:
the elapsed-time clock begins at process start, so a process restart starts the
current phase of the configured rollout over. If the process restarted when the
rollout was halfway through the 25% to 50% phase, it would start back at 25%
and continue to 50%. While this isn't a "feature" we would choose, the benefit
is that the version servers remain purely stateless and only require the
configuration file at the time of redeployment. We do not need databases of any
kind for this design to work. This is a massive benefit in terms of operational
simplicity and fewer potential failure modes.


We could add a start timestamp to the above config, but then configuration
updates would need to be reviewed and merged before time deadlines,
which sounds pretty annoying.

If we are okay with adding state to the version server (such as a small
database), then many other options are on the table.

We could have the version server keep track of the current rollout so that
process restarts don't reset it, but even better, we could get rid of needing
Git commits entirely. If the version server kept state or had a database it
could write to, then we could have an admin interface and manage rollout
status through the version server itself, skipping pull requests entirely.

## Wrapup

## Related work

https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md