docs/blueprint: partial rollout automation

Change-Id: I355d7071da7e501948665ee153ed641722a3d264
2023-04-17 17:09:10 -04:00 · 2023-04-17 17:09:10 -04:00 · e90612c285
commit e90612c285
parent 2592aaef9c
1 changed files with 124 additions and 0 deletions
--- a/docs/blueprints/rollout-automation.md
+++ b/docs/blueprints/rollout-automation.md
@ -0,0 +1,124 @@
+# Partial rollout automation
+
+## Abstract
+
+This is a proposal to partially automate some of the version rollout process.
+The version rollout process right now, for better or worse, is taking manual
+action every 5%. This is bad so we want to fix it.
+
+## Background/context
+
+We have a bunch of different problems around safely keeping storage nodes
+up to date in a safe way. We want to:
+
+ * Make sure that storage nodes are not running stale or old code.
+ * Make sure that new code doesn't break many nodes at once.
+
+These two goals are at a bit of a tension - if we automatically update all nodes
+immediately, then we risk an update breaking a significant portion of the
+network.
+
+We landed on this design:
+https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md
+This design allows us to control what percent of the network is eligible for an
+update. The design of this system intended to have an exponentially increasing
+rollout scheme, where storage node updates would upgrade a small percentage
+of the network, maybe 5%, and then see what happens. If that 5% looked good,
+then maybe we'd double it, and so on, until the network was upgraded.
+
+The upside to this exponential scheme is that humans are only involved at growth
+points just to double check that the network hasn't fallen over. This is an
+important feature and something we want to preserve.
+
+The downside of this scheme is that the later stages of the rollout process
+potentially involve lots of nodes upgrading at the same moment. Because storage
+node operators are often eager to get the latest update when it is available
+to them, we potentially risk having half or more of the network down for an
+otherwise safe upgrade at a time.
+
+The data science team suggested we don't do an upgrade increment larger than 5%
+every 6 hours.
+
+In practice this means that we are now pushing the rollout along at 5%
+increments every rollout, and it's tiring.
+
+## Design and implementation
+
+The intention of the below design is to resolve the data science team's
+concern, while still allowing us to do exponentially increasing rollouts.
+This design still intentionally requires multiple PRs per rollout, but
+instead of one every 5% (20 PRs), we would have just 5 (or maybe
+slightly less, but not 1):
+
+ * a PR to start a 6.25% rollout,
+ * then a PR for a 12.5% rollout,
+ * then a PR for a 25% rollout,
+ * then a PR for a 50% rollout,
+ * then a PR for a 100% rollout.
+
+The reason for continuing to have more than 1 PR is because we get
+valuable feedback both from dashboards about the network and from
+the community about how the rollout is doing. We have often stopped
+rollouts due to issues discovered by the community or by degraded network
+behavior. We would like the default to be that we check first before
+we continue the rollout, and not blindly roll it through.
+
+Here is how we will make this work:
+
+Currently, version.storj.io's service takes a configuration for each process
+type under management. For each process, the configuration needed is:
+
+ * The minimum required version
+ * The suggested version (for the rollout)
+ * A rollout seed (see [this design doc](https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md) for details)
+ * and a target percentage for the rollout
+
+We will be adding two new fields:
+
+ * a global "safe rate" value, perhaps the 5% every 6 hours thing.
+ * the prior percentage for the rollout
+
+When the process serving version.storj.io starts, it will look at its configuration
+and keep track of the time since the process started. Whenever a request comes in,
+it will calculate the current percent using linear interpolation on the prior
+percentage, the target percentage, the time since process start, and the rate.
+It will then use that to calculate the rollout cursor.
+
+With the above change, the first rollout would be 6.25%, but
+then we would only need to push updates every doubling, while not running afoul
+of bumping the cursor too much every 6 hours.
+
+## Other options
+
+One downside with the above approach is it is fairly dependent on the process
+runtime. Process restarts will restart that phase of the rollout. However,
+the benefit we get with the above design is that the version servers remain
+stateless. When a process restarts, what will happen is that that phase of the
+configured rollout will start over, on account of the time since process start
+clock effectively starting over. If the process restarted and the rollout was
+halfway through the 25% to the 50% phase, it will start back at 25% and continue
+to 50%. While this isn't a "feature" we would choose, this allows us to remain
+purely stateless, and only require the configuration file at the time of 
+redeployment. We do not need databases of any kind for this design to work. 
+This is a massive benefit in terms of operational simplicity and potential
+failure modes.
+
+We could add a start timestamp to the above config, but then merging
+configuration updates require getting reviews and merges before time deadlines
+which sounds pretty annoying.
+
+If we are okay adding state to the version server (such as a small database)
+then many other options are on the table.
+
+We could have the version server keep track of the current rollout so that
+process restarts don't defeat it, but even better, we could get rid of needing
+Git commits entirely. If the version server kept state or had a database it
+could write to, then we could have an admin interface and manage rollout
+status entirely through the version server itself and skip pull requests
+entirely.
+
+## Wrapup
+
+## Related work
+
+https://github.com/storj/storj/blob/main/docs/blueprints/storage-node-automatic-updater.md