storj/satellite/repair/priority.go

56 lines
2.4 KiB
Go
Raw Normal View History

// Copyright (C) 2020 Storj Labs, Inc.
// See LICENSE for copying information.
package repair
satellite/repair: use survivability model for segment health The chief segment health models we've come up with are the "immediate danger" model and the "survivability" model. The former calculates the chance of losing a segment becoming lost in the next time period (using the CDF of the binomial distribution to estimate the chance of x nodes failing in that period), while the latter estimates the number of iterations for which a segment can be expected to survive (using the mean of the negative binomial distribution). The immediate danger model was a promising one for comparing segment health across segments with different RS parameters, as it is more precisely what we want to prevent, but it turns out that practically all segments in production have infinite health, as the chance of losing segments with any reasonable estimate of node failure rate is smaller than DBL_EPSILON, the smallest possible difference from 1.0 representable in a float64 (about 1e-16). Leaving aside the wisdom of worrying about the repair of segments that have less than a 1e-16 chance of being lost, we want to be extremely conservative and proactive in our repair efforts, and the health of the segments we have been repairing thus far also evaluates to infinity under the immediate danger model. Thus, we find ourselves reaching for an alternative. Dr. Ben saves the day: the survivability model is a reasonably close approximation of the immediate danger model, and even better, it is far simpler to calculate and yields manageable values for real-world segments. The downside to it is that it requires as input an estimate of the total number of active nodes. This change replaces the segment health calculation to use the survivability model, and reinstates the call to SegmentHealth() where it was reverted. It gets estimates for the total number of active nodes by leveraging the reliability cache. Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
2020-12-08 04:18:00 +00:00
import "math"
// SegmentHealth returns a value corresponding to the health of a segment in the
// repair queue. Lower health segments should be repaired first.
//
// This calculation purports to find the number of iterations for which a
// segment can be expected to survive, with the given failureRate. The number of
// iterations for the segment to survive (X) can be modeled with the negative
// binomial distribution, with the number of pieces that must be lost as the
// success threshold r, and the chance of losing a single piece in a round as
// the trial success probability p.
//
// First, we calculate the expected number of iterations for a segment to
// survive if we were to lose exactly one node every iteration:
//
// r = numHealthy - minPieces + 1
// p = (totalNodes - numHealthy) / totalNodes
// X ~ NB(r, p)
//
// Then we take the mean of that distribution to use as our expected value,
// which is pr/(1-p).
//
// Finally, to get away from the "one node per iteration" simplification, we
// just scale the magnitude of the iterations in the model so that there really
// is one node being lost. For example, if our failureRate and totalNodes imply
// a churn rate of 3 nodes per day, we just take 1/3 of a day and call that an
// "iteration" for purposes of the model. To convert iterations in the model to
// days, we divide the mean of the negative binomial distribution (X, above) by
// the number of nodes that we estimate will churn in one day.
func SegmentHealth(numHealthy, minPieces, totalNodes int, failureRate float64) float64 {
churnPerRound := float64(totalNodes) * failureRate
if churnPerRound < minChurnPerRound {
// we artificially limit churnPerRound from going too low in cases
// where there are not many nodes, so that health values do not
// start to approach the floating point maximum
churnPerRound = minChurnPerRound
}
p := float64(totalNodes-numHealthy) / float64(totalNodes)
if p == 1.0 {
// floating point precision is insufficient to represent the difference
// from p to 1. there are too many nodes for this model, or else
// numHealthy is 0 somehow. we can't proceed with the normal calculation
// or we will divide by zero.
return math.Inf(1)
}
mean1 := float64(numHealthy-minPieces+1) * p / (1 - p)
return mean1 / churnPerRound
}
satellite/repair: use survivability model for segment health The chief segment health models we've come up with are the "immediate danger" model and the "survivability" model. The former calculates the chance of losing a segment becoming lost in the next time period (using the CDF of the binomial distribution to estimate the chance of x nodes failing in that period), while the latter estimates the number of iterations for which a segment can be expected to survive (using the mean of the negative binomial distribution). The immediate danger model was a promising one for comparing segment health across segments with different RS parameters, as it is more precisely what we want to prevent, but it turns out that practically all segments in production have infinite health, as the chance of losing segments with any reasonable estimate of node failure rate is smaller than DBL_EPSILON, the smallest possible difference from 1.0 representable in a float64 (about 1e-16). Leaving aside the wisdom of worrying about the repair of segments that have less than a 1e-16 chance of being lost, we want to be extremely conservative and proactive in our repair efforts, and the health of the segments we have been repairing thus far also evaluates to infinity under the immediate danger model. Thus, we find ourselves reaching for an alternative. Dr. Ben saves the day: the survivability model is a reasonably close approximation of the immediate danger model, and even better, it is far simpler to calculate and yields manageable values for real-world segments. The downside to it is that it requires as input an estimate of the total number of active nodes. This change replaces the segment health calculation to use the survivability model, and reinstates the call to SegmentHealth() where it was reverted. It gets estimates for the total number of active nodes by leveraging the reliability cache. Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
2020-12-08 04:18:00 +00:00
const minChurnPerRound = 1e-10