Commit Graph

4 Commits

Author SHA1 Message Date
Egon Elbre
566fc8ee25 satellite/repair: test inmemory/disk difference only once
We don't need to have every single test for both, only one for
each should be sufficient. For all other tests it doesn't matter
which one we use.

Change-Id: I9962206a4ee025d367332c29ea3e6bc9f0f9a1de
2022-03-29 14:08:13 +03:00
paul cannon
d3604a5e90 satellite/repair: use survivability model for segment health
The chief segment health models we've come up with are the "immediate
danger" model and the "survivability" model. The former calculates the
chance of losing a segment becoming lost in the next time period (using
the CDF of the binomial distribution to estimate the chance of x nodes
failing in that period), while the latter estimates the number of
iterations for which a segment can be expected to survive (using the
mean of the negative binomial distribution). The immediate danger model
was a promising one for comparing segment health across segments with
different RS parameters, as it is more precisely what we want to
prevent, but it turns out that practically all segments in production
have infinite health, as the chance of losing segments with any
reasonable estimate of node failure rate is smaller than DBL_EPSILON,
the smallest possible difference from 1.0 representable in a float64
(about 1e-16).

Leaving aside the wisdom of worrying about the repair of segments that
have less than a 1e-16 chance of being lost, we want to be extremely
conservative and proactive in our repair efforts, and the health of the
segments we have been repairing thus far also evaluates to infinity
under the immediate danger model. Thus, we find ourselves reaching for
an alternative.

Dr. Ben saves the day: the survivability model is a reasonably close
approximation of the immediate danger model, and even better, it is
far simpler to calculate and yields manageable values for real-world
segments. The downside to it is that it requires as input an estimate
of the total number of active nodes.

This change replaces the segment health calculation to use the
survivability model, and reinstates the call to SegmentHealth() where it
was reverted. It gets estimates for the total number of active nodes by
leveraging the reliability cache.

Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
2020-12-17 21:30:17 +00:00
Moby von Briesen
0ec685b173 satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.

This change just adds the new column in a migration. A separate change
will add the new health function.

Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.

Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.

Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
2020-11-16 21:18:09 +00:00
paul cannon
3e56403599 satellite/repair: add a repair health function
This will be used to rank segments in need of repair for attention by
the repair workers.

Change-Id: I5b70650cec933696b4c6d73bb7efb97e3efdf24a
2020-11-11 18:48:51 +00:00