The repair checker and repair worker both need to determine which pieces
are healthy, which are retrievable, and which should be replaced, but
they have been doing it in different ways in different code, which has
been the cause of bugs. The same term could have very similar but subtly
different meanings between the two, causing much confusion.
With this change, the piece- and node-classification logic is
consolidated into one place within the satellite/repair package, so that
both subsystems can use it. This ought to make decision-making code more
concise and more readable.
The consolidated classification logic has been expanded to create more
sets, so that the decision-making code does not need to do as much
precalculation. It should now be clearer in comments and code that a
piece can belong to multiple sets arbitrarily (except where the
definition of the sets makes this logically impossible), and what the
precise meaning of each set is. These sets include Missing, Suspended,
Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair,
UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy.
Some other side effects of this change:
* CreatePutRepairOrderLimits no longer needs to special-case excluded
countries; it can just create as many order limits as requested (by
way of len(newNodes)).
* The repair checker will now queue a segment for repair when there are
any pieces out of placement. The code calls this "forcing a repair".
* The checker.ReliabilityCache is now accessed by way of a GetNodes()
function similar to the one on the overlay. The classification methods
like MissingPieces(), OutOfPlacementPieces(), and
PiecesNodesLastNetsInOrder() are removed in favor of the
classification logic in satellite/repair/classification.go. This
means the reliability cache no longer needs access to the placement
rules or excluded countries list.
Change-Id: I105109fb94ee126952f07d747c6e11131164fadb
as GetParticipatingNodes and GetNodes, respectively.
We now want these functions to include offline and suspended nodes as
well, so that we can force immediate repair when pieces are out of
placement or in excluded countries. With that change, the old names no
longer made sense.
Change-Id: Icbcbad43dbde0ca8cbc80a4d17a896bb89b078b7
As I learned, the `Include` supposed to communicate that some internal change also "included" to the filters during the check -> filters might be stateful.
But it's not the case any more after 552242387, where we removed the only one stateful filter.
Change-Id: I7c36ddadb2defbfa3b6b67bcc115e4427ba9e083
There are cases when we would like to override the default placement=0 rule.
For example when we would like to exclude tagged nodes from the selection (by default).
Therefore we couldn't use a shortcut any more, we should always check the placement rules, even if we use placement=0.
TODO: we need to update common, and rename `EveryCountry` to `DefaultPlacement`, just to avoid confusion.
https://github.com/storj/storj/issues/6126
Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4
placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code.
The 90% of this patch is just search and replace:
* we need to use NodeFilters instead of placement.AllowedCountry
* which means, we need an initialized PlacementRules available everywhere
* which means we need to configure the placement rules
The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change.
Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df
ReliabilityCache will be now using refactored overlay Reliable method.
This method will provide more info about nodes (e.g. country code) and
with this we are able to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder
With those new method we will fix issue where offline but reliable node
won't be checked for clumped pieces and off placement pieces.
https://github.com/storj/storj/issues/5998
Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
Currently we are using Reliable to get missing pieces for repair
checker. The issue is that now checker is looking at more things than
just missing pieces (clumped/off, placement pieces) and using only node
ID is not enough. We have issue where we are skipping offline nodes from
clumped and off placement pieces check.
Reliable was refactored to get data (e.g. country, lastNet) about all
reliable nodes. List is split into online and offline. This data will be
cached for quick use by repair checker. It will be also possible to
check nodes metadata like country code or lastNet.
We are also slowly moving `RepairExcludedCountryCodes` config from
overlay to repair which makes more sens for it.
This this first part of changes.
https://github.com/storj/storj/issues/5998
Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
It looks that monikt monitoring can give high CPU overhead for
segments loop observer. With this code we are changing how monitoring
is initialized for observer methods. This optimization affects mainly
path where segment is healthy and doesn't require repair. Benchmark
is also added to show difference between old and new approach.
Benchmark against 'main':
name old time/op new time/op delta
RemoteSegment/Cockroach/healthy_segment-8 8.55µs ± 4% 1.37µs ± 6% -84.03% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
RemoteSegment/Cockroach/healthy_segment-8 2.63kB ± 0% 0.17kB ± 0% -93.62% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
RemoteSegment/Cockroach/healthy_segment-8 54.0 ± 0% 8.0 ± 0% -85.19% (p=0.008 n=5+5)
Change-Id: Ie138eab0d59e436395b13f57bdfb11f9871d4c18
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
The chief segment health models we've come up with are the "immediate
danger" model and the "survivability" model. The former calculates the
chance of losing a segment becoming lost in the next time period (using
the CDF of the binomial distribution to estimate the chance of x nodes
failing in that period), while the latter estimates the number of
iterations for which a segment can be expected to survive (using the
mean of the negative binomial distribution). The immediate danger model
was a promising one for comparing segment health across segments with
different RS parameters, as it is more precisely what we want to
prevent, but it turns out that practically all segments in production
have infinite health, as the chance of losing segments with any
reasonable estimate of node failure rate is smaller than DBL_EPSILON,
the smallest possible difference from 1.0 representable in a float64
(about 1e-16).
Leaving aside the wisdom of worrying about the repair of segments that
have less than a 1e-16 chance of being lost, we want to be extremely
conservative and proactive in our repair efforts, and the health of the
segments we have been repairing thus far also evaluates to infinity
under the immediate danger model. Thus, we find ourselves reaching for
an alternative.
Dr. Ben saves the day: the survivability model is a reasonably close
approximation of the immediate danger model, and even better, it is
far simpler to calculate and yields manageable values for real-world
segments. The downside to it is that it requires as input an estimate
of the total number of active nodes.
This change replaces the segment health calculation to use the
survivability model, and reinstates the call to SegmentHealth() where it
was reverted. It gets estimates for the total number of active nodes by
leveraging the reliability cache.
Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
* rename pkg/linksharing to linksharing
* rename pkg/httpserver to linksharing/httpserver
* rename pkg/eestream to uplink/eestream
* rename pkg/stream to uplink/stream
* rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo
* rename pkg/auth/signing to pkg/signing
* rename pkg/storage to uplink/storage
* rename pkg/accounting to satellite/accounting
* rename pkg/audit to satellite/audit
* rename pkg/certdb to satellite/certdb
* rename pkg/discovery to satellite/discovery
* rename pkg/overlay to satellite/overlay
* rename pkg/datarepair to satellite/repair