This fixes two small issues with logging:
* the repair checker was logging empty last_net values as clumped pieces,
even though the main logic no longer classifies them that way
* the repairer summary log was showing an incorrect number of pieces removed
from a segment, because the list contained duplicated entries
There was no real issue here beyond the incorrect log output.
Change-Id: Ifc6a83637d621d628598200cad00cce44fa4cbf9
ClassifySegmentPieces now uses a custom set implementation instead of a map.
Side note: for the custom set implementation I also tried an int8 bit set, but
it didn't give better performance, so I went with the simpler implementation.
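For illustration only, one possible shape for such a set is a small bit set
keyed by piece number; this is an assumed sketch, not the actual implementation:

    // pieceSet is a bit set over piece numbers; Add and Contains do not allocate.
    type pieceSet []uint64

    func newPieceSet(size int) pieceSet {
        return make(pieceSet, (size+63)/64)
    }

    func (s pieceSet) Add(pieceNum uint16) {
        s[pieceNum/64] |= 1 << (pieceNum % 64)
    }

    func (s pieceSet) Contains(pieceNum uint16) bool {
        return s[pieceNum/64]&(1<<(pieceNum%64)) != 0
    }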
Benchmark results (compared against part 2 optimization change):
name old time/op new time/op delta
RemoteSegment/healthy_segment-8 21.7µs ± 8% 15.4µs ±16% -29.38% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
RemoteSegment/healthy_segment-8 7.41kB ± 0% 1.87kB ± 0% -74.83% (p=0.000 n=5+4)
name old allocs/op new allocs/op delta
RemoteSegment/healthy_segment-8 150 ± 0% 130 ± 0% -13.33% (p=0.008 n=5+5)
Change-Id: I21feca9ec6ac0a2558ac5ce8894451c54f69e52d
Optimizing the collection of monkit metrics:
* initialize metrics once at the beginning
* avoid using a string-keyed map to look up the stats struct for each
redundancy scheme (see the sketch below)
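A minimal sketch of both points, with assumed names (the real checker code may
differ): metrics are created once as package variables, and per-redundancy
stats are looked up with a comparable struct key instead of a formatted string:

    package checker

    import "github.com/spacemonkeygo/monkit/v3"

    var mon = monkit.Package()

    // Created once at startup instead of being looked up by name per segment.
    var injuredSegmentsCounter = mon.Counter("checker_injured_segments")

    // rsKey is comparable, so no string key needs to be formatted per segment.
    type rsKey struct{ required, repair, optimal, total int16 }

    type stats struct{ injuredSegments int64 }

    type statsCollector struct{ byRS map[rsKey]*stats }

    func (c *statsCollector) get(k rsKey) *stats {
        s, ok := c.byRS[k]
        if !ok {
            s = &stats{}
            c.byRS[k] = s
        }
        return s
    }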
Benchmark results (compared against part 1 optimization change):
name old time/op new time/op delta
RemoteSegment/Cockroach/healthy_segment-8 31.4µs ± 6% 21.7µs ± 8% -30.73% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
RemoteSegment/healthy_segment-8 10.2kB ± 0% 7.4kB ± 0% -27.03% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
RemoteSegment/healthy_segment-8 250 ± 0% 150 ± 0% -40.00% (p=0.008 n=5+5)
Change-Id: Ie09476eb469a4d6c09e52550c8ba92b3b4b34271
Currently, graceful exit is a complicated subsystem that keeps a queue
of all pieces expected to be on a node, and asks the node to transfer
those pieces to other nodes one by one. The complexity of the system
has, unfortunately, led to numerous bugs and unexpected behaviors.
We have decided to remove this entire subsystem and restructure graceful
exit as follows:
* Nodes will signal their intent to exit gracefully
* The satellite will not send any new pieces to gracefully exiting nodes
* Pieces on gracefully exiting nodes will be considered by the repair
subsystem as "retrievable but unhealthy". They will be repaired off of
the exiting node as needed.
* After one month (with an appropriately high online score), the node
will be considered exited, and held amounts for the node will be
released. The repair worker will continue to fetch pieces from the
node as long as the node stays online.
* If, at the end of the month, a node's online score is below a certain
threshold, its graceful exit will fail.
Refs: https://github.com/storj/storj/issues/6042
Change-Id: I52d4e07a4198e9cb2adf5e6cee2cb64d6f9f426b
The repair checker and repair worker both need to determine which pieces
are healthy, which are retrievable, and which should be replaced, but
they have been doing it in different ways in different code, which has
been the cause of bugs. The same term could have very similar but subtly
different meanings between the two, causing much confusion.
With this change, the piece- and node-classification logic is
consolidated into one place within the satellite/repair package, so that
both subsystems can use it. This ought to make decision-making code more
concise and more readable.
The consolidated classification logic has been expanded to create more
sets, so that the decision-making code does not need to do as much
precalculation. It should now be clearer in comments and code that a
piece can belong to multiple sets arbitrarily (except where the
definition of the sets makes this logically impossible), and what the
precise meaning of each set is. These sets include Missing, Suspended,
Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair,
UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy.
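For illustration, the result of the consolidated classification can be pictured
as a collection of piece-number sets along these lines; the set names come from
this change, but the field types and one-line meanings here are only an
approximation, and the exact definitions live in
satellite/repair/classification.go:

    // A piece may appear in several sets at once, except where the definitions
    // make that impossible (for example Healthy and Unhealthy are disjoint).
    type pieceSets struct {
        Missing              map[uint16]struct{} // cannot currently be retrieved
        Suspended            map[uint16]struct{} // held by suspended nodes
        Clumped              map[uint16]struct{} // share a last_net with another piece
        OutOfPlacement       map[uint16]struct{} // held by nodes outside the segment placement
        InExcludedCountry    map[uint16]struct{} // held by nodes in repair-excluded countries
        ForcingRepair        map[uint16]struct{} // their presence alone queues the segment for repair
        UnhealthyRetrievable map[uint16]struct{} // unhealthy but still downloadable
        Unhealthy            map[uint16]struct{} // pieces that do not count as healthy
        Retrievable          map[uint16]struct{} // pieces that can still be downloaded
        Healthy              map[uint16]struct{} // retrievable and not unhealthy
    }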
Some other side effects of this change:
* CreatePutRepairOrderLimits no longer needs to special-case excluded
countries; it can just create as many order limits as requested (by
way of len(newNodes)).
* The repair checker will now queue a segment for repair when there are
any pieces out of placement. The code calls this "forcing a repair".
* The checker.ReliabilityCache is now accessed by way of a GetNodes()
function similar to the one on the overlay. The classification methods
like MissingPieces(), OutOfPlacementPieces(), and
PiecesNodesLastNetsInOrder() are removed in favor of the
classification logic in satellite/repair/classification.go. This
means the reliability cache no longer needs access to the placement
rules or excluded countries list.
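As a rough sketch of the new access pattern (all names and signatures here are
assumptions, not the real API): the cache hands back node records for the
requested IDs and leaves piece classification to the shared logic:

    type nodeRecord struct {
        Online      bool
        LastNet     string
        CountryCode string
    }

    type reliabilityCache struct {
        nodes map[string]nodeRecord // keyed by node ID
    }

    // GetNodes returns cached records in the same order as the requested IDs;
    // callers pass the result to the classification code in satellite/repair.
    func (c *reliabilityCache) GetNodes(ids []string) []nodeRecord {
        out := make([]nodeRecord, len(ids))
        for i, id := range ids {
            out[i] = c.nodes[id]
        }
        return out
    }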
Change-Id: I105109fb94ee126952f07d747c6e11131164fadb
This patch is essentially a one-liner: the rangedloop checker should check
subnets only if the check is not turned off with a placement annotation
(see satellite/repair/checker/observer.go).
But I didn't find any unit test covering that part, so I had to write one. I
preferred to write it as a unit test rather than an integration test, which
requires a mock repair queue (observer_unit_test.go, mock.go).
Because it's a small change, I also included another small one: a helper
method to check whether the AutoExcludeSubnet annotation is defined.
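A minimal sketch of the helper's intent (the names below are assumptions, not
the real API): the subnet/declumping check runs unless the placement annotation
explicitly turns it off:

    const autoExcludeSubnetOff = "off"

    // subnetCheckEnabled reports whether the rangedloop checker should look
    // for pieces clumped on the same subnet under this placement's annotations.
    func subnetCheckEnabled(annotations map[string]string) bool {
        return annotations["autoExcludeSubnet"] != autoExcludeSubnetOff
    }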
Change-Id: I2666b937074ab57f603b356408ef108cd55bd6fd
There are cases when we would like to override the default placement=0 rule,
for example when we would like to exclude tagged nodes from selection by
default.
Therefore we can no longer use a shortcut: we should always check the
placement rules, even for placement=0.
TODO: we need to update common and rename `EveryCountry` to
`DefaultPlacement`, just to avoid confusion.
https://github.com/storj/storj/issues/6126
Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4
Currently we have an issue where, while counting unhealthy pieces, a piece
that is both in an excluded country and outside the segment placement gets
counted twice. This can cause unnecessary repair.
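A minimal sketch of the intended counting (illustrative code, with set names
borrowed from the classification change): a piece that is both out of placement
and in an excluded country must be counted once, not twice:

    func countUnhealthy(outOfPlacement, inExcludedCountry map[uint16]struct{}) int {
        unhealthy := make(map[uint16]struct{}, len(outOfPlacement)+len(inExcludedCountry))
        for piece := range outOfPlacement {
            unhealthy[piece] = struct{}{}
        }
        for piece := range inExcludedCountry {
            unhealthy[piece] = struct{}{}
        }
        // The union is what matters; summing the two lengths double counts
        // pieces that belong to both sets.
        return len(unhealthy)
    }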
This change also takes another step toward moving RepairExcludedCountryCodes
from the overlay config into the repair package.
Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
placement.AllowedCountry is the old way to specify placement; with the new
approach we can use a more generic (dynamic) method, which can check full node
information instead of just the country code.
90% of this patch is just search and replace:
* we need to use NodeFilters instead of placement.AllowedCountry
* which means we need an initialized PlacementRules available everywhere
* which means we need to configure the placement rules
The remaining 10% is placement.go, where we introduce a new type of
configuration (a lightweight expression language) to define any kind of
placement without code changes.
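For illustration, the filter idea looks roughly like this (the type names are
simplified assumptions, not the real node selection API):

    type selectedNode struct {
        CountryCode string
        Tags        map[string]string
    }

    // nodeFilter decides whether a node is acceptable for a placement.
    type nodeFilter interface {
        Match(node *selectedNode) bool
    }

    // countryFilter reproduces what placement.AllowedCountry used to express.
    type countryFilter struct{ allowed map[string]bool }

    func (f countryFilter) Match(node *selectedNode) bool {
        return f.allowed[node.CountryCode]
    }

    // tagFilter shows why a filter can use full node information, not just
    // the country code.
    type tagFilter struct{ key, value string }

    func (f tagFilter) Match(node *selectedNode) bool {
        return node.Tags[f.key] == f.value
    }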
Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df
ReliabilityCache now uses the refactored overlay Reliable method.
This method provides more info about nodes (e.g. country code), which allows
us to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder
With those new methods we fix an issue where offline but reliable nodes were
not checked for clumped pieces and out-of-placement pieces.
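A rough sketch of the shape of those helpers, with assumed parameter and return
types (the real signatures may differ):

    type pieceRef struct {
        Number uint16 // piece number within the segment
        NodeID string // node currently holding the piece
    }

    type pieceClassifier interface {
        // OutOfPlacementPieces returns pieces held by nodes that do not
        // satisfy the segment's placement, including offline nodes.
        OutOfPlacementPieces(pieces []pieceRef, placement int) []pieceRef
        // PiecesNodesLastNetsInOrder returns the last_net of each piece's
        // node, in piece order, so the clumped check also covers offline nodes.
        PiecesNodesLastNetsInOrder(pieces []pieceRef) []string
    }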
https://github.com/storj/storj/issues/5998
Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
Currently we are using Reliable to get missing pieces for the repair checker.
The issue is that the checker now looks at more than just missing pieces
(clumped and out-of-placement pieces), and using only node IDs is not enough.
We have an issue where offline nodes are skipped in the clumped and
out-of-placement piece checks.
Reliable was refactored to return data (e.g. country, lastNet) about all
reliable nodes. The list is split into online and offline nodes. This data
will be cached for quick use by the repair checker, and it will also be
possible to check node metadata like country code or lastNet.
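A minimal sketch (field names assumed) of the data the refactored Reliable call
returns and the online/offline split that the checker caches:

    type nodeInfo struct {
        ID          string // node ID, simplified to a string here
        LastNet     string // /24 subnet, needed for the clumped-pieces check
        CountryCode string // needed for placement and excluded-country checks
    }

    // reliableNodes keeps offline-but-reliable nodes visible to the checker,
    // so they still take part in the clumped and out-of-placement checks.
    type reliableNodes struct {
        Online  []nodeInfo
        Offline []nodeInfo
    }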
We are also slowly moving the `RepairExcludedCountryCodes` config from
overlay to repair, where it makes more sense.
This is the first part of the changes.
https://github.com/storj/storj/issues/5998
Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
We were reusing a slice to save on allocations, but it turns out the
function using it was being called in multiple goroutines at the same
time.
This is definitely a problem with repairer/segments.go. I'm not 100%
sure if it also is a problem with checker/observer.go, but I'm making
the change there as well to be on the safe side for now.
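A minimal sketch of the bug and the fix (the types and names are illustrative,
not the actual repairer code): a scratch slice stored on a shared struct is not
safe when the method runs in several goroutines, so the fix is to allocate per
call:

    type node struct{ lastNet string }

    type worker struct {
        scratch []string // shared buffer: racy when used from multiple goroutines
    }

    // Racy: two goroutines calling this concurrently overwrite each other's data.
    func (w *worker) lastNetsRacy(nodes []node) []string {
        w.scratch = w.scratch[:0]
        for _, n := range nodes {
            w.scratch = append(w.scratch, n.lastNet)
        }
        return w.scratch
    }

    // Safe: allocate per call (or use a sync.Pool if the allocation matters).
    func lastNets(nodes []node) []string {
        out := make([]string, 0, len(nodes))
        for _, n := range nodes {
            out = append(out, n.lastNet)
        }
        return out
    }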
Repair workers only ran with this bug on testing satellites, and it
looks like the worst that could have happened was that we repaired
pieces off of well-behaved, non-clumped, in-placement nodes by mistake.
Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d
When qualifying a segment for repair, the checker now looks at piece
locations; if pieces are outside the segment placement, the segment is put
into the repair queue.
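A rough sketch of the check (illustrative names, not the actual checker code):
if any piece's node falls outside the allowed placement, the segment goes into
the repair queue:

    type piecePlacementInfo struct {
        Number      uint16
        CountryCode string
    }

    // outOfPlacement returns the pieces whose node location is not allowed by
    // the segment's placement; a non-empty result queues the segment for repair.
    func outOfPlacement(pieces []piecePlacementInfo, allowed func(country string) bool) []uint16 {
        var out []uint16
        for _, p := range pieces {
            if !allowed(p.CountryCode) {
                out = append(out, p.Number)
            }
        }
        return out
    }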
Fixes https://github.com/storj/storj/issues/5895
Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7
Rather than only logging the last_nets we see in clumpedPieces, this will run
through all the last_nets and log any that have more than one node. This
should have the same outcome, except the counts will be 1 higher (because
FindClumpedPieces won't include the first node found in a clumped network, and
this will).
This should be quite a bit faster.
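A minimal sketch of the counting approach (illustrative code): count nodes per
last_net and keep only the nets that hold more than one node:

    // clumpedNets returns, for each last_net shared by more than one node,
    // how many nodes share it. Unlike FindClumpedPieces, the first node in a
    // clumped network is included, so counts are one higher.
    func clumpedNets(lastNets []string) map[string]int {
        counts := make(map[string]int, len(lastNets))
        for _, net := range lastNets {
            counts[net]++
        }
        for net, n := range counts {
            if n < 2 {
                delete(counts, net)
            }
        }
        return counts
    }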
Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.
Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.
Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
Clumped segments (segments with multiple pieces on the same subnet) may
need repair, but the clumped pieces are considered retrievable and we
don't need to call such segments irreparable.
We do want to know where they're coming from, though, if we can, because
we are seeing more than expected.
Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9
We will remove the segments loop soon, so first we need to move the Segment
definition to the rangedloop package.
https://github.com/storj/storj/issues/5237
Change-Id: Ibe6aad316ffb7073cc4de166f1f17b87aac07363
Additional elements added:
* a monkit metric for the observer methods Start/Fork/Join/Finish, to be able
to check how much time those methods take (see the sketch after this list)
* a few more logs, e.g. entries with the processed range
* a segmentsProcessed metric to be able to check loop progress
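A minimal sketch of how such timings are typically collected with monkit (the
observer type and the signatures here are assumptions):

    package rangedloop

    import (
        "context"
        "time"

        "github.com/spacemonkeygo/monkit/v3"
    )

    var mon = monkit.Package()

    type timedObserver struct{}

    // Start shows the usual pattern: mon.Task records how long the method took
    // and whether it returned an error, under this package's monkit scope.
    func (o *timedObserver) Start(ctx context.Context, start time.Time) (err error) {
        defer mon.Task()(&ctx)(&err)
        // ... observer start logic ...
        return nil
    }

    // segmentsProcessed-style progress metric, observed after each range.
    func recordProgress(processed int64) {
        mon.IntVal("segmentsProcessed").Observe(processed)
    }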
Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec
Implemented the observer and partial, and created new structures so the
monkit metrics stay the same as in the segment loop.
Change-Id: I209c126096c84b94d4717332e56238266f6cd004