Commit Graph

23 Commits

Author SHA1 Message Date
Michal Niewrzal
f5d717735b satellite/repair: fix checker and repairer logging
This is fixing two small issues with logging:
* repair checker was logging empty last net values as clumped pieces
but in main logic we stopped classifying it this way
* repairer summary log was showing incorrect number of pieces removed
from segment because list contains duplicated entries

There was no real issue here.

Change-Id: Ifc6a83637d621d628598200cad00cce44fa4cbf9
2023-10-25 22:55:53 +00:00
Márton Elek
188aa3011b satellite/repair/checker: report checker_segment_off_placement_count per placement
Change-Id: Ic1639899f8f0b55c4ef8fe246e7efc0a5d9a2bc1
2023-10-19 11:59:56 +00:00
Michal Niewrzal
0eaf43120b satellite/repair/checker: optimize processing, part 3
ClassifySegmentPieces uses custom set implementation instead map.

Side note, for custom set implementation I also checked int8 bit set but
it didn't give better performance so I used simpler implementation.

Benchmark results (compared against part 2 optimization change):
name                                       old time/op    new time/op    delta
RemoteSegment/healthy_segment-8    21.7µs ± 8%    15.4µs ±16%  -29.38%  (p=0.008 n=5+5)

name                                       old alloc/op   new alloc/op   delta
RemoteSegment/healthy_segment-8    7.41kB ± 0%    1.87kB ± 0%  -74.83%  (p=0.000 n=5+4)

name                                       old allocs/op  new allocs/op  delta
RemoteSegment/healthy_segment-8       150 ± 0%       130 ± 0%  -13.33%  (p=0.008 n=5+5)

Change-Id: I21feca9ec6ac0a2558ac5ce8894451c54f69e52d
2023-10-16 12:06:16 +00:00
Michal Niewrzal
e3e303754b satellite/repair/checker: optimize processing, part 2
Optimizing collecting monkit metrics:
* initialize metrics once at the begining
* avoid using string in map for getting stats structs per redundancy

Benchmark results (compared against part 1 optimization change):
name                                       old time/op    new time/op    delta
RemoteSegment/Cockroach/healthy_segment-8    31.4µs ± 6%    21.7µs ± 8%  -30.73%  (p=0.008 n=5+5)

name                                       old alloc/op   new alloc/op   delta
RemoteSegment/healthy_segment-8    10.2kB ± 0%     7.4kB ± 0%  -27.03%  (p=0.008 n=5+5)

name                                       old allocs/op  new allocs/op  delta
RemoteSegment/healthy_segment-8       250 ± 0%       150 ± 0%  -40.00%  (p=0.008 n=5+5)

Change-Id: Ie09476eb469a4d6c09e52550c8ba92b3b4b34271
2023-10-12 10:02:53 +02:00
Michal Niewrzal
de4559d862 satellite/repair/checker: optimize processing, part 1
Optimization by reusing more slices.

Benchmark result:
name                                       old time/op    new time/op    delta
RemoteSegment/healthy_segment-8    33.2µs ± 1%    31.4µs ± 6%   -5.49%  (p=0.032 n=4+5)

name                                       old alloc/op   new alloc/op   delta
RemoteSegment/healthy_segment-8    15.9kB ± 0%    10.2kB ± 0%  -35.92%  (p=0.008 n=5+5)

name                                       old allocs/op  new allocs/op  delta
RemoteSegment/healthy_segment-8       280 ± 0%       250 ± 0%  -10.71%  (p=0.008 n=5+5)

Change-Id: I60462169285462dee6cd16d4f4ce1f30fb6cdfdf
2023-10-11 15:50:29 +00:00
paul cannon
72189330fd satellite/gracefulexit: revamp graceful exit
Currently, graceful exit is a complicated subsystem that keeps a queue
of all pieces expected to be on a node, and asks the node to transfer
those pieces to other nodes one by one. The complexity of the system
has, unfortunately, led to numerous bugs and unexpected behaviors.

We have decided to remove this entire subsystem and restructure graceful
exit as follows:

* Nodes will signal their intent to exit gracefully
* The satellite will not send any new pieces to gracefully exiting nodes
* Pieces on gracefully exiting nodes will be considered by the repair
  subsystem as "retrievable but unhealthy". They will be repaired off of
  the exiting node as needed.
* After one month (with an appropriately high online score), the node
  will be considered exited, and held amounts for the node will be
  released. The repair worker will continue to fetch pieces from the
  node as long as the node stays online.
* If, at the end of the month, a node's online score is below a certain
  threshold, its graceful exit will fail.

Refs: https://github.com/storj/storj/issues/6042
Change-Id: I52d4e07a4198e9cb2adf5e6cee2cb64d6f9f426b
2023-09-27 08:40:01 +00:00
paul cannon
1b8bd6c082 satellite/repair: unify repair logic
The repair checker and repair worker both need to determine which pieces
are healthy, which are retrievable, and which should be replaced, but
they have been doing it in different ways in different code, which has
been the cause of bugs. The same term could have very similar but subtly
different meanings between the two, causing much confusion.

With this change, the piece- and node-classification logic is
consolidated into one place within the satellite/repair package, so that
both subsystems can use it. This ought to make decision-making code more
concise and more readable.

The consolidated classification logic has been expanded to create more
sets, so that the decision-making code does not need to do as much
precalculation. It should now be clearer in comments and code that a
piece can belong to multiple sets arbitrarily (except where the
definition of the sets makes this logically impossible), and what the
precise meaning of each set is. These sets include Missing, Suspended,
Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair,
UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy.

Some other side effects of this change:

* CreatePutRepairOrderLimits no longer needs to special-case excluded
  countries; it can just create as many order limits as requested (by
  way of len(newNodes)).
* The repair checker will now queue a segment for repair when there are
  any pieces out of placement. The code calls this "forcing a repair".
* The checker.ReliabilityCache is now accessed by way of a GetNodes()
  function similar to the one on the overlay. The classification methods
  like MissingPieces(), OutOfPlacementPieces(), and
  PiecesNodesLastNetsInOrder() are removed in favor of the
  classification logic in satellite/repair/classification.go. This
  means the reliability cache no longer needs access to the placement
  rules or excluded countries list.

Change-Id: I105109fb94ee126952f07d747c6e11131164fadb
2023-09-25 09:42:08 -05:00
Márton Elek
b4fdc49194 satellite/repair/checker: persist placement information to the queue
Change-Id: I51c7fd5a2a38f9f6620c16eddaed3b4915ffd792
2023-09-25 09:33:46 +00:00
Márton Elek
84ea80c1fd satellite/repair/checker: respect autoExcludeSubnet anntation in checker rangedloop
This patch is a oneliner: rangedloop checker should check the subnets only if it's not turned off with placement annotation.
(see in satellite/repair/checker/observer.go).

But I didn't find any unit test to cover that part, so I had to write one, and I prefered to write it as a unit test not an integration test, which requires a mock repair queue (observer_unit_test.go mock.go).

Because it's small change, I also included a small change: creating a elper method to check if AutoExcludeSubnet annotation is defined

Change-Id: I2666b937074ab57f603b356408ef108cd55bd6fd
2023-08-23 13:45:09 +00:00
Márton Elek
da08117fcd satellite/~placement: do not ignore placement check for placement=0
There are cases when we would like to override the default placement=0 rule.

For example when we would like to exclude tagged nodes from the selection (by default).

Therefore we couldn't use a shortcut any more, we should always check the placement rules, even if we use placement=0.

TODO: we need to update common, and rename `EveryCountry` to `DefaultPlacement`, just to avoid confusion.

https://github.com/storj/storj/issues/6126

Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4
2023-08-16 07:06:56 +00:00
Michal Niewrzal
1d62dc63f5 satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation
Currently, we have issue were while counting unhealthy pieces we are
counting twice piece which is in excluded country and is outside segment
placement. This can cause unnecessary repair.

This change is also doing another step to move RepairExcludedCountryCodes
from overlay config into repair package.

Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
2023-07-10 12:01:19 +02:00
Márton Elek
97a89c3476 satellite: switch to use nodefilters instead of old placement.AllowedCountry
placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code.

The 90% of this patch is just search and replace:

 * we need to use NodeFilters instead of placement.AllowedCountry
 * which means, we need an initialized PlacementRules available everywhere
 * which means we need to configure the placement rules

The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change.

Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df
2023-07-07 16:55:45 +00:00
Michal Niewrzal
21c1e66a85 satellite/overlay: refactor ReliabilityCache to keep more data
ReliabilityCache will be now using refactored overlay Reliable method.
This method will provide more info about nodes (e.g. country code) and
with this we are able to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder

With those new method we will fix issue where offline but reliable node
won't be checked for clumped pieces and off placement pieces.

https://github.com/storj/storj/issues/5998

Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
2023-07-05 11:19:10 +02:00
Michal Niewrzal
f2cd7b0928 satellite/overlay: refactor Reliable to be used with repair checker
Currently we are using Reliable to get missing pieces for repair
checker. The issue is that now checker is looking at more things than
just missing pieces (clumped/off, placement pieces) and using only node
ID is not enough. We have issue where we are skipping offline nodes from
clumped and off placement pieces check.

Reliable was refactored to get data (e.g. country, lastNet) about all
reliable nodes. List is split into online and offline. This data will be
cached for quick use by repair checker. It will be also possible to
check nodes metadata like country code or lastNet.

We are also slowly moving `RepairExcludedCountryCodes` config from
overlay to repair which makes more sens for it.

This this first part of changes.

https://github.com/storj/storj/issues/5998

Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
2023-07-05 10:56:31 +02:00
paul cannon
25a5df9752 satellite/repair: don't reuse allNodeIDs
We were reusing a slice to save on allocations, but it turns out the
function using it was being called in multiple goroutines at the same
time.

This is definitely a problem with repairer/segments.go. I'm not 100%
sure if it also is a problem with checker/observer.go, but I'm making
the change there as well to be on the safe side for now.

Repair workers only ran with this bug on testing satellites, and it
looks like the worst that could have happened was that we repaired
pieces off of well-behaved, non-clumped, in-placement nodes by mistake.

Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d
2023-06-06 10:28:04 -05:00
Michal Niewrzal
337eb9be6a satellite/repair/checker: put into queue segment off placement
Checker when qualifying segment for repair is now looking at pieces
location and if they are outisde segment placement puts them into
repair queue.

Fixes https://github.com/storj/storj/issues/5895

Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7
2023-06-05 15:53:49 +00:00
paul cannon
3dc01bd25d satellite/repair: change how we log clumped pieces
rather than only logging the last_nets we see in clumpedPieces, this
will run through all the last_nets and log any that have more than one
node. This should have the same outcome, except the counts will be 1
higher (because FindClumpedPieces won't include the first node found in
a clumped network, and this will).

This should be quite a bit faster.

Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a
2023-05-19 10:38:50 +02:00
paul cannon
de737bdee9 satellite/repair: add flag for de-clumping behavior
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.

Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.

Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
2023-05-18 21:02:36 +00:00
paul cannon
1f4f79b6b3 satellite/repair: don't mark clumped segments as irreparable
Clumped segments (segments with multiple pieces on the same subnet) may
need repair, but the clumped pieces are considered retrievable and we
don't need to call such segments irreparable.

We do want to know where they're coming from, though, if we can, because
we are seeing more than expected.

Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9
2023-05-17 16:24:15 +00:00
Michal Niewrzal
4bdbb25d83 satellite/metabase/rangedloop: move Segment definition
We will remove segments loop soon so we need first to move
Segment definition to rangedloop package.

https://github.com/storj/storj/issues/5237

Change-Id: Ibe6aad316ffb7073cc4de166f1f17b87aac07363
2023-05-16 12:37:17 +00:00
Michal Niewrzal
36e046375c satellite/repair/checker: remove segments loop parts
We are switching completely to ranged loop.

https://github.com/storj/storj/issues/5368

Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3
2023-05-08 12:19:13 +00:00
Michal Niewrzal
aba2f14595 satellite/metabase/rangedloop: few additions for monitoring
Additional elements added:
* monkit metric for observers methods like Start/Fork/Join/Finish to
be able to check how much time those methods are taking
* few more logs e.g. entries with processed range
* segmentsProcessed metric to be able to check loop progress

Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec
2023-02-17 08:46:00 +00:00
Qweder93
d6a948f59d satellite/repair : implemented ranged loop observer
implemented observer and partial, created new structures to keep mon
metrics remain in same way as in segment loop

Change-Id: I209c126096c84b94d4717332e56238266f6cd004
2023-01-23 14:23:03 +00:00