Commit Graph

14 Commits

Author SHA1 Message Date
paul cannon
51c930f532 satellite/repair: extra logging during TestSegmentRepairPlacement
In case we continue to see flaky TestSegmentRepairPlacement, this may
help narrow down the issue.

Change-Id: I34ca70e5bb33eca26e9940e845142121cc946ac0
2023-11-01 20:43:18 +00:00
Michal Niewrzal
f5d717735b satellite/repair: fix checker and repairer logging
This is fixing two small issues with logging:
* repair checker was logging empty last net values as clumped pieces
but in main logic we stopped classifying it this way
* repairer summary log was showing incorrect number of pieces removed
from segment because list contains duplicated entries

There was no real issue here.

Change-Id: Ifc6a83637d621d628598200cad00cce44fa4cbf9
2023-10-25 22:55:53 +00:00
Márton Elek
58b98bc335 satellite/repair: repair is configurable to work only on included/excluded placements
This patch finishes the placement aware repair.

We already introduced the parameters to select only the jobs for specific placements, the remaining part is just to configure the exclude/include rules. + a full e2e unit test.

Change-Id: I223ba84e8ab7481a53e5a444596c7a5ae51573c5
2023-09-27 14:54:06 +00:00
paul cannon
1b8bd6c082 satellite/repair: unify repair logic
The repair checker and repair worker both need to determine which pieces
are healthy, which are retrievable, and which should be replaced, but
they have been doing it in different ways in different code, which has
been the cause of bugs. The same term could have very similar but subtly
different meanings between the two, causing much confusion.

With this change, the piece- and node-classification logic is
consolidated into one place within the satellite/repair package, so that
both subsystems can use it. This ought to make decision-making code more
concise and more readable.

The consolidated classification logic has been expanded to create more
sets, so that the decision-making code does not need to do as much
precalculation. It should now be clearer in comments and code that a
piece can belong to multiple sets arbitrarily (except where the
definition of the sets makes this logically impossible), and what the
precise meaning of each set is. These sets include Missing, Suspended,
Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair,
UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy.

Some other side effects of this change:

* CreatePutRepairOrderLimits no longer needs to special-case excluded
  countries; it can just create as many order limits as requested (by
  way of len(newNodes)).
* The repair checker will now queue a segment for repair when there are
  any pieces out of placement. The code calls this "forcing a repair".
* The checker.ReliabilityCache is now accessed by way of a GetNodes()
  function similar to the one on the overlay. The classification methods
  like MissingPieces(), OutOfPlacementPieces(), and
  PiecesNodesLastNetsInOrder() are removed in favor of the
  classification logic in satellite/repair/classification.go. This
  means the reliability cache no longer needs access to the placement
  rules or excluded countries list.

Change-Id: I105109fb94ee126952f07d747c6e11131164fadb
2023-09-25 09:42:08 -05:00
Márton Elek
98921f9faa satellite/overlay: fix placement selection config parsing
When we do `satellite run api --placement '...'`, the placement rules are not parsed well.

The problem is based on `viper.AllSettings()`, and the main logic is sg. like this (from a new unit test):

```
		r := ConfigurablePlacementRule{}
		err := r.Set(p)
		require.NoError(t, err)
		serialized := r.String()

		r2 := ConfigurablePlacementRule{}
		err = r2.Set(serialized)
		require.NoError(t, err)

		require.Equal(t, p, r2.String())
```

All settings evaluates the placement rules in `ConfigurablePlacementRules` and stores the string representation.

The problem is that we don't have proper `String()` implementation (it prints out the structs instead of the original definition.

There are two main solutions for this problem:

 1. We can fix the `String()`. When we parse a placement rule, the `String()` method should print out the original definition
 2. We can switch to use pure string as configuration parameter, and parse the rules only when required.

I feel that 1 is error prone, we can do it (and in this patch I added a lot of `String()` implementations, but it's hard to be sure that our `String()` logic is inline with the parsing logic.

Therefore I decided to make the configuration value of the placements a string (or a wrapper around string).

That's the main reason why this patch seems to be big, as I updated all the usages.

But the main part is in beginning of the `placement.go` (configuration parsing is not a pflag.Value implementation any more, but a separated step).

And `filter.go`, (a few more String implementation for filters.

https://github.com/storj/storj/issues/6248

Change-Id: I47c762d3514342b76a2e85683b1c891502a0756a
2023-09-21 14:31:41 +00:00
Márton Elek
c202929413
satellite/nodeselection: rename (NodeFilter).MatchInclude to Match
As I learned, the `Include` supposed to communicate that some internal change also "included" to the filters during the check -> filters might be stateful.

But it's not the case any more after 552242387, where we removed the only one stateful filter.

Change-Id: I7c36ddadb2defbfa3b6b67bcc115e4427ba9e083
2023-08-31 16:17:52 +02:00
Márton Elek
de7aabc8c9 satellite/{repair,rangedloop,overlay}: fix node tag placement selection for repair
This patch fixes the node tag based placement of rangedloop/repairchecker + repair process.

The main change is just adding the node tags for Reliable and KnownReliabel database calls + adding new tests to prove, it works.

https://github.com/storj/storj/issues/6126

Change-Id: I245d654a18c1d61b2c72df49afa0718d0de76da1
2023-08-16 15:45:41 +00:00
Michal Niewrzal
47a4d4986d satellite/repair: enable declumping by default
This feature flag was disabled by default to test it slowly. Its enabled
for some time on one production satellite and test satellites without
any issue. We can enable it by default in code.

Change-Id: If9c36895bbbea12bd4aefa30cb4df912e1729e4c
2023-07-17 15:02:35 +00:00
Michal Niewrzal
5234727886 satellite/repair/repairer: fix flaky TestSegmentRepairPlacement
Sometimes DownloadSelectionCache doesn't keep up with all node
placement changes we are doing during this test.

Change-Id: Idbda6511e3324b560cee3be85f980bf8d5b9b7ef
2023-07-14 10:10:40 +00:00
Michal Niewrzal
1d62dc63f5 satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation
Currently, we have issue were while counting unhealthy pieces we are
counting twice piece which is in excluded country and is outside segment
placement. This can cause unnecessary repair.

This change is also doing another step to move RepairExcludedCountryCodes
from overlay config into repair package.

Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
2023-07-10 12:01:19 +02:00
Michal Niewrzal
578724e9b1 satellite/repair/repairer: use KnownReliable to check segment pieces
At the moment segment repairer is skipping offline nodes in checks like
clumped pieces and off placement pieces. This change is fixing this
problem using new version of KnownReliable method. New method is
returning both online and offline nodes. Provided data can be used to
find clumped and off placement pieces.

We are not using DownloadSelectionCache anymore with segment repairer.

https://github.com/storj/storj/issues/5998

Change-Id: I236a1926e21f13df4cdedc91130352d37ff97e18
2023-06-28 16:53:51 +00:00
Michal Niewrzal
203c6be25f satellite/repair/repairer: test repairing geofenced segment
Additional test case to cover situation where we are trying to
repair segment with specific placement set. We need to be sure
that segment won't be repaired into nodes that are outside
segment placement, even if that means that repair will fail.

Change-Id: I99d238aa9d9b9606eaf89cd1cf587a2585faee91
2023-06-22 13:21:05 +00:00
Michal Niewrzal
7c33521ace satellite/repair/repairer: use placement to select nodes for repair upload
We missed to set placement as a part of selection request. It can case
uploading repaired data out of specified placement.

I will provide test as a separate change.

Change-Id: I4efe67f2d5f545a1d70e831e5d297f0977a4eed1
2023-06-10 20:55:39 +02:00
Michal Niewrzal
128b0a86e3 satellite/repair/repairer: repair pieces out of placement
Segment repairer should take into account segment 'placement' field
and remove or repair pieces from nodes that are outside this placement.

In case when after considering pieces out of placement we are still above
repair threshold we are only updating segment pieces to remove
problematic pieces. Otherwise we are doing regular repair.

https://github.com/storj/storj/issues/5896

Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59
2023-06-05 14:48:36 +00:00