Commit Graph

128 Commits

Author SHA1 Message Date
Márton Elek
c44e3d78d8 satellite/satellitedb: repairqueue.Select uses placement constraints
Change-Id: I59739926f8f6c5eaca3199369d4c5d88a9c08be8
2023-09-25 10:14:25 +00:00
Márton Elek
b4fdc49194 satellite/repair/checker: persist placement information to the queue
Change-Id: I51c7fd5a2a38f9f6620c16eddaed3b4915ffd792
2023-09-25 09:33:46 +00:00
Márton Elek
98921f9faa satellite/overlay: fix placement selection config parsing
When we do `satellite run api --placement '...'`, the placement rules are not parsed well.

The problem is based on `viper.AllSettings()`, and the main logic is sg. like this (from a new unit test):

```
		r := ConfigurablePlacementRule{}
		err := r.Set(p)
		require.NoError(t, err)
		serialized := r.String()

		r2 := ConfigurablePlacementRule{}
		err = r2.Set(serialized)
		require.NoError(t, err)

		require.Equal(t, p, r2.String())
```

All settings evaluates the placement rules in `ConfigurablePlacementRules` and stores the string representation.

The problem is that we don't have proper `String()` implementation (it prints out the structs instead of the original definition.

There are two main solutions for this problem:

 1. We can fix the `String()`. When we parse a placement rule, the `String()` method should print out the original definition
 2. We can switch to use pure string as configuration parameter, and parse the rules only when required.

I feel that 1 is error prone, we can do it (and in this patch I added a lot of `String()` implementations, but it's hard to be sure that our `String()` logic is inline with the parsing logic.

Therefore I decided to make the configuration value of the placements a string (or a wrapper around string).

That's the main reason why this patch seems to be big, as I updated all the usages.

But the main part is in beginning of the `placement.go` (configuration parsing is not a pflag.Value implementation any more, but a separated step).

And `filter.go`, (a few more String implementation for filters.

https://github.com/storj/storj/issues/6248

Change-Id: I47c762d3514342b76a2e85683b1c891502a0756a
2023-09-21 14:31:41 +00:00
Michal Niewrzal
5d0934e4d9 satellite/metabase/rangedloop: disable ranged loop for tests
Currently each testplanet test is running ranged loop no matter if
it's used or not. This is small change with some benefits like:
* saves some cpu cycles
* less log entries
* ranged loop won't interfere with other systems

Change have no big impact on tests execration but I believe it's nice to
have.

Change-Id: I731846bf625cac47ed4f3ca3bc1d1a4659bdcce8
2023-09-18 22:51:10 +00:00
Márton Elek
f40baf8629
go.mod: bump dependencies (private,uplink,common)
Change-Id: I6c55735b45cadaf36697eff53e78b5b09afe9dea
2023-09-06 13:28:22 +02:00
Márton Elek
e2006d821c satellite/overlay: change Reliable and KnownReliable
as GetParticipatingNodes and GetNodes, respectively.

We now want these functions to include offline and suspended nodes as
well, so that we can force immediate repair when pieces are out of
placement or in excluded countries. With that change, the old names no
longer made sense.

Change-Id: Icbcbad43dbde0ca8cbc80a4d17a896bb89b078b7
2023-09-02 23:34:50 +00:00
Márton Elek
c202929413
satellite/nodeselection: rename (NodeFilter).MatchInclude to Match
As I learned, the `Include` supposed to communicate that some internal change also "included" to the filters during the check -> filters might be stateful.

But it's not the case any more after 552242387, where we removed the only one stateful filter.

Change-Id: I7c36ddadb2defbfa3b6b67bcc115e4427ba9e083
2023-08-31 16:17:52 +02:00
Márton Elek
84ea80c1fd satellite/repair/checker: respect autoExcludeSubnet anntation in checker rangedloop
This patch is a oneliner: rangedloop checker should check the subnets only if it's not turned off with placement annotation.
(see in satellite/repair/checker/observer.go).

But I didn't find any unit test to cover that part, so I had to write one, and I prefered to write it as a unit test not an integration test, which requires a mock repair queue (observer_unit_test.go mock.go).

Because it's small change, I also included a small change: creating a elper method to check if AutoExcludeSubnet annotation is defined

Change-Id: I2666b937074ab57f603b356408ef108cd55bd6fd
2023-08-23 13:45:09 +00:00
Márton Elek
da08117fcd satellite/~placement: do not ignore placement check for placement=0
There are cases when we would like to override the default placement=0 rule.

For example when we would like to exclude tagged nodes from the selection (by default).

Therefore we couldn't use a shortcut any more, we should always check the placement rules, even if we use placement=0.

TODO: we need to update common, and rename `EveryCountry` to `DefaultPlacement`, just to avoid confusion.

https://github.com/storj/storj/issues/6126

Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4
2023-08-16 07:06:56 +00:00
Michal Niewrzal
7be844351d satellite/metainfo: remove ServerSideCopyDuplicateMetadata
https://github.com/storj/storj/issues/5891

Change-Id: Ib5440169107acca6e832c2280e1ad12dfd380f28
2023-08-08 12:15:10 +00:00
Michal Niewrzal
47a4d4986d satellite/repair: enable declumping by default
This feature flag was disabled by default to test it slowly. Its enabled
for some time on one production satellite and test satellites without
any issue. We can enable it by default in code.

Change-Id: If9c36895bbbea12bd4aefa30cb4df912e1729e4c
2023-07-17 15:02:35 +00:00
Michal Niewrzal
1d62dc63f5 satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation
Currently, we have issue were while counting unhealthy pieces we are
counting twice piece which is in excluded country and is outside segment
placement. This can cause unnecessary repair.

This change is also doing another step to move RepairExcludedCountryCodes
from overlay config into repair package.

Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
2023-07-10 12:01:19 +02:00
Márton Elek
97a89c3476 satellite: switch to use nodefilters instead of old placement.AllowedCountry
placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code.

The 90% of this patch is just search and replace:

 * we need to use NodeFilters instead of placement.AllowedCountry
 * which means, we need an initialized PlacementRules available everywhere
 * which means we need to configure the placement rules

The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change.

Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df
2023-07-07 16:55:45 +00:00
Márton Elek
70cdca5d3c
satellite: move satellite/nodeselection/uploadselection => satellite/nodeselection
All the files in uploadselection are (in fact) related to generic node selection, and used not only for upload,
but for download, repair, etc...

Change-Id: Ie4098318a6f8f0bbf672d432761e87047d3762ab
2023-07-07 10:32:03 +02:00
Michal Niewrzal
21c1e66a85 satellite/overlay: refactor ReliabilityCache to keep more data
ReliabilityCache will be now using refactored overlay Reliable method.
This method will provide more info about nodes (e.g. country code) and
with this we are able to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder

With those new method we will fix issue where offline but reliable node
won't be checked for clumped pieces and off placement pieces.

https://github.com/storj/storj/issues/5998

Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
2023-07-05 11:19:10 +02:00
Michal Niewrzal
f2cd7b0928 satellite/overlay: refactor Reliable to be used with repair checker
Currently we are using Reliable to get missing pieces for repair
checker. The issue is that now checker is looking at more things than
just missing pieces (clumped/off, placement pieces) and using only node
ID is not enough. We have issue where we are skipping offline nodes from
clumped and off placement pieces check.

Reliable was refactored to get data (e.g. country, lastNet) about all
reliable nodes. List is split into online and offline. This data will be
cached for quick use by repair checker. It will be also possible to
check nodes metadata like country code or lastNet.

We are also slowly moving `RepairExcludedCountryCodes` config from
overlay to repair which makes more sens for it.

This this first part of changes.

https://github.com/storj/storj/issues/5998

Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
2023-07-05 10:56:31 +02:00
paul cannon
25a5df9752 satellite/repair: don't reuse allNodeIDs
We were reusing a slice to save on allocations, but it turns out the
function using it was being called in multiple goroutines at the same
time.

This is definitely a problem with repairer/segments.go. I'm not 100%
sure if it also is a problem with checker/observer.go, but I'm making
the change there as well to be on the safe side for now.

Repair workers only ran with this bug on testing satellites, and it
looks like the worst that could have happened was that we repaired
pieces off of well-behaved, non-clumped, in-placement nodes by mistake.

Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d
2023-06-06 10:28:04 -05:00
Michal Niewrzal
337eb9be6a satellite/repair/checker: put into queue segment off placement
Checker when qualifying segment for repair is now looking at pieces
location and if they are outisde segment placement puts them into
repair queue.

Fixes https://github.com/storj/storj/issues/5895

Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7
2023-06-05 15:53:49 +00:00
paul cannon
3dc01bd25d satellite/repair: change how we log clumped pieces
rather than only logging the last_nets we see in clumpedPieces, this
will run through all the last_nets and log any that have more than one
node. This should have the same outcome, except the counts will be 1
higher (because FindClumpedPieces won't include the first node found in
a clumped network, and this will).

This should be quite a bit faster.

Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a
2023-05-19 10:38:50 +02:00
paul cannon
c856d45cc0 satellite/overlay: fix GetNodesNetworkInOrder
We were using the UploadSelectionCache previously, which does _not_ have
all nodes, or even all online nodes, in it. So all nodes with less than
MinimumVersion, or with less than MinimumDiskSpace, or nodes suspended
for unknown audit errors, or nodes that have started graceful exit, were
all missing, and ended up having empty last_nets. Even with all that,
I'm kind of surprised how many nodes this involved, but using the upload
selection cache was definitely wrong.

This change uses the download selection cache instead, which excludes
nodes only when they are disqualified, gracefully exited (completely),
or offline.

Change-Id: Iaa07c988aa29c1eb05796ac48a6f19d69f5826c1
2023-05-19 08:08:08 +00:00
paul cannon
de737bdee9 satellite/repair: add flag for de-clumping behavior
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.

Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.

Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
2023-05-18 21:02:36 +00:00
paul cannon
958d8676d0 satellite/overlay: remove unnecessary test helper
Change-Id: I8439eec4ed440f60353fc620ca906a917a03613c
2023-05-17 17:04:54 +00:00
paul cannon
1f4f79b6b3 satellite/repair: don't mark clumped segments as irreparable
Clumped segments (segments with multiple pieces on the same subnet) may
need repair, but the clumped pieces are considered retrievable and we
don't need to call such segments irreparable.

We do want to know where they're coming from, though, if we can, because
we are seeing more than expected.

Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9
2023-05-17 16:24:15 +00:00
paul cannon
75d10fe4fa satellite/overlay: use UploadSelectionCache for GetNodesNetworkInOrder
The query for GetNodesNetworkInOrder is causing far too much load on the
database. Since it is not critical that the repair checker have
perfectly up-to-date node network information, we can use a cache
instead.

Change-Id: I07ad45bfdeb46529da093941a06c2da8a00ce878
2023-05-16 17:32:09 +00:00
Michal Niewrzal
4bdbb25d83 satellite/metabase/rangedloop: move Segment definition
We will remove segments loop soon so we need first to move
Segment definition to rangedloop package.

https://github.com/storj/storj/issues/5237

Change-Id: Ibe6aad316ffb7073cc4de166f1f17b87aac07363
2023-05-16 12:37:17 +00:00
Michal Niewrzal
36e046375c satellite/repair/checker: remove segments loop parts
We are switching completely to ranged loop.

https://github.com/storj/storj/issues/5368

Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3
2023-05-08 12:19:13 +00:00
paul cannon
915f3952af satellite/repair: repair pieces on the same last_net
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.

This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).

Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
2023-04-06 17:34:25 +00:00
Márton Elek
ffaf15a3b0 satellite/overlay: remove unused mail service from overlay
It was surprising that `satellite auditor` complained about SMTP mail settings, even if it's not supposed to sending any mail.

Looks like we can remove the mail service dependency, as it's not a hard requirement for overlay.Service.

Change-Id: I29a52eeff3f967ddb2d74a09458dc0ee2f051bd7
2023-03-09 12:17:35 +00:00
Michal Niewrzal
aba2f14595 satellite/metabase/rangedloop: few additions for monitoring
Additional elements added:
* monkit metric for observers methods like Start/Fork/Join/Finish to
be able to check how much time those methods are taking
* few more logs e.g. entries with processed range
* segmentsProcessed metric to be able to check loop progress

Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec
2023-02-17 08:46:00 +00:00
Qweder93
d6a948f59d satellite/repair : implemented ranged loop observer
implemented observer and partial, created new structures to keep mon
metrics remain in same way as in segment loop

Change-Id: I209c126096c84b94d4717332e56238266f6cd004
2023-01-23 14:23:03 +00:00
Cameron
f06da25c3d satellite/overlay: add nodeevents.DB to satellite overlay service
Add nodeevents.DB to satellite overlay service so we can insert node
events into the nodeevents DB.

Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156
2022-11-02 15:56:37 +00:00
Cameron
a52f766273 satellite/overlay: add email-sending functionality to overlay service
We want to send emails to SNOs. Node status changes go through the
overlay service, so it's a good place to add the mail service.
Add the mailservice.Service, satellite address, and satellite name to
overlay service. Also add feature flag --overlay.send-node-emails

Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24
2022-10-13 18:01:05 +00:00
Michal Niewrzal
5dc5f076c9 satellite/repair/checker: remove monitoring from fast methods
It looks that monikt monitoring can give high CPU overhead for
segments loop observer. With this code we are changing how monitoring
is initialized for observer methods. This optimization affects mainly
path where segment is healthy and doesn't require repair. Benchmark
is also added to show difference between old and new approach.

Benchmark against 'main':
name                                       old time/op    new time/op    delta
RemoteSegment/Cockroach/healthy_segment-8    8.55µs ± 4%    1.37µs ± 6%  -84.03%  (p=0.008 n=5+5)

name                                       old alloc/op   new alloc/op   delta
RemoteSegment/Cockroach/healthy_segment-8    2.63kB ± 0%    0.17kB ± 0%  -93.62%  (p=0.008 n=5+5)

name                                       old allocs/op  new allocs/op  delta
RemoteSegment/Cockroach/healthy_segment-8      54.0 ± 0%       8.0 ± 0%  -85.19%  (p=0.008 n=5+5)

Change-Id: Ie138eab0d59e436395b13f57bdfb11f9871d4c18
2022-10-03 12:15:03 +00:00
Michal Niewrzal
1aecca1e76 satellite/repair/checker: tiny cleanup
* unused slice removed
* variable moved closer to place of use

Change-Id: I86126b8337225d4b31cabf89bc9640add7409398
2022-09-26 11:20:10 +00:00
Michal Niewrzal
6cc2052f47 satellite: fix segment loop observers metrics
We made optimization for segment loop observers to avoid
heavy monkit initialization on each call. It was applied to very
often executed methods. Unfortunately we used wrong monkit
method to track function times. Instead mon.Task we used
mon.Func().

https://github.com/spacemonkeygo/monkit#how-it-works

Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8
2022-08-10 14:13:16 +00:00
Egon Elbre
48b0a65fbd satellite/overlay: use ReadCache in Download/UploadSelectionCache
sync2.ReadCache implements preemptive refreshing preventing stalling
while it's being updated.

Change-Id: Iee9ef36049b986f0e426c14a139b2bc9ac17fb53
2022-07-12 13:52:48 +03:00
Michał Niewrzał
7a2d2a36ca satellite: use more optimal monkit call for loop observers methods
Recently we applied this optimization to metrics observer and time
used by its method dropped from 12m to 3m for us1 (220m segments).
It looks that it make sense to apply the same code to all observers.

Change-Id: I05898aaacbd9bcdf21babc7be9955da1db57bdf2
2022-05-20 11:03:41 +00:00
Erik van Velzen
db1cc8ca95 satellite/repair/checker: buffer repair queue
Integrate previous changes. Speed up the segment loop by batch inserting
into repair queue.

Change-Id: Ib9f4962d91960d21bad298f7771345b0dd270276
2022-05-12 16:28:05 +00:00
Qweder93
8b0988708a satellite/repair: add test that confirms that repairer is ignoring copied segments
Resolves https://github.com/storj/storj/issues/4485

Change-Id: Ic772643520124fe3f7eacf8b3bfbbb38982d4769
2022-03-16 09:00:34 +00:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
Fadila Khadar
e776c65172 satellite/checker: pieces in excluded countries are not healthy
Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection.

Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.

Another change will handle the logic at the repairer level.

Fixes https://github.com/storj/team-metainfo/issues/95

Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
2022-03-02 09:59:09 +00:00
Michał Niewrzał
bc161794fc satellite/metabase: drop DeleteObjectLatestVersion method
This method was never used, except tests.

Change-Id: Idc1e69b2e2971995b5c4e6cf78a2b5fc69f39ad2
2022-02-02 14:33:48 +00:00
Michał Niewrzał
1fdb0eaa5b Revert "satellite/metabase: use storj.Nonce instead []byte"
This change introduce problems with server side move so
let's revert it for now. Problem was found when latest
version of storj/storj was used in uplink tests.

This reverts commit 1ef06fae99.

Change-Id: I4d4fad5d1ea04ba15ff9d7bd765f7e078e9187c2
2021-10-12 15:39:54 +02:00
Michał Niewrzał
1ef06fae99 satellite/metabase: use storj.Nonce instead []byte
We were using mixed types for nonce fields. Protobuf
have storj.Nonce, metabase have []byte. This change
is a refactoring to have everywere its possible only
storj.Nonce.

Change-Id: Id54bd8481f30c721cdaf3df79206d25e7cfdab55
2021-10-11 16:13:34 +00:00
Michał Niewrzał
c258f4bbac private/testplanet: move Metabase outside Metainfo for satellite
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.

Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
2021-09-09 07:15:51 +00:00
Cameron Ayer
a7cda642a5 satellite/repair: add logging for irreparable segments in checker
If the checker sees an irreparable segment, log out some info
so we can see what the problem is

Change-Id: I76eda5270214205f4fefc646d6c391cc13ddcafd
2021-09-02 12:35:29 -04:00
Clement Sam
1f353f3231 segment/{metabase,repair}: change segment created_at column to not accept nulls
This change adds a NOT NULL constraint to the created_at column in the segment table.
All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc)

Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52
2021-08-06 08:16:28 +00:00
Michał Niewrzał
0d8e7905c1 satellite/repair/checker: don't return error when joining loop
Error from joining loop should not restart satellite. This will be the
same error like for loop itself. In the same way we are handling joining
error for other services that are using segment loop.

Change-Id: Idf1035ef7f78462927bd23989ed8a4ee5826c49e
2021-08-03 12:56:42 +00:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Michał Niewrzał
a883d7f582 satellite/repair/checker: fix remote_files_checked metric
While metaloop refactoring we missed metric for all
objects processed by repair checker.

Change-Id: I100f10a36c52e2651923ecaa377261752877d673
2021-07-22 14:48:08 +00:00