storj

Author	SHA1	Message	Date
paul cannon	72189330fd	satellite/gracefulexit: revamp graceful exit Currently, graceful exit is a complicated subsystem that keeps a queue of all pieces expected to be on a node, and asks the node to transfer those pieces to other nodes one by one. The complexity of the system has, unfortunately, led to numerous bugs and unexpected behaviors. We have decided to remove this entire subsystem and restructure graceful exit as follows: * Nodes will signal their intent to exit gracefully * The satellite will not send any new pieces to gracefully exiting nodes * Pieces on gracefully exiting nodes will be considered by the repair subsystem as "retrievable but unhealthy". They will be repaired off of the exiting node as needed. * After one month (with an appropriately high online score), the node will be considered exited, and held amounts for the node will be released. The repair worker will continue to fetch pieces from the node as long as the node stays online. * If, at the end of the month, a node's online score is below a certain threshold, its graceful exit will fail. Refs: https://github.com/storj/storj/issues/6042 Change-Id: I52d4e07a4198e9cb2adf5e6cee2cb64d6f9f426b	2023-09-27 08:40:01 +00:00
paul cannon	1b8bd6c082	satellite/repair: unify repair logic The repair checker and repair worker both need to determine which pieces are healthy, which are retrievable, and which should be replaced, but they have been doing it in different ways in different code, which has been the cause of bugs. The same term could have very similar but subtly different meanings between the two, causing much confusion. With this change, the piece- and node-classification logic is consolidated into one place within the satellite/repair package, so that both subsystems can use it. This ought to make decision-making code more concise and more readable. The consolidated classification logic has been expanded to create more sets, so that the decision-making code does not need to do as much precalculation. It should now be clearer in comments and code that a piece can belong to multiple sets arbitrarily (except where the definition of the sets makes this logically impossible), and what the precise meaning of each set is. These sets include Missing, Suspended, Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair, UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy. Some other side effects of this change: * CreatePutRepairOrderLimits no longer needs to special-case excluded countries; it can just create as many order limits as requested (by way of len(newNodes)). * The repair checker will now queue a segment for repair when there are any pieces out of placement. The code calls this "forcing a repair". * The checker.ReliabilityCache is now accessed by way of a GetNodes() function similar to the one on the overlay. The classification methods like MissingPieces(), OutOfPlacementPieces(), and PiecesNodesLastNetsInOrder() are removed in favor of the classification logic in satellite/repair/classification.go. This means the reliability cache no longer needs access to the placement rules or excluded countries list. Change-Id: I105109fb94ee126952f07d747c6e11131164fadb	2023-09-25 09:42:08 -05:00
Márton Elek	c44e3d78d8	satellite/satellitedb: repairqueue.Select uses placement constraints Change-Id: I59739926f8f6c5eaca3199369d4c5d88a9c08be8	2023-09-25 10:14:25 +00:00
Márton Elek	b4fdc49194	satellite/repair/checker: persist placement information to the queue Change-Id: I51c7fd5a2a38f9f6620c16eddaed3b4915ffd792	2023-09-25 09:33:46 +00:00
Márton Elek	98921f9faa	satellite/overlay: fix placement selection config parsing When we do `satellite run api --placement '...'`, the placement rules are not parsed well. The problem is based on `viper.AllSettings()`, and the main logic is sg. like this (from a new unit test): ``` r := ConfigurablePlacementRule{} err := r.Set(p) require.NoError(t, err) serialized := r.String() r2 := ConfigurablePlacementRule{} err = r2.Set(serialized) require.NoError(t, err) require.Equal(t, p, r2.String()) ``` All settings evaluates the placement rules in `ConfigurablePlacementRules` and stores the string representation. The problem is that we don't have proper `String()` implementation (it prints out the structs instead of the original definition. There are two main solutions for this problem: 1. We can fix the `String()`. When we parse a placement rule, the `String()` method should print out the original definition 2. We can switch to use pure string as configuration parameter, and parse the rules only when required. I feel that 1 is error prone, we can do it (and in this patch I added a lot of `String()` implementations, but it's hard to be sure that our `String()` logic is inline with the parsing logic. Therefore I decided to make the configuration value of the placements a string (or a wrapper around string). That's the main reason why this patch seems to be big, as I updated all the usages. But the main part is in beginning of the `placement.go` (configuration parsing is not a pflag.Value implementation any more, but a separated step). And `filter.go`, (a few more String implementation for filters. https://github.com/storj/storj/issues/6248 Change-Id: I47c762d3514342b76a2e85683b1c891502a0756a	2023-09-21 14:31:41 +00:00
Michal Niewrzal	5d0934e4d9	satellite/metabase/rangedloop: disable ranged loop for tests Currently each testplanet test is running ranged loop no matter if it's used or not. This is small change with some benefits like: * saves some cpu cycles * less log entries * ranged loop won't interfere with other systems Change have no big impact on tests execration but I believe it's nice to have. Change-Id: I731846bf625cac47ed4f3ca3bc1d1a4659bdcce8	2023-09-18 22:51:10 +00:00
Márton Elek	f40baf8629	go.mod: bump dependencies (private,uplink,common) Change-Id: I6c55735b45cadaf36697eff53e78b5b09afe9dea	2023-09-06 13:28:22 +02:00
Márton Elek	e2006d821c	satellite/overlay: change Reliable and KnownReliable as GetParticipatingNodes and GetNodes, respectively. We now want these functions to include offline and suspended nodes as well, so that we can force immediate repair when pieces are out of placement or in excluded countries. With that change, the old names no longer made sense. Change-Id: Icbcbad43dbde0ca8cbc80a4d17a896bb89b078b7	2023-09-02 23:34:50 +00:00
Márton Elek	c202929413	satellite/nodeselection: rename (NodeFilter).MatchInclude to Match As I learned, the `Include` supposed to communicate that some internal change also "included" to the filters during the check -> filters might be stateful. But it's not the case any more after `552242387`, where we removed the only one stateful filter. Change-Id: I7c36ddadb2defbfa3b6b67bcc115e4427ba9e083	2023-08-31 16:17:52 +02:00
Márton Elek	84ea80c1fd	satellite/repair/checker: respect autoExcludeSubnet anntation in checker rangedloop This patch is a oneliner: rangedloop checker should check the subnets only if it's not turned off with placement annotation. (see in satellite/repair/checker/observer.go). But I didn't find any unit test to cover that part, so I had to write one, and I prefered to write it as a unit test not an integration test, which requires a mock repair queue (observer_unit_test.go mock.go). Because it's small change, I also included a small change: creating a elper method to check if AutoExcludeSubnet annotation is defined Change-Id: I2666b937074ab57f603b356408ef108cd55bd6fd	2023-08-23 13:45:09 +00:00
Márton Elek	da08117fcd	satellite/~placement: do not ignore placement check for placement=0 There are cases when we would like to override the default placement=0 rule. For example when we would like to exclude tagged nodes from the selection (by default). Therefore we couldn't use a shortcut any more, we should always check the placement rules, even if we use placement=0. TODO: we need to update common, and rename `EveryCountry` to `DefaultPlacement`, just to avoid confusion. https://github.com/storj/storj/issues/6126 Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4	2023-08-16 07:06:56 +00:00
Michal Niewrzal	7be844351d	satellite/metainfo: remove ServerSideCopyDuplicateMetadata https://github.com/storj/storj/issues/5891 Change-Id: Ib5440169107acca6e832c2280e1ad12dfd380f28	2023-08-08 12:15:10 +00:00
Michal Niewrzal	47a4d4986d	satellite/repair: enable declumping by default This feature flag was disabled by default to test it slowly. Its enabled for some time on one production satellite and test satellites without any issue. We can enable it by default in code. Change-Id: If9c36895bbbea12bd4aefa30cb4df912e1729e4c	2023-07-17 15:02:35 +00:00
Michal Niewrzal	1d62dc63f5	satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation Currently, we have issue were while counting unhealthy pieces we are counting twice piece which is in excluded country and is outside segment placement. This can cause unnecessary repair. This change is also doing another step to move RepairExcludedCountryCodes from overlay config into repair package. Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f	2023-07-10 12:01:19 +02:00
Márton Elek	97a89c3476	satellite: switch to use nodefilters instead of old placement.AllowedCountry placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code. The 90% of this patch is just search and replace: * we need to use NodeFilters instead of placement.AllowedCountry * which means, we need an initialized PlacementRules available everywhere * which means we need to configure the placement rules The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change. Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df	2023-07-07 16:55:45 +00:00
Márton Elek	70cdca5d3c	satellite: move satellite/nodeselection/uploadselection => satellite/nodeselection All the files in uploadselection are (in fact) related to generic node selection, and used not only for upload, but for download, repair, etc... Change-Id: Ie4098318a6f8f0bbf672d432761e87047d3762ab	2023-07-07 10:32:03 +02:00
Michal Niewrzal	21c1e66a85	satellite/overlay: refactor ReliabilityCache to keep more data ReliabilityCache will be now using refactored overlay Reliable method. This method will provide more info about nodes (e.g. country code) and with this we are able to add two dedicated methods to classify pieces: * OutOfPlacementPieces * PiecesNodesLastNetsInOrder With those new method we will fix issue where offline but reliable node won't be checked for clumped pieces and off placement pieces. https://github.com/storj/storj/issues/5998 Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82	2023-07-05 11:19:10 +02:00
Michal Niewrzal	f2cd7b0928	satellite/overlay: refactor Reliable to be used with repair checker Currently we are using Reliable to get missing pieces for repair checker. The issue is that now checker is looking at more things than just missing pieces (clumped/off, placement pieces) and using only node ID is not enough. We have issue where we are skipping offline nodes from clumped and off placement pieces check. Reliable was refactored to get data (e.g. country, lastNet) about all reliable nodes. List is split into online and offline. This data will be cached for quick use by repair checker. It will be also possible to check nodes metadata like country code or lastNet. We are also slowly moving `RepairExcludedCountryCodes` config from overlay to repair which makes more sens for it. This this first part of changes. https://github.com/storj/storj/issues/5998 Change-Id: If534342488c0e440affc2894a8fbda6507b8959d	2023-07-05 10:56:31 +02:00
paul cannon	25a5df9752	satellite/repair: don't reuse allNodeIDs We were reusing a slice to save on allocations, but it turns out the function using it was being called in multiple goroutines at the same time. This is definitely a problem with repairer/segments.go. I'm not 100% sure if it also is a problem with checker/observer.go, but I'm making the change there as well to be on the safe side for now. Repair workers only ran with this bug on testing satellites, and it looks like the worst that could have happened was that we repaired pieces off of well-behaved, non-clumped, in-placement nodes by mistake. Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d	2023-06-06 10:28:04 -05:00
Michal Niewrzal	337eb9be6a	satellite/repair/checker: put into queue segment off placement Checker when qualifying segment for repair is now looking at pieces location and if they are outisde segment placement puts them into repair queue. Fixes https://github.com/storj/storj/issues/5895 Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7	2023-06-05 15:53:49 +00:00
paul cannon	3dc01bd25d	satellite/repair: change how we log clumped pieces rather than only logging the last_nets we see in clumpedPieces, this will run through all the last_nets and log any that have more than one node. This should have the same outcome, except the counts will be 1 higher (because FindClumpedPieces won't include the first node found in a clumped network, and this will). This should be quite a bit faster. Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a	2023-05-19 10:38:50 +02:00
paul cannon	c856d45cc0	satellite/overlay: fix GetNodesNetworkInOrder We were using the UploadSelectionCache previously, which does _not_ have all nodes, or even all online nodes, in it. So all nodes with less than MinimumVersion, or with less than MinimumDiskSpace, or nodes suspended for unknown audit errors, or nodes that have started graceful exit, were all missing, and ended up having empty last_nets. Even with all that, I'm kind of surprised how many nodes this involved, but using the upload selection cache was definitely wrong. This change uses the download selection cache instead, which excludes nodes only when they are disqualified, gracefully exited (completely), or offline. Change-Id: Iaa07c988aa29c1eb05796ac48a6f19d69f5826c1	2023-05-19 08:08:08 +00:00
paul cannon	de737bdee9	satellite/repair: add flag for de-clumping behavior It seems that the "what pieces are clumped" code does not work right, so this logic is causing repair overload or other repair failures. Hide it behind a flag while we figure out what is going on, so that repair can still work in the meantime. Change-Id: If83ef7895cba870353a67ab13573193d92fff80b	2023-05-18 21:02:36 +00:00
paul cannon	958d8676d0	satellite/overlay: remove unnecessary test helper Change-Id: I8439eec4ed440f60353fc620ca906a917a03613c	2023-05-17 17:04:54 +00:00
paul cannon	1f4f79b6b3	satellite/repair: don't mark clumped segments as irreparable Clumped segments (segments with multiple pieces on the same subnet) may need repair, but the clumped pieces are considered retrievable and we don't need to call such segments irreparable. We do want to know where they're coming from, though, if we can, because we are seeing more than expected. Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9	2023-05-17 16:24:15 +00:00
paul cannon	75d10fe4fa	satellite/overlay: use UploadSelectionCache for GetNodesNetworkInOrder The query for GetNodesNetworkInOrder is causing far too much load on the database. Since it is not critical that the repair checker have perfectly up-to-date node network information, we can use a cache instead. Change-Id: I07ad45bfdeb46529da093941a06c2da8a00ce878	2023-05-16 17:32:09 +00:00
Michal Niewrzal	4bdbb25d83	satellite/metabase/rangedloop: move Segment definition We will remove segments loop soon so we need first to move Segment definition to rangedloop package. https://github.com/storj/storj/issues/5237 Change-Id: Ibe6aad316ffb7073cc4de166f1f17b87aac07363	2023-05-16 12:37:17 +00:00
Michal Niewrzal	36e046375c	satellite/repair/checker: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3	2023-05-08 12:19:13 +00:00
paul cannon	915f3952af	satellite/repair: repair pieces on the same last_net We avoid putting more than one piece of a segment on the same /24 network (or /64 for ipv6). However, it is possible for multiple pieces of the same segment to move to the same network over time. Nodes can change addresses, or segments could be uploaded with dev settings, etc. We will call such pieces "clumped", as they are clumped into the same net, and are much more likely to be lost or preserved together. This change teaches the repair checker to recognize segments which have clumped pieces, and put them in the repair queue. It also teaches the repair worker to repair such segments (treating clumped pieces as "retrievable but unhealthy"; i.e., they will be replaced on new nodes if possible). Refs: https://github.com/storj/storj/issues/5391 Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908	2023-04-06 17:34:25 +00:00
Márton Elek	ffaf15a3b0	satellite/overlay: remove unused mail service from overlay It was surprising that `satellite auditor` complained about SMTP mail settings, even if it's not supposed to sending any mail. Looks like we can remove the mail service dependency, as it's not a hard requirement for overlay.Service. Change-Id: I29a52eeff3f967ddb2d74a09458dc0ee2f051bd7	2023-03-09 12:17:35 +00:00
Michal Niewrzal	aba2f14595	satellite/metabase/rangedloop: few additions for monitoring Additional elements added: * monkit metric for observers methods like Start/Fork/Join/Finish to be able to check how much time those methods are taking * few more logs e.g. entries with processed range * segmentsProcessed metric to be able to check loop progress Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec	2023-02-17 08:46:00 +00:00
Qweder93	d6a948f59d	satellite/repair : implemented ranged loop observer implemented observer and partial, created new structures to keep mon metrics remain in same way as in segment loop Change-Id: I209c126096c84b94d4717332e56238266f6cd004	2023-01-23 14:23:03 +00:00
Cameron	f06da25c3d	satellite/overlay: add nodeevents.DB to satellite overlay service Add nodeevents.DB to satellite overlay service so we can insert node events into the nodeevents DB. Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156	2022-11-02 15:56:37 +00:00
Cameron	a52f766273	satellite/overlay: add email-sending functionality to overlay service We want to send emails to SNOs. Node status changes go through the overlay service, so it's a good place to add the mail service. Add the mailservice.Service, satellite address, and satellite name to overlay service. Also add feature flag --overlay.send-node-emails Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24	2022-10-13 18:01:05 +00:00
Michal Niewrzal	5dc5f076c9	satellite/repair/checker: remove monitoring from fast methods It looks that monikt monitoring can give high CPU overhead for segments loop observer. With this code we are changing how monitoring is initialized for observer methods. This optimization affects mainly path where segment is healthy and doesn't require repair. Benchmark is also added to show difference between old and new approach. Benchmark against 'main': name old time/op new time/op delta RemoteSegment/Cockroach/healthy_segment-8 8.55µs ± 4% 1.37µs ± 6% -84.03% (p=0.008 n=5+5) name old alloc/op new alloc/op delta RemoteSegment/Cockroach/healthy_segment-8 2.63kB ± 0% 0.17kB ± 0% -93.62% (p=0.008 n=5+5) name old allocs/op new allocs/op delta RemoteSegment/Cockroach/healthy_segment-8 54.0 ± 0% 8.0 ± 0% -85.19% (p=0.008 n=5+5) Change-Id: Ie138eab0d59e436395b13f57bdfb11f9871d4c18	2022-10-03 12:15:03 +00:00
Michal Niewrzal	1aecca1e76	satellite/repair/checker: tiny cleanup * unused slice removed * variable moved closer to place of use Change-Id: I86126b8337225d4b31cabf89bc9640add7409398	2022-09-26 11:20:10 +00:00
Michal Niewrzal	6cc2052f47	satellite: fix segment loop observers metrics We made optimization for segment loop observers to avoid heavy monkit initialization on each call. It was applied to very often executed methods. Unfortunately we used wrong monkit method to track function times. Instead mon.Task we used mon.Func(). https://github.com/spacemonkeygo/monkit#how-it-works Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8	2022-08-10 14:13:16 +00:00
Egon Elbre	48b0a65fbd	satellite/overlay: use ReadCache in Download/UploadSelectionCache sync2.ReadCache implements preemptive refreshing preventing stalling while it's being updated. Change-Id: Iee9ef36049b986f0e426c14a139b2bc9ac17fb53	2022-07-12 13:52:48 +03:00
Michał Niewrzał	7a2d2a36ca	satellite: use more optimal monkit call for loop observers methods Recently we applied this optimization to metrics observer and time used by its method dropped from 12m to 3m for us1 (220m segments). It looks that it make sense to apply the same code to all observers. Change-Id: I05898aaacbd9bcdf21babc7be9955da1db57bdf2	2022-05-20 11:03:41 +00:00
Erik van Velzen	db1cc8ca95	satellite/repair/checker: buffer repair queue Integrate previous changes. Speed up the segment loop by batch inserting into repair queue. Change-Id: Ib9f4962d91960d21bad298f7771345b0dd270276	2022-05-12 16:28:05 +00:00
Qweder93	8b0988708a	satellite/repair: add test that confirms that repairer is ignoring copied segments Resolves https://github.com/storj/storj/issues/4485 Change-Id: Ic772643520124fe3f7eacf8b3bfbbb38982d4769	2022-03-16 09:00:34 +00:00
Fadila Khadar	29fd36a20e	satellite/repairer: handle excluded countries For nodes in excluded areas, we don't necessarily want to remove them from the pointer, but we do want to increase the number of pieces in the segment in case those excluded area nodes go down. To do that, we increase the number of pieces repaired by the number of pieces in excluded areas. Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a	2022-03-14 10:59:36 -04:00
Fadila Khadar	e776c65172	satellite/checker: pieces in excluded countries are not healthy Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection. Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy. With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer. Another change will handle the logic at the repairer level. Fixes https://github.com/storj/team-metainfo/issues/95 Change-Id: I9231b32de117a116488de055a3e94efcabb46e81	2022-03-02 09:59:09 +00:00
Michał Niewrzał	bc161794fc	satellite/metabase: drop DeleteObjectLatestVersion method This method was never used, except tests. Change-Id: Idc1e69b2e2971995b5c4e6cf78a2b5fc69f39ad2	2022-02-02 14:33:48 +00:00
Michał Niewrzał	1fdb0eaa5b	Revert "satellite/metabase: use storj.Nonce instead []byte" This change introduce problems with server side move so let's revert it for now. Problem was found when latest version of storj/storj was used in uplink tests. This reverts commit `1ef06fae99`. Change-Id: I4d4fad5d1ea04ba15ff9d7bd765f7e078e9187c2	2021-10-12 15:39:54 +02:00
Michał Niewrzał	1ef06fae99	satellite/metabase: use storj.Nonce instead []byte We were using mixed types for nonce fields. Protobuf have storj.Nonce, metabase have []byte. This change is a refactoring to have everywere its possible only storj.Nonce. Change-Id: Id54bd8481f30c721cdaf3df79206d25e7cfdab55	2021-10-11 16:13:34 +00:00
Michał Niewrzał	c258f4bbac	private/testplanet: move Metabase outside Metainfo for satellite At some point we moved metabase package outside Metainfo but we didn't do that for satellite structure. This change refactors only tests. When uplink will be adjusted we can remove old entries in Metainfo struct. Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb	2021-09-09 07:15:51 +00:00
Cameron Ayer	a7cda642a5	satellite/repair: add logging for irreparable segments in checker If the checker sees an irreparable segment, log out some info so we can see what the problem is Change-Id: I76eda5270214205f4fefc646d6c391cc13ddcafd	2021-09-02 12:35:29 -04:00
Clement Sam	1f353f3231	segment/{metabase,repair}: change segment created_at column to not accept nulls This change adds a NOT NULL constraint to the created_at column in the segment table. All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc) Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52	2021-08-06 08:16:28 +00:00
Michał Niewrzał	0d8e7905c1	satellite/repair/checker: don't return error when joining loop Error from joining loop should not restart satellite. This will be the same error like for loop itself. In the same way we are handling joining error for other services that are using segment loop. Change-Id: Idf1035ef7f78462927bd23989ed8a4ee5826c49e	2021-08-03 12:56:42 +00:00

1 2 3

130 Commits