storj

Author	SHA1	Message	Date
Michal Niewrzal	f5d717735b	satellite/repair: fix checker and repairer logging This is fixing two small issues with logging: * repair checker was logging empty last net values as clumped pieces but in main logic we stopped classifying it this way * repairer summary log was showing incorrect number of pieces removed from segment because list contains duplicated entries There was no real issue here. Change-Id: Ifc6a83637d621d628598200cad00cce44fa4cbf9	2023-10-25 22:55:53 +00:00
Márton Elek	188aa3011b	satellite/repair/checker: report checker_segment_off_placement_count per placement Change-Id: Ic1639899f8f0b55c4ef8fe246e7efc0a5d9a2bc1	2023-10-19 11:59:56 +00:00
Michal Niewrzal	0eaf43120b	satellite/repair/checker: optimize processing, part 3 ClassifySegmentPieces uses custom set implementation instead map. Side note, for custom set implementation I also checked int8 bit set but it didn't give better performance so I used simpler implementation. Benchmark results (compared against part 2 optimization change): name old time/op new time/op delta RemoteSegment/healthy_segment-8 21.7µs ± 8% 15.4µs ±16% -29.38% (p=0.008 n=5+5) name old alloc/op new alloc/op delta RemoteSegment/healthy_segment-8 7.41kB ± 0% 1.87kB ± 0% -74.83% (p=0.000 n=5+4) name old allocs/op new allocs/op delta RemoteSegment/healthy_segment-8 150 ± 0% 130 ± 0% -13.33% (p=0.008 n=5+5) Change-Id: I21feca9ec6ac0a2558ac5ce8894451c54f69e52d	2023-10-16 12:06:16 +00:00
Michal Niewrzal	e3e303754b	satellite/repair/checker: optimize processing, part 2 Optimizing collecting monkit metrics: * initialize metrics once at the beginning * avoid using string in map for getting stats structs per redundancy Benchmark results (compared against part 1 optimization change): name old time/op new time/op delta RemoteSegment/Cockroach/healthy_segment-8 31.4µs ± 6% 21.7µs ± 8% -30.73% (p=0.008 n=5+5) name old alloc/op new alloc/op delta RemoteSegment/healthy_segment-8 10.2kB ± 0% 7.4kB ± 0% -27.03% (p=0.008 n=5+5) name old allocs/op new allocs/op delta RemoteSegment/healthy_segment-8 250 ± 0% 150 ± 0% -40.00% (p=0.008 n=5+5) Change-Id: Ie09476eb469a4d6c09e52550c8ba92b3b4b34271	2023-10-12 10:02:53 +02:00
Michal Niewrzal	de4559d862	satellite/repair/checker: optimize processing, part 1 Optimization by reusing more slices. Benchmark result: name old time/op new time/op delta RemoteSegment/healthy_segment-8 33.2µs ± 1% 31.4µs ± 6% -5.49% (p=0.032 n=4+5) name old alloc/op new alloc/op delta RemoteSegment/healthy_segment-8 15.9kB ± 0% 10.2kB ± 0% -35.92% (p=0.008 n=5+5) name old allocs/op new allocs/op delta RemoteSegment/healthy_segment-8 280 ± 0% 250 ± 0% -10.71% (p=0.008 n=5+5) Change-Id: I60462169285462dee6cd16d4f4ce1f30fb6cdfdf	2023-10-11 15:50:29 +00:00
paul cannon	72189330fd	satellite/gracefulexit: revamp graceful exit Currently, graceful exit is a complicated subsystem that keeps a queue of all pieces expected to be on a node, and asks the node to transfer those pieces to other nodes one by one. The complexity of the system has, unfortunately, led to numerous bugs and unexpected behaviors. We have decided to remove this entire subsystem and restructure graceful exit as follows: * Nodes will signal their intent to exit gracefully * The satellite will not send any new pieces to gracefully exiting nodes * Pieces on gracefully exiting nodes will be considered by the repair subsystem as "retrievable but unhealthy". They will be repaired off of the exiting node as needed. * After one month (with an appropriately high online score), the node will be considered exited, and held amounts for the node will be released. The repair worker will continue to fetch pieces from the node as long as the node stays online. * If, at the end of the month, a node's online score is below a certain threshold, its graceful exit will fail. Refs: https://github.com/storj/storj/issues/6042 Change-Id: I52d4e07a4198e9cb2adf5e6cee2cb64d6f9f426b	2023-09-27 08:40:01 +00:00
paul cannon	1b8bd6c082	satellite/repair: unify repair logic The repair checker and repair worker both need to determine which pieces are healthy, which are retrievable, and which should be replaced, but they have been doing it in different ways in different code, which has been the cause of bugs. The same term could have very similar but subtly different meanings between the two, causing much confusion. With this change, the piece- and node-classification logic is consolidated into one place within the satellite/repair package, so that both subsystems can use it. This ought to make decision-making code more concise and more readable. The consolidated classification logic has been expanded to create more sets, so that the decision-making code does not need to do as much precalculation. It should now be clearer in comments and code that a piece can belong to multiple sets arbitrarily (except where the definition of the sets makes this logically impossible), and what the precise meaning of each set is. These sets include Missing, Suspended, Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair, UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy. Some other side effects of this change: * CreatePutRepairOrderLimits no longer needs to special-case excluded countries; it can just create as many order limits as requested (by way of len(newNodes)). * The repair checker will now queue a segment for repair when there are any pieces out of placement. The code calls this "forcing a repair". * The checker.ReliabilityCache is now accessed by way of a GetNodes() function similar to the one on the overlay. The classification methods like MissingPieces(), OutOfPlacementPieces(), and PiecesNodesLastNetsInOrder() are removed in favor of the classification logic in satellite/repair/classification.go. This means the reliability cache no longer needs access to the placement rules or excluded countries list. Change-Id: I105109fb94ee126952f07d747c6e11131164fadb	2023-09-25 09:42:08 -05:00
Márton Elek	b4fdc49194	satellite/repair/checker: persist placement information to the queue Change-Id: I51c7fd5a2a38f9f6620c16eddaed3b4915ffd792	2023-09-25 09:33:46 +00:00
Márton Elek	84ea80c1fd	satellite/repair/checker: respect autoExcludeSubnet annotation in checker rangedloop This patch is a oneliner: rangedloop checker should check the subnets only if it's not turned off with placement annotation. (see in satellite/repair/checker/observer.go). But I didn't find any unit test to cover that part, so I had to write one, and I preferred to write it as a unit test not an integration test, which requires a mock repair queue (observer_unit_test.go mock.go). Because it's small change, I also included a small change: creating a helper method to check if AutoExcludeSubnet annotation is defined Change-Id: I2666b937074ab57f603b356408ef108cd55bd6fd	2023-08-23 13:45:09 +00:00
Márton Elek	da08117fcd	satellite/~placement: do not ignore placement check for placement=0 There are cases when we would like to override the default placement=0 rule. For example when we would like to exclude tagged nodes from the selection (by default). Therefore we couldn't use a shortcut any more, we should always check the placement rules, even if we use placement=0. TODO: we need to update common, and rename `EveryCountry` to `DefaultPlacement`, just to avoid confusion. https://github.com/storj/storj/issues/6126 Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4	2023-08-16 07:06:56 +00:00
Michal Niewrzal	1d62dc63f5	satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation Currently, we have an issue where, while counting unhealthy pieces, we count twice a piece which is in an excluded country and is outside the segment placement. This can cause unnecessary repair. This change is also doing another step to move RepairExcludedCountryCodes from overlay config into repair package. Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f	2023-07-10 12:01:19 +02:00
Márton Elek	97a89c3476	satellite: switch to use nodefilters instead of old placement.AllowedCountry placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code. The 90% of this patch is just search and replace: * we need to use NodeFilters instead of placement.AllowedCountry * which means, we need an initialized PlacementRules available everywhere * which means we need to configure the placement rules The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change. Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df	2023-07-07 16:55:45 +00:00
Michal Niewrzal	21c1e66a85	satellite/overlay: refactor ReliabilityCache to keep more data ReliabilityCache will be now using refactored overlay Reliable method. This method will provide more info about nodes (e.g. country code) and with this we are able to add two dedicated methods to classify pieces: * OutOfPlacementPieces * PiecesNodesLastNetsInOrder With those new method we will fix issue where offline but reliable node won't be checked for clumped pieces and off placement pieces. https://github.com/storj/storj/issues/5998 Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82	2023-07-05 11:19:10 +02:00
Michal Niewrzal	f2cd7b0928	satellite/overlay: refactor Reliable to be used with repair checker Currently we are using Reliable to get missing pieces for repair checker. The issue is that now checker is looking at more things than just missing pieces (clumped/off placement pieces) and using only node ID is not enough. We have issue where we are skipping offline nodes from clumped and off placement pieces check. Reliable was refactored to get data (e.g. country, lastNet) about all reliable nodes. List is split into online and offline. This data will be cached for quick use by repair checker. It will be also possible to check nodes metadata like country code or lastNet. We are also slowly moving `RepairExcludedCountryCodes` config from overlay to repair which makes more sense for it. This is the first part of changes. https://github.com/storj/storj/issues/5998 Change-Id: If534342488c0e440affc2894a8fbda6507b8959d	2023-07-05 10:56:31 +02:00
paul cannon	25a5df9752	satellite/repair: don't reuse allNodeIDs We were reusing a slice to save on allocations, but it turns out the function using it was being called in multiple goroutines at the same time. This is definitely a problem with repairer/segments.go. I'm not 100% sure if it also is a problem with checker/observer.go, but I'm making the change there as well to be on the safe side for now. Repair workers only ran with this bug on testing satellites, and it looks like the worst that could have happened was that we repaired pieces off of well-behaved, non-clumped, in-placement nodes by mistake. Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d	2023-06-06 10:28:04 -05:00
Michal Niewrzal	337eb9be6a	satellite/repair/checker: put into queue segment off placement Checker when qualifying segment for repair is now looking at pieces location and if they are outside segment placement puts them into repair queue. Fixes https://github.com/storj/storj/issues/5895 Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7	2023-06-05 15:53:49 +00:00
paul cannon	3dc01bd25d	satellite/repair: change how we log clumped pieces rather than only logging the last_nets we see in clumpedPieces, this will run through all the last_nets and log any that have more than one node. This should have the same outcome, except the counts will be 1 higher (because FindClumpedPieces won't include the first node found in a clumped network, and this will). This should be quite a bit faster. Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a	2023-05-19 10:38:50 +02:00
paul cannon	de737bdee9	satellite/repair: add flag for de-clumping behavior It seems that the "what pieces are clumped" code does not work right, so this logic is causing repair overload or other repair failures. Hide it behind a flag while we figure out what is going on, so that repair can still work in the meantime. Change-Id: If83ef7895cba870353a67ab13573193d92fff80b	2023-05-18 21:02:36 +00:00
paul cannon	1f4f79b6b3	satellite/repair: don't mark clumped segments as irreparable Clumped segments (segments with multiple pieces on the same subnet) may need repair, but the clumped pieces are considered retrievable and we don't need to call such segments irreparable. We do want to know where they're coming from, though, if we can, because we are seeing more than expected. Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9	2023-05-17 16:24:15 +00:00
Michal Niewrzal	4bdbb25d83	satellite/metabase/rangedloop: move Segment definition We will remove segments loop soon so we need first to move Segment definition to rangedloop package. https://github.com/storj/storj/issues/5237 Change-Id: Ibe6aad316ffb7073cc4de166f1f17b87aac07363	2023-05-16 12:37:17 +00:00
Michal Niewrzal	36e046375c	satellite/repair/checker: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3	2023-05-08 12:19:13 +00:00
Michal Niewrzal	aba2f14595	satellite/metabase/rangedloop: few additions for monitoring Additional elements added: * monkit metric for observers methods like Start/Fork/Join/Finish to be able to check how much time those methods are taking * few more logs e.g. entries with processed range * segmentsProcessed metric to be able to check loop progress Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec	2023-02-17 08:46:00 +00:00
Qweder93	d6a948f59d	satellite/repair : implemented ranged loop observer implemented observer and partial, created new structures to keep mon metrics remain in same way as in segment loop Change-Id: I209c126096c84b94d4717332e56238266f6cd004	2023-01-23 14:23:03 +00:00

23 Commits