storj

Author	SHA1	Message	Date
Michal Niewrzal	0eaf43120b	satellite/repair/checker: optimize processing, part 3 ClassifySegmentPieces uses a custom set implementation instead of a map. Side note: for the custom set implementation I also tried an int8 bit set, but it didn't give better performance, so I used the simpler implementation. Benchmark results (compared against part 2 optimization change): name old time/op new time/op delta RemoteSegment/healthy_segment-8 21.7µs ± 8% 15.4µs ±16% -29.38% (p=0.008 n=5+5) name old alloc/op new alloc/op delta RemoteSegment/healthy_segment-8 7.41kB ± 0% 1.87kB ± 0% -74.83% (p=0.000 n=5+4) name old allocs/op new allocs/op delta RemoteSegment/healthy_segment-8 150 ± 0% 130 ± 0% -13.33% (p=0.008 n=5+5) Change-Id: I21feca9ec6ac0a2558ac5ce8894451c54f69e52d	2023-10-16 12:06:16 +00:00
paul cannon	ee33cb1289	satellite/repair: protect concurrent access to statsCollector It would appear that we have been making concurrent accesses to statsCollector for a long, long time (we expect there to be multiple calls to `Repair()` at the same time on the same instance of `SegmentRepairer`, up to `config.MaxRepair`), and before this change there was no sort of synchronization guarding accesses to the `statsCollector.stats` map. Refs: https://github.com/storj/storj/issues/6402 Change-Id: I5bcdd13c88913a8d66f6dd906c9037c588960cc9	2023-10-13 09:12:00 -05:00
Michal Niewrzal	de4559d862	satellite/repair/checker: optimize processing, part 1 Optimization by reusing more slices. Benchmark result: name old time/op new time/op delta RemoteSegment/healthy_segment-8 33.2µs ± 1% 31.4µs ± 6% -5.49% (p=0.032 n=4+5) name old alloc/op new alloc/op delta RemoteSegment/healthy_segment-8 15.9kB ± 0% 10.2kB ± 0% -35.92% (p=0.008 n=5+5) name old allocs/op new allocs/op delta RemoteSegment/healthy_segment-8 280 ± 0% 250 ± 0% -10.71% (p=0.008 n=5+5) Change-Id: I60462169285462dee6cd16d4f4ce1f30fb6cdfdf	2023-10-11 15:50:29 +00:00
Márton Elek	58b98bc335	satellite/repair: repair is configurable to work only on included/excluded placements This patch finishes the placement aware repair. We already introduced the parameters to select only the jobs for specific placements, the remaining part is just to configure the exclude/include rules. + a full e2e unit test. Change-Id: I223ba84e8ab7481a53e5a444596c7a5ae51573c5	2023-09-27 14:54:06 +00:00
paul cannon	72189330fd	satellite/gracefulexit: revamp graceful exit Currently, graceful exit is a complicated subsystem that keeps a queue of all pieces expected to be on a node, and asks the node to transfer those pieces to other nodes one by one. The complexity of the system has, unfortunately, led to numerous bugs and unexpected behaviors. We have decided to remove this entire subsystem and restructure graceful exit as follows: * Nodes will signal their intent to exit gracefully * The satellite will not send any new pieces to gracefully exiting nodes * Pieces on gracefully exiting nodes will be considered by the repair subsystem as "retrievable but unhealthy". They will be repaired off of the exiting node as needed. * After one month (with an appropriately high online score), the node will be considered exited, and held amounts for the node will be released. The repair worker will continue to fetch pieces from the node as long as the node stays online. * If, at the end of the month, a node's online score is below a certain threshold, its graceful exit will fail. Refs: https://github.com/storj/storj/issues/6042 Change-Id: I52d4e07a4198e9cb2adf5e6cee2cb64d6f9f426b	2023-09-27 08:40:01 +00:00
paul cannon	1b8bd6c082	satellite/repair: unify repair logic The repair checker and repair worker both need to determine which pieces are healthy, which are retrievable, and which should be replaced, but they have been doing it in different ways in different code, which has been the cause of bugs. The same term could have very similar but subtly different meanings between the two, causing much confusion. With this change, the piece- and node-classification logic is consolidated into one place within the satellite/repair package, so that both subsystems can use it. This ought to make decision-making code more concise and more readable. The consolidated classification logic has been expanded to create more sets, so that the decision-making code does not need to do as much precalculation. It should now be clearer in comments and code that a piece can belong to multiple sets arbitrarily (except where the definition of the sets makes this logically impossible), and what the precise meaning of each set is. These sets include Missing, Suspended, Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair, UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy. Some other side effects of this change: * CreatePutRepairOrderLimits no longer needs to special-case excluded countries; it can just create as many order limits as requested (by way of len(newNodes)). * The repair checker will now queue a segment for repair when there are any pieces out of placement. The code calls this "forcing a repair". * The checker.ReliabilityCache is now accessed by way of a GetNodes() function similar to the one on the overlay. The classification methods like MissingPieces(), OutOfPlacementPieces(), and PiecesNodesLastNetsInOrder() are removed in favor of the classification logic in satellite/repair/classification.go. This means the reliability cache no longer needs access to the placement rules or excluded countries list. Change-Id: I105109fb94ee126952f07d747c6e11131164fadb	2023-09-25 09:42:08 -05:00
Márton Elek	c44e3d78d8	satellite/satellitedb: repairqueue.Select uses placement constraints Change-Id: I59739926f8f6c5eaca3199369d4c5d88a9c08be8	2023-09-25 10:14:25 +00:00
Márton Elek	98921f9faa	satellite/overlay: fix placement selection config parsing When we do `satellite run api --placement '...'`, the placement rules are not parsed well. The problem is based on `viper.AllSettings()`, and the main logic is something like this (from a new unit test): ``` r := ConfigurablePlacementRule{} err := r.Set(p) require.NoError(t, err) serialized := r.String() r2 := ConfigurablePlacementRule{} err = r2.Set(serialized) require.NoError(t, err) require.Equal(t, p, r2.String()) ``` `AllSettings()` evaluates the placement rules in `ConfigurablePlacementRules` and stores the string representation. The problem is that we don't have a proper `String()` implementation (it prints out the structs instead of the original definition). There are two main solutions for this problem: 1. We can fix the `String()`. When we parse a placement rule, the `String()` method should print out the original definition 2. We can switch to use pure string as configuration parameter, and parse the rules only when required. I feel that 1 is error prone; we can do it (and in this patch I added a lot of `String()` implementations), but it's hard to be sure that our `String()` logic is in line with the parsing logic. Therefore I decided to make the configuration value of the placements a string (or a wrapper around string). That's the main reason why this patch seems to be big, as I updated all the usages. But the main part is in the beginning of `placement.go` (configuration parsing is not a pflag.Value implementation any more, but a separate step), and in `filter.go` (a few more String implementations for filters). https://github.com/storj/storj/issues/6248 Change-Id: I47c762d3514342b76a2e85683b1c891502a0756a	2023-09-21 14:31:41 +00:00
Márton Elek	e2006d821c	satellite/overlay: rename Reliable and KnownReliable to GetParticipatingNodes and GetNodes, respectively. We now want these functions to include offline and suspended nodes as well, so that we can force immediate repair when pieces are out of placement or in excluded countries. With that change, the old names no longer made sense. Change-Id: Icbcbad43dbde0ca8cbc80a4d17a896bb89b078b7	2023-09-02 23:34:50 +00:00
Márton Elek	c202929413	satellite/nodeselection: rename (NodeFilter).MatchInclude to Match As I learned, the `Include` was supposed to communicate that some internal change is also "included" in the filters during the check -> filters might be stateful. But that is no longer the case after `552242387`, where we removed the only stateful filter. Change-Id: I7c36ddadb2defbfa3b6b67bcc115e4427ba9e083	2023-08-31 16:17:52 +02:00
Márton Elek	84ea80c1fd	satellite/repair/checker: respect autoExcludeSubnet annotation in checker rangedloop This patch is a one-liner: the rangedloop checker should check the subnets only if it's not turned off with a placement annotation (see satellite/repair/checker/observer.go). But I didn't find any unit test to cover that part, so I had to write one, and I preferred to write it as a unit test, not an integration test, which requires a mock repair queue (observer_unit_test.go, mock.go). Because it's a small change, I also included another small change: creating a helper method to check if the AutoExcludeSubnet annotation is defined Change-Id: I2666b937074ab57f603b356408ef108cd55bd6fd	2023-08-23 13:45:09 +00:00
Márton Elek	f0afe0d2ea	satellite/repairer: ignore declumping when subnet filtering is turned off with filter annotation + restoring the functionality of repairer.doPlacementCheck Change-Id: I75521f2da280758345face07eeea661765717318	2023-08-18 09:35:38 +00:00
Márton Elek	9ddc8b4ca3	satellite/repair: piecescheck.OutOfPlacementPiecesSet should not contain offline nodes When we check the availability of the pieces, we do: ``` result.NumUnhealthyRetrievable = len(result.ClumpedPiecesSet) + len(result.OutOfPlacementPiecesSet) // + some magic if there are overlaps between them numHealthy := len(pieces) - len(piecesCheck.MissingPiecesSet) - piecesCheck.NumUnhealthyRetrievable ``` This works only if OutOfPlacementPieceSet doesn't contain the offline nodes (which are already included in MissingPieceSet). But `result.OutOfPlacementPieces.Set` should include all the nodes (even offline), as in case of lucky conditions, we are able to remove those pieces from DB. The solution is to remove all offline nodes from `NumUnhealthyRetrievable`. Change-Id: I90baa0396352dd040e1e1516314b3271f8712034	2023-08-16 17:35:10 +00:00
Márton Elek	de7aabc8c9	satellite/{repair,rangedloop,overlay}: fix node tag placement selection for repair This patch fixes the node tag based placement of rangedloop/repairchecker + repair process. The main change is just adding the node tags for the Reliable and KnownReliable database calls + adding new tests to prove it works. https://github.com/storj/storj/issues/6126 Change-Id: I245d654a18c1d61b2c72df49afa0718d0de76da1	2023-08-16 15:45:41 +00:00
Márton Elek	da08117fcd	satellite/~placement: do not ignore placement check for placement=0 There are cases when we would like to override the default placement=0 rule. For example when we would like to exclude tagged nodes from the selection (by default). Therefore we couldn't use a shortcut any more, we should always check the placement rules, even if we use placement=0. TODO: we need to update common, and rename `EveryCountry` to `DefaultPlacement`, just to avoid confusion. https://github.com/storj/storj/issues/6126 Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4	2023-08-16 07:06:56 +00:00
Michal Niewrzal	47a4d4986d	satellite/repair: enable declumping by default This feature flag was disabled by default so we could test it slowly. It has been enabled for some time on one production satellite and on test satellites without any issue. We can enable it by default in code. Change-Id: If9c36895bbbea12bd4aefa30cb4df912e1729e4c	2023-07-17 15:02:35 +00:00
Michal Niewrzal	5234727886	satellite/repair/repairer: fix flaky TestSegmentRepairPlacement Sometimes DownloadSelectionCache doesn't keep up with all node placement changes we are doing during this test. Change-Id: Idbda6511e3324b560cee3be85f980bf8d5b9b7ef	2023-07-14 10:10:40 +00:00
Michal Niewrzal	1d62dc63f5	satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation Currently, we have an issue where, while counting unhealthy pieces, we count twice a piece which is in an excluded country and is outside the segment placement. This can cause unnecessary repair. This change also takes another step to move RepairExcludedCountryCodes from the overlay config into the repair package. Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f	2023-07-10 12:01:19 +02:00
Márton Elek	97a89c3476	satellite: switch to use nodefilters instead of old placement.AllowedCountry placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code. The 90% of this patch is just search and replace: * we need to use NodeFilters instead of placement.AllowedCountry * which means, we need an initialized PlacementRules available everywhere * which means we need to configure the placement rules The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change. Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df	2023-07-07 16:55:45 +00:00
Márton Elek	70cdca5d3c	satellite: move satellite/nodeselection/uploadselection => satellite/nodeselection All the files in uploadselection are (in fact) related to generic node selection, and used not only for upload, but for download, repair, etc... Change-Id: Ie4098318a6f8f0bbf672d432761e87047d3762ab	2023-07-07 10:32:03 +02:00
Márton Elek	d38b8fa2c4	satellite/nodeselection: use the same Node object from overlay and nodeselection We use two different Node types in `overlay` and `uploadnodeselection` and converting back and forth. Using the same object would allow us to use a unified node selection interface everywhere. Change-Id: Ie71e29d60184ee0e5b4547eb54325f09c418f73c	2023-07-03 16:59:33 +00:00
Michal Niewrzal	578724e9b1	satellite/repair/repairer: use KnownReliable to check segment pieces At the moment segment repairer is skipping offline nodes in checks like clumped pieces and off placement pieces. This change is fixing this problem using new version of KnownReliable method. New method is returning both online and offline nodes. Provided data can be used to find clumped and off placement pieces. We are not using DownloadSelectionCache anymore with segment repairer. https://github.com/storj/storj/issues/5998 Change-Id: I236a1926e21f13df4cdedc91130352d37ff97e18	2023-06-28 16:53:51 +00:00
paul cannon	355ea2133b	satellite/audit: remove pieces when audits fail When pieces fail an audit (hard fail, meaning the node acknowledged it did not have the piece or the piece was corrupted), we will now remove those pieces from the segment. Previously, we did not do this, and some node operators were seeing the same missing piece audited over and over again and losing reputation every time. This change will include both verification and reverification audits. It will also apply to pieces found to be bad during repair, if repair-to-reputation reporting is enabled. Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7	2023-06-22 14:19:00 +00:00
Michal Niewrzal	203c6be25f	satellite/repair/repairer: test repairing geofenced segment Additional test case to cover situation where we are trying to repair segment with specific placement set. We need to be sure that segment won't be repaired into nodes that are outside segment placement, even if that means that repair will fail. Change-Id: I99d238aa9d9b9606eaf89cd1cf587a2585faee91	2023-06-22 13:21:05 +00:00
Michal Niewrzal	cb9a7bdc71	satellite/repair/repairer: make DialTimeout configurable This change makes the dial timeout configurable and also changes the default from 20s to 5s. The main motivation is that during repair we often lose a lot of time on dials which eventually fail. The new timeout should still be enough to dial, but we will move on more quickly to the next node if a dial fails. The timeout is also applied directly as a context timeout in case we use noise or tcp fast open one day. Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea	2023-06-16 12:23:25 +00:00
Michal Niewrzal	7c33521ace	satellite/repair/repairer: use placement to select nodes for repair upload We missed setting placement as a part of the selection request. This can cause repaired data to be uploaded outside the specified placement. I will provide a test as a separate change. Change-Id: I4efe67f2d5f545a1d70e831e5d297f0977a4eed1	2023-06-10 20:55:39 +02:00
paul cannon	25a5df9752	satellite/repair: don't reuse allNodeIDs We were reusing a slice to save on allocations, but it turns out the function using it was being called in multiple goroutines at the same time. This is definitely a problem with repairer/segments.go. I'm not 100% sure if it also is a problem with checker/observer.go, but I'm making the change there as well to be on the safe side for now. Repair workers only ran with this bug on testing satellites, and it looks like the worst that could have happened was that we repaired pieces off of well-behaved, non-clumped, in-placement nodes by mistake. Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d	2023-06-06 10:28:04 -05:00
Michal Niewrzal	128b0a86e3	satellite/repair/repairer: repair pieces out of placement Segment repairer should take into account segment 'placement' field and remove or repair pieces from nodes that are outside this placement. In case when after considering pieces out of placement we are still above repair threshold we are only updating segment pieces to remove problematic pieces. Otherwise we are doing regular repair. https://github.com/storj/storj/issues/5896 Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59	2023-06-05 14:48:36 +00:00
Michal Niewrzal	eabd9dd994	satellite/orders: remove unused argument Change-Id: I6c5221fc19f97ae6db5627d7239795ff663289e0	2023-05-22 14:35:08 +00:00
paul cannon	de737bdee9	satellite/repair: add flag for de-clumping behavior It seems that the "what pieces are clumped" code does not work right, so this logic is causing repair overload or other repair failures. Hide it behind a flag while we figure out what is going on, so that repair can still work in the meantime. Change-Id: If83ef7895cba870353a67ab13573193d92fff80b	2023-05-18 21:02:36 +00:00
Michal Niewrzal	36e046375c	satellite/repair/checker: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3	2023-05-08 12:19:13 +00:00
paul cannon	915f3952af	satellite/repair: repair pieces on the same last_net We avoid putting more than one piece of a segment on the same /24 network (or /64 for ipv6). However, it is possible for multiple pieces of the same segment to move to the same network over time. Nodes can change addresses, or segments could be uploaded with dev settings, etc. We will call such pieces "clumped", as they are clumped into the same net, and are much more likely to be lost or preserved together. This change teaches the repair checker to recognize segments which have clumped pieces, and put them in the repair queue. It also teaches the repair worker to repair such segments (treating clumped pieces as "retrievable but unhealthy"; i.e., they will be replaced on new nodes if possible). Refs: https://github.com/storj/storj/issues/5391 Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908	2023-04-06 17:34:25 +00:00
Egon Elbre	48256c91b5	storage: move errors to better locations Change-Id: Ia44570949a8f6bb50220dc838c5b6aa21e851a4d	2023-04-06 17:26:29 +03:00
paul cannon	9e6955cc17	satellite/repair: fix flaky TestFailedDataRepair and friends The following tests should be made less flaky by this change: - TestFailedDataRepair - TestOfflineNodeDataRepair - TestUnknownErrorDataRepair - TestMissingPieceDataRepair_Succeed - TestMissingPieceDataRepair - TestCorruptDataRepair_Succeed - TestCorruptDataRepair_Failed This follows on to a change in commit `6bb64796`. Nearly all tests in the repair suite used to rely on events happening in a certain order. After some of our performance work, those things no longer happen in that expected order every time. This caused much flakiness. The fix in `6bb64796` was sufficient for the tests operating directly on an `*ECRepairer` instance, but not for the tests that make use of the repairer by way of the repair queue and the repair worker. These tests needed a different way to indicate the number of expected failures. This change provides that different way. Refs: https://github.com/storj/storj/issues/5736 Refs: https://github.com/storj/storj/issues/5718 Refs: https://github.com/storj/storj/issues/5715 Refs: https://github.com/storj/storj/issues/5609 Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e	2023-04-04 18:08:52 +00:00
JT Olio	5b0cada4b3	repairer: monitor non-nil limit amount Change-Id: I1a7b7a4a6716783449704cd8a7823090109a14de	2023-03-06 20:39:45 +00:00
paul cannon	20bcdeb8b1	satellite/repair: fix flaky test TestECREpairerGetOffline It was possible to get into a situation where successfulPieces = es.RequiredCount(), errorCount < minFailures, and inProgress == 0 (when the succeeding gets all completed before the failures), whereupon the last goroutine in the limiter would sit and wait forever for another goroutine to finish. This change corrects the handling of that situation. As an aside, this is really pretty confusing code and we should think about redoing the whole function. Change-Id: Ifa3d3ad92bc755e563fd06b2aa01ef6147075a69	2023-02-24 09:05:21 -06:00
Michal Niewrzal	16b7901fde	satellite/metabase: add piece size calculation to segment This code is essentially replacement for eestream.CalcPieceSize. To call eestream.CalcPieceSize we need eestream.RedundancyStrategy which is not trivial to get as it requires infectious.FEC. For example infectious.FEC creation is visible on GE loop observer CPU profile because we were doing this for each segment in DB. New method was added to storj.Redundancy and here we are just wiring it with metabase Segment. BenchmarkSegmentPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize-8 5822 189189 ns/op 9776 B/op 8 allocs/op BenchmarkSegmentPieceSize/segment.PieceSize BenchmarkSegmentPieceSize/segment.PieceSize-8 94721329 11.49 ns/op 0 B/op 0 allocs/op Change-Id: I5a8b4237aedd1424c54ed0af448061a236b00295	2023-02-22 11:04:02 +00:00
paul cannon	6bb6479690	satellite/repair: fix flakiness in tests Several tests using `(ECRepairer).Get()` have begun to exhibit flaky results. The tests are expecting to see failures in certain cases, but the failures are not present. It appears that the cause of this is that, sometimes, the fastest good nodes are able to satisfy the repairer (providing RequiredCount pieces) before the repairer is able to identify the problem scenario we have laid out. In this commit, we add an argument to `(ECRepairer).Get()` which specifies how many failure results are expected. In normal/production conditions, this parameter will be 0, meaning Get need not wait for any errors and should only report those that arrived while waiting for RequiredCount pieces (the existing behavior). But in these tests, we can request that Get() wait for enough results to see the errors we are expecting. Refs: https://github.com/storj/storj/issues/5593 Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8	2023-02-16 07:33:47 +00:00
Qweder93	d6a948f59d	satellite/repair: implemented ranged loop observer Implemented observer and partial, and created new structures so that mon metrics remain the same as in the segment loop Change-Id: I209c126096c84b94d4717332e56238266f6cd004	2023-01-23 14:23:03 +00:00
paul cannon	1854351da6	satellite/audit: teach Reporter about piecewise audits The Reporter is responsible for processing results from auditing operations, logging the results, disqualifying nodes that reached the maximum reverification count, and passing the results on to the reputation system. In this commit, we extend the Reporter so that it knows how to process the results of piecewise reverification audits. We also change most reporter-related tests so that reverifications happen as piecewise reverification audits, exercising the new code. Note that piecewise reverification audits are not yet being done outside of tests. In a later commit, we will switch from doing segmentwise reverifications to piecewise reverifications, as part of the audit-scaling effort. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9	2022-12-12 11:28:02 +00:00
Moby von Briesen	3501656e98	satellite/repair: Add flag to allow disabling reputation updates Reputation updates during repair currently consumes a lot of database resources. Sometimes increasing the rate of repair is more important than auditing a node based on whether they have or don't have the correct piece during repair. This is the job of the audit service. This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089 This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939 Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f	2022-11-24 08:31:11 -05:00
JT Olio	58a9c55f36	mod: bump dependencies - storj.io/common Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8	2022-10-19 17:01:53 +00:00
Egon Elbre	ff22fc7ddd	all: fix deprecated ioutil commands Change-Id: I59db35116ec7215a1b8e2ae7dbd319fa099adfac	2022-10-11 15:27:29 +00:00
paul cannon	7f1cad6faf	satellite/repair: better handling of piece fetch errors We have an alert on `repair_too_many_nodes_failed` which fires too frequently. Every time so far, it has been because of a network blip of some nature on the satellite side. Satellite operators are expected to have other means in place for alerting on network problems and fixing them, so it's not necessary for the repair framework to act in that way. Instead, in this change, we change the way that `repair_too_many_nodes_failed` works. When a repair fails, we collect piece fetch errors by type and determine from them whether it looks like we are having network problems (most errors are connection failures, possibly also some successful connections which subsequently time out) or whether something else has happened. We will now only emit `repair_too_many_nodes_failed` when the outcome does not look like a network failure. In the network failure case, we will instead emit `repair_suspected_network_problem`. Refs: https://github.com/storj/storj/issues/4669 Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926	2022-09-23 09:35:06 +00:00
paul cannon	7d0885bbaa	satellite/repair: move over audit.Pieces This structure is entirely unused within the audit module, and is only used by repair code. Accordingly, this change moves the structure from audit code to repair code. Also, we take the opportunity here to rename the structure to something less generic. Refs: https://github.com/storj/storj/issues/4669 Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e	2022-09-22 16:43:03 +00:00
Márton Elek	4b1be6bf8e	storagenode/satellite: support different piece hash algorithms Change-Id: I3db321e79f12f3ebaa249e6c32fa37fd9615687e	2022-08-23 18:15:06 +00:00
paul cannon	726c95160b	satellite/repair: avoid retrying GET_REPAIR incorrectly We retry a GET_REPAIR operation in one case, and one case only (as far as I can determine): when we are trying to connect to a node using its last known working IP and port combination rather than its supplied hostname, and we think the operation failed the first time because of a Dial failure. However, logs collected from storage node operators along with logs collected from satellites are strongly indicating that we are retrying GET_REPAIR operations in some cases even when we succeeded in connecting to the node the first time. This results in the node complaining loudly about being given a duplicate order limit (as it should), whereupon the satellite counts that as an unknown error and potentially penalizes the node. See discussion at https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36 . Investigation into this problem has revealed that `!piecestore.CloseError.Has(err)` may not be the best way of determining whether a problem occurred during Dial. In fact, it is probably downright Wrong. Handling of errors on a stream is somewhat complicated, but it would appear that there are several paths by which an RPC error originating on the remote side might show up during the Close() call, and would thus be labeled as a "CloseError". This change creates a new error class, repairer.ErrDialFailed, with which we will now wrap errors that _really definitely_ occurred during a Dial call. We will use this class to determine whether or not to retry a GET_REPAIR operation. The error will still also be wrapped with whatever wrapper classes it used to be wrapped with, so the potential for breakage here should be minimal. Refs: https://github.com/storj/storj/issues/4687 Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb	2022-07-18 05:11:56 +00:00
Erik van Velzen	f23d5eb5a1	satellite/repair: remove superfluous conditional Change-Id: If80ae0a1a4ee436763ed437fc77b0ed26db17a68	2022-06-30 18:09:17 +00:00
paul cannon	fd01c6cc25	satellite/{repair,audit}: simplify reputation reporter Also, make it an interface so that the upcoming write cache can be dropped in to the same place. Change-Id: I2c286743825e647c0cef5b6578245391851fa10c	2022-05-10 14:04:43 +00:00
paul cannon	985ccbe721	satellite/repair: in dns redial, don't retry if CloseError To save load on DNS servers, the repair code first tries to dial the last known good ip and port for a node, and then falls back to a DNS lookup only if we fail to connect to the last known good ip and port. However, it looks like we are seeing errors during the client stream Close() call (probably due to quic-go code), and those are classified the same as errors encountered during Dial. The repairer code sees this error, assumes that we failed to contact the node, and retries- but since we did actually succeed in connecting the first time around, this results in submitting the same order limit (with the same serial number) to the storage node, which (rightfully) rejects it. So together with change I055c186d5fd4e79560f67763175bc3130b9bc7d2 in storj/uplink, this should avoid the double submission and avoid dinging nodes' suspension scores unfairly. See https://github.com/storj/storj/issues/4687. Also, moving the testsuite directory check up above check-monkit in the Jenkins Lint task, so that a non-tidy testsuite/go.mod can be recognized and handled before everything breaks weirdly and seemingly randomly later on. Change-Id: Icb2b05aaff921d0af6aba10e450ac7e0a7bb2655	2022-04-04 17:01:09 +00:00


151 Commits