storj

Author	SHA1	Message	Date
Michal Niewrzal	578724e9b1	satellite/repair/repairer: use KnownReliable to check segment pieces At the moment segment repairer is skipping offline nodes in checks like clumped pieces and off placement pieces. This change is fixing this problem using new version of KnownReliable method. New method is returning both online and offline nodes. Provided data can be used to find clumped and off placement pieces. We are not using DownloadSelectionCache anymore with segment repairer. https://github.com/storj/storj/issues/5998 Change-Id: I236a1926e21f13df4cdedc91130352d37ff97e18	2023-06-28 16:53:51 +00:00
paul cannon	355ea2133b	satellite/audit: remove pieces when audits fail When pieces fail an audit (hard fail, meaning the node acknowledged it did not have the piece or the piece was corrupted), we will now remove those pieces from the segment. Previously, we did not do this, and some node operators were seeing the same missing piece audited over and over again and losing reputation every time. This change will include both verification and reverification audits. It will also apply to pieces found to be bad during repair, if repair-to-reputation reporting is enabled. Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7	2023-06-22 14:19:00 +00:00
Michal Niewrzal	203c6be25f	satellite/repair/repairer: test repairing geofenced segment Additional test case to cover situation where we are trying to repair segment with specific placement set. We need to be sure that segment won't be repaired into nodes that are outside segment placement, even if that means that repair will fail. Change-Id: I99d238aa9d9b9606eaf89cd1cf587a2585faee91	2023-06-22 13:21:05 +00:00
Michal Niewrzal	cb9a7bdc71	satellite/repair/repairer: make DialTimeout configurable This change makes dial timeout configurable and also changes it from the default 20s to 5s. Main motivation is that during repair we often lose lots of time to dial which eventually will fail. New timeout should be still enough to dial but we will move forward quicker to next node if that one will fail. Timeout is also applied directly as context timeout in case we will use noise or tcp fast open one day. Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea	2023-06-16 12:23:25 +00:00
Michal Niewrzal	7c33521ace	satellite/repair/repairer: use placement to select nodes for repair upload We missed setting placement as a part of selection request. It can cause uploading repaired data out of specified placement. I will provide test as a separate change. Change-Id: I4efe67f2d5f545a1d70e831e5d297f0977a4eed1	2023-06-10 20:55:39 +02:00
paul cannon	25a5df9752	satellite/repair: don't reuse allNodeIDs We were reusing a slice to save on allocations, but it turns out the function using it was being called in multiple goroutines at the same time. This is definitely a problem with repairer/segments.go. I'm not 100% sure if it also is a problem with checker/observer.go, but I'm making the change there as well to be on the safe side for now. Repair workers only ran with this bug on testing satellites, and it looks like the worst that could have happened was that we repaired pieces off of well-behaved, non-clumped, in-placement nodes by mistake. Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d	2023-06-06 10:28:04 -05:00
Michal Niewrzal	337eb9be6a	satellite/repair/checker: put into queue segment off placement Checker when qualifying segment for repair is now looking at pieces location and if they are outside segment placement puts them into repair queue. Fixes https://github.com/storj/storj/issues/5895 Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7	2023-06-05 15:53:49 +00:00
Michal Niewrzal	128b0a86e3	satellite/repair/repairer: repair pieces out of placement Segment repairer should take into account segment 'placement' field and remove or repair pieces from nodes that are outside this placement. In case when after considering pieces out of placement we are still above repair threshold we are only updating segment pieces to remove problematic pieces. Otherwise we are doing regular repair. https://github.com/storj/storj/issues/5896 Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59	2023-06-05 14:48:36 +00:00
Michal Niewrzal	eabd9dd994	satellite/orders: remove unused argument Change-Id: I6c5221fc19f97ae6db5627d7239795ff663289e0	2023-05-22 14:35:08 +00:00
paul cannon	3dc01bd25d	satellite/repair: change how we log clumped pieces rather than only logging the last_nets we see in clumpedPieces, this will run through all the last_nets and log any that have more than one node. This should have the same outcome, except the counts will be 1 higher (because FindClumpedPieces won't include the first node found in a clumped network, and this will). This should be quite a bit faster. Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a	2023-05-19 10:38:50 +02:00
paul cannon	c856d45cc0	satellite/overlay: fix GetNodesNetworkInOrder We were using the UploadSelectionCache previously, which does _not_ have all nodes, or even all online nodes, in it. So all nodes with less than MinimumVersion, or with less than MinimumDiskSpace, or nodes suspended for unknown audit errors, or nodes that have started graceful exit, were all missing, and ended up having empty last_nets. Even with all that, I'm kind of surprised how many nodes this involved, but using the upload selection cache was definitely wrong. This change uses the download selection cache instead, which excludes nodes only when they are disqualified, gracefully exited (completely), or offline. Change-Id: Iaa07c988aa29c1eb05796ac48a6f19d69f5826c1	2023-05-19 08:08:08 +00:00
paul cannon	de737bdee9	satellite/repair: add flag for de-clumping behavior It seems that the "what pieces are clumped" code does not work right, so this logic is causing repair overload or other repair failures. Hide it behind a flag while we figure out what is going on, so that repair can still work in the meantime. Change-Id: If83ef7895cba870353a67ab13573193d92fff80b	2023-05-18 21:02:36 +00:00
paul cannon	958d8676d0	satellite/overlay: remove unnecessary test helper Change-Id: I8439eec4ed440f60353fc620ca906a917a03613c	2023-05-17 17:04:54 +00:00
paul cannon	1f4f79b6b3	satellite/repair: don't mark clumped segments as irreparable Clumped segments (segments with multiple pieces on the same subnet) may need repair, but the clumped pieces are considered retrievable and we don't need to call such segments irreparable. We do want to know where they're coming from, though, if we can, because we are seeing more than expected. Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9	2023-05-17 16:24:15 +00:00
paul cannon	75d10fe4fa	satellite/overlay: use UploadSelectionCache for GetNodesNetworkInOrder The query for GetNodesNetworkInOrder is causing far too much load on the database. Since it is not critical that the repair checker have perfectly up-to-date node network information, we can use a cache instead. Change-Id: I07ad45bfdeb46529da093941a06c2da8a00ce878	2023-05-16 17:32:09 +00:00
Michal Niewrzal	4bdbb25d83	satellite/metabase/rangedloop: move Segment definition We will remove segments loop soon so we need first to move Segment definition to rangedloop package. https://github.com/storj/storj/issues/5237 Change-Id: Ibe6aad316ffb7073cc4de166f1f17b87aac07363	2023-05-16 12:37:17 +00:00
igor gaidaienko	bc30deee11	satellite/repair: update repair test to use blake3 algo Update repair test to also use blake3 hashing algorithm https://github.com/storj/storj/issues/5649 Change-Id: Id8299576f8be4cfd84ddf9a6b852e653628ada72	2023-05-11 08:45:48 +00:00
Michal Niewrzal	36e046375c	satellite/repair/checker: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3	2023-05-08 12:19:13 +00:00
Michal Niewrzal	1aa24b9f0d	satellite/audit: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I9cec0ac454f40f19d52c078a8b1870c4d192bd7a	2023-04-24 15:52:11 +00:00
paul cannon	915f3952af	satellite/repair: repair pieces on the same last_net We avoid putting more than one piece of a segment on the same /24 network (or /64 for ipv6). However, it is possible for multiple pieces of the same segment to move to the same network over time. Nodes can change addresses, or segments could be uploaded with dev settings, etc. We will call such pieces "clumped", as they are clumped into the same net, and are much more likely to be lost or preserved together. This change teaches the repair checker to recognize segments which have clumped pieces, and put them in the repair queue. It also teaches the repair worker to repair such segments (treating clumped pieces as "retrievable but unhealthy"; i.e., they will be replaced on new nodes if possible). Refs: https://github.com/storj/storj/issues/5391 Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908	2023-04-06 17:34:25 +00:00
Egon Elbre	48256c91b5	storage: move errors to better locations Change-Id: Ia44570949a8f6bb50220dc838c5b6aa21e851a4d	2023-04-06 17:26:29 +03:00
Egon Elbre	f5020de57c	storagenode/blobstore: move blob store logic The blobstore implementation is entirely related to storagenode, so the rightful place is together with the storagenode implementation. Fixes https://github.com/storj/storj/issues/5754 Change-Id: Ie6637b0262cf37af6c3e558556c7604d9dc3613d	2023-04-05 18:06:20 +00:00
paul cannon	9e6955cc17	satellite/repair: fix flaky TestFailedDataRepair and friends The following tests should be made less flaky by this change: - TestFailedDataRepair - TestOfflineNodeDataRepair - TestUnknownErrorDataRepair - TestMissingPieceDataRepair_Succeed - TestMissingPieceDataRepair - TestCorruptDataRepair_Succeed - TestCorruptDataRepair_Failed This follows on to a change in commit `6bb64796`. Nearly all tests in the repair suite used to rely on events happening in a certain order. After some of our performance work, those things no longer happen in that expected order every time. This caused much flakiness. The fix in `6bb64796` was sufficient for the tests operating directly on an `*ECRepairer` instance, but not for the tests that make use of the repairer by way of the repair queue and the repair worker. These tests needed a different way to indicate the number of expected failures. This change provides that different way. Refs: https://github.com/storj/storj/issues/5736 Refs: https://github.com/storj/storj/issues/5718 Refs: https://github.com/storj/storj/issues/5715 Refs: https://github.com/storj/storj/issues/5609 Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e	2023-04-04 18:08:52 +00:00
Márton Elek	ffaf15a3b0	satellite/overlay: remove unused mail service from overlay It was surprising that `satellite auditor` complained about SMTP mail settings, even if it's not supposed to send any mail. Looks like we can remove the mail service dependency, as it's not a hard requirement for overlay.Service. Change-Id: I29a52eeff3f967ddb2d74a09458dc0ee2f051bd7	2023-03-09 12:17:35 +00:00
JT Olio	5b0cada4b3	repairer: monitor non-nil limit amount Change-Id: I1a7b7a4a6716783449704cd8a7823090109a14de	2023-03-06 20:39:45 +00:00
paul cannon	20bcdeb8b1	satellite/repair: fix flaky test TestECREpairerGetOffline It was possible to get into a situation where successfulPieces = es.RequiredCount(), errorCount < minFailures, and inProgress == 0 (when the succeeding gets all completed before the failures), whereupon the last goroutine in the limiter would sit and wait forever for another goroutine to finish. This change corrects the handling of that situation. As an aside, this is really pretty confusing code and we should think about redoing the whole function. Change-Id: Ifa3d3ad92bc755e563fd06b2aa01ef6147075a69	2023-02-24 09:05:21 -06:00
Michal Niewrzal	16b7901fde	satellite/metabase: add piece size calculation to segment This code is essentially replacement for eestream.CalcPieceSize. To call eestream.CalcPieceSize we need eestream.RedundancyStrategy which is not trivial to get as it requires infectious.FEC. For example infectious.FEC creation is visible on GE loop observer CPU profile because we were doing this for each segment in DB. New method was added to storj.Redundancy and here we are just wiring it with metabase Segment. BenchmarkSegmentPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize-8 5822 189189 ns/op 9776 B/op 8 allocs/op BenchmarkSegmentPieceSize/segment.PieceSize BenchmarkSegmentPieceSize/segment.PieceSize-8 94721329 11.49 ns/op 0 B/op 0 allocs/op Change-Id: I5a8b4237aedd1424c54ed0af448061a236b00295	2023-02-22 11:04:02 +00:00
Egon Elbre	0cdef95d55	all: fix math/rand deprecations Change-Id: I4b966375697c0d409ce24cc7604f806973f8f22a	2023-02-17 15:05:54 +02:00
Michal Niewrzal	aba2f14595	satellite/metabase/rangedloop: few additions for monitoring Additional elements added: * monkit metric for observers methods like Start/Fork/Join/Finish to be able to check how much time those methods are taking * few more logs e.g. entries with processed range * segmentsProcessed metric to be able to check loop progress Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec	2023-02-17 08:46:00 +00:00
paul cannon	6bb6479690	satellite/repair: fix flakiness in tests Several tests using `(ECRepairer).Get()` have begun to exhibit flaky results. The tests are expecting to see failures in certain cases, but the failures are not present. It appears that the cause of this is that, sometimes, the fastest good nodes are able to satisfy the repairer (providing RequiredCount pieces) before the repairer is able to identify the problem scenario we have laid out. In this commit, we add an argument to `(ECRepairer).Get()` which specifies how many failure results are expected. In normal/production conditions, this parameter will be 0, meaning Get need not wait for any errors and should only report those that arrived while waiting for RequiredCount pieces (the existing behavior). But in these tests, we can request that Get() wait for enough results to see the errors we are expecting. Refs: https://github.com/storj/storj/issues/5593 Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8	2023-02-16 07:33:47 +00:00
Qweder93	d6a948f59d	satellite/repair : implemented ranged loop observer implemented observer and partial, created new structures to keep mon metrics remain in same way as in segment loop Change-Id: I209c126096c84b94d4717332e56238266f6cd004	2023-01-23 14:23:03 +00:00
paul cannon	1854351da6	satellite/audit: teach Reporter about piecewise audits The Reporter is responsible for processing results from auditing operations, logging the results, disqualifying nodes that reached the maximum reverification count, and passing the results on to the reputation system. In this commit, we extend the Reporter so that it knows how to process the results of piecewise reverification audits. We also change most reporter-related tests so that reverifications happen as piecewise reverification audits, exercising the new code. Note that piecewise reverification audits are not yet being done outside of tests. In a later commit, we will switch from doing segmentwise reverifications to piecewise reverifications, as part of the audit-scaling effort. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9	2022-12-12 11:28:02 +00:00
Moby von Briesen	3501656e98	satellite/repair: Add flag to allow disabling reputation updates Reputation updates during repair currently consume a lot of database resources. Sometimes increasing the rate of repair is more important than auditing a node based on whether they have or don't have the correct piece during repair. This is the job of the audit service. This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089 This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939 Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f	2022-11-24 08:31:11 -05:00
paul cannon	8b494f3740	satellite/audit: use db for auditor queue As part of the effort of splitting out the auditor workers to their own process, we are transitioning the communication between the auditor chore and the verification workers to a queue implemented in the database, rather than the sequence of in-memory queues we used to use. This logical database is safely partitionable from the rest of satelliteDB. Refs: https://github.com/storj/storj/issues/5251 Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc	2022-11-22 14:04:00 +00:00
Cameron	74ddfab810	satellite/overlay: insert DQ event into node events in overlay.DisqualifyNode Also, return node email from overlaycache db DisqualifyNode to be used in node events insertion Change-Id: I41534cf01351c1690c3966a8055c5fe6fcf0d6a6	2022-11-04 15:18:31 +00:00
Cameron	f06da25c3d	satellite/overlay: add nodeevents.DB to satellite overlay service Add nodeevents.DB to satellite overlay service so we can insert node events into the nodeevents DB. Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156	2022-11-02 15:56:37 +00:00
JT Olio	58a9c55f36	mod: bump dependencies - storj.io/common Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8	2022-10-19 17:01:53 +00:00
Cameron	a52f766273	satellite/overlay: add email-sending functionality to overlay service We want to send emails to SNOs. Node status changes go through the overlay service, so it's a good place to add the mail service. Add the mailservice.Service, satellite address, and satellite name to overlay service. Also add feature flag --overlay.send-node-emails Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24	2022-10-13 18:01:05 +00:00
Egon Elbre	8b70f969b6	all: fix nolint directives Change-Id: I261c8b12e4961e6401cc4024fa5abc35b1a5efa6	2022-10-11 18:31:20 +00:00
Egon Elbre	ff22fc7ddd	all: fix deprecated ioutil commands Change-Id: I59db35116ec7215a1b8e2ae7dbd319fa099adfac	2022-10-11 15:27:29 +00:00
Michal Niewrzal	5dc5f076c9	satellite/repair/checker: remove monitoring from fast methods It looks that monkit monitoring can give high CPU overhead for segments loop observer. With this code we are changing how monitoring is initialized for observer methods. This optimization affects mainly path where segment is healthy and doesn't require repair. Benchmark is also added to show difference between old and new approach. Benchmark against 'main': name old time/op new time/op delta RemoteSegment/Cockroach/healthy_segment-8 8.55µs ± 4% 1.37µs ± 6% -84.03% (p=0.008 n=5+5) name old alloc/op new alloc/op delta RemoteSegment/Cockroach/healthy_segment-8 2.63kB ± 0% 0.17kB ± 0% -93.62% (p=0.008 n=5+5) name old allocs/op new allocs/op delta RemoteSegment/Cockroach/healthy_segment-8 54.0 ± 0% 8.0 ± 0% -85.19% (p=0.008 n=5+5) Change-Id: Ie138eab0d59e436395b13f57bdfb11f9871d4c18	2022-10-03 12:15:03 +00:00
Michal Niewrzal	1aecca1e76	satellite/repair/checker: tiny cleanup * unused slice removed * variable moved closer to place of use Change-Id: I86126b8337225d4b31cabf89bc9640add7409398	2022-09-26 11:20:10 +00:00
paul cannon	7f1cad6faf	satellite/repair: better handling of piece fetch errors We have an alert on `repair_too_many_nodes_failed` which fires too frequently. Every time so far, it has been because of a network blip of some nature on the satellite side. Satellite operators are expected to have other means in place for alerting on network problems and fixing them, so it's not necessary for the repair framework to act in that way. Instead, in this change, we change the way that `repair_too_many_nodes_failed` works. When a repair fails, we collect piece fetch errors by type and determine from them whether it looks like we are having network problems (most errors are connection failures, possibly also some successful connections which subsequently time out) or whether something else has happened. We will now only emit `repair_too_many_nodes_failed` when the outcome does not look like a network failure. In the network failure case, we will instead emit `repair_suspected_network_problem`. Refs: https://github.com/storj/storj/issues/4669 Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926	2022-09-23 09:35:06 +00:00
paul cannon	7d0885bbaa	satellite/repair: move over audit.Pieces This structure is entirely unused within the audit module, and is only used by repair code. Accordingly, this change moves the structure from audit code to repair code. Also, we take the opportunity here to rename the structure to something less generic. Refs: https://github.com/storj/storj/issues/4669 Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e	2022-09-22 16:43:03 +00:00
Márton Elek	4b1be6bf8e	storagenode/satellite: support different piece hash algorithms Change-Id: I3db321e79f12f3ebaa249e6c32fa37fd9615687e	2022-08-23 18:15:06 +00:00
paul cannon	0dcc0a9ee0	satellite/reputation: reconfigure lambda and alpha This is in response to community feedback that our existing reputation calculation is too likely to disqualify storage nodes unfairly with extreme swings up and down. For details and analysis, please see the data_loss_vs_dq_chance_sim.py tool, the "tuning reputation further.ipynb" Jupyter notebook in the storj/datascience repository, and the discussion at https://forum.storj.io/t/tuning-audit-scoring/14084 In brief: changing the lambda and initial-alpha parameters in this way causes the swings in reputation to be smaller and less likely to put a node past the disqualification threshold unfairly. Note: this change will cause a one-time reset of all (non-disqualified) node reputations, because the new initial alpha value of 1000 is dramatically different, and the disqualification threshold is going to be much higher. Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3	2022-08-17 18:52:53 +00:00
paul cannon	37a4edbaff	all: reformat comments as required by gofmt 1.19 I don't know why the go people thought this was a good idea, because this automatic reformatting is bound to do the wrong thing sometimes, which is very annoying. But I don't see a way to turn it off, so best to get this change out of the way. Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3	2022-08-10 18:24:55 +00:00
Michal Niewrzal	6cc2052f47	satellite: fix segment loop observers metrics We made optimization for segment loop observers to avoid heavy monkit initialization on each call. It was applied to very often executed methods. Unfortunately we used wrong monkit method to track function times. Instead of mon.Task we used mon.Func(). https://github.com/spacemonkeygo/monkit#how-it-works Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8	2022-08-10 14:13:16 +00:00
paul cannon	726c95160b	satellite/repair: avoid retrying GET_REPAIR incorrectly We retry a GET_REPAIR operation in one case, and one case only (as far as I can determine): when we are trying to connect to a node using its last known working IP and port combination rather than its supplied hostname, and we think the operation failed the first time because of a Dial failure. However, logs collected from storage node operators along with logs collected from satellites are strongly indicating that we are retrying GET_REPAIR operations in some cases even when we succeeded in connecting to the node the first time. This results in the node complaining loudly about being given a duplicate order limit (as it should), whereupon the satellite counts that as an unknown error and potentially penalizes the node. See discussion at https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36 . Investigation into this problem has revealed that `!piecestore.CloseError.Has(err)` may not be the best way of determining whether a problem occurred during Dial. In fact, it is probably downright Wrong. Handling of errors on a stream is somewhat complicated, but it would appear that there are several paths by which an RPC error originating on the remote side might show up during the Close() call, and would thus be labeled as a "CloseError". This change creates a new error class, repairer.ErrDialFailed, with which we will now wrap errors that _really definitely_ occurred during a Dial call. We will use this class to determine whether or not to retry a GET_REPAIR operation. The error will still also be wrapped with whatever wrapper classes it used to be wrapped with, so the potential for breakage here should be minimal. Refs: https://github.com/storj/storj/issues/4687 Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb	2022-07-18 05:11:56 +00:00
paul cannon	2f20bbf4d8	satellite/reputation: add a reputation write cache This should lower the amount of database load coming from reputation updates. Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949	2022-07-14 21:40:16 +00:00

273 Commits