storj

Author	SHA1	Message	Date
Michal Niewrzal	578724e9b1	satellite/repair/repairer: use KnownReliable to check segment pieces At the moment segment repairer is skipping offline nodes in checks like clumped pieces and off placement pieces. This change is fixing this problem using new version of KnownReliable method. New method is returning both online and offline nodes. Provided data can be used to find clumped and off placement pieces. We are not using DownloadSelectionCache anymore with segment repairer. https://github.com/storj/storj/issues/5998 Change-Id: I236a1926e21f13df4cdedc91130352d37ff97e18	2023-06-28 16:53:51 +00:00
paul cannon	355ea2133b	satellite/audit: remove pieces when audits fail When pieces fail an audit (hard fail, meaning the node acknowledged it did not have the piece or the piece was corrupted), we will now remove those pieces from the segment. Previously, we did not do this, and some node operators were seeing the same missing piece audited over and over again and losing reputation every time. This change will include both verification and reverification audits. It will also apply to pieces found to be bad during repair, if repair-to-reputation reporting is enabled. Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7	2023-06-22 14:19:00 +00:00
Michal Niewrzal	203c6be25f	satellite/repair/repairer: test repairing geofenced segment Additional test case to cover situation where we are trying to repair segment with specific placement set. We need to be sure that segment won't be repaired into nodes that are outside segment placement, even if that means that repair will fail. Change-Id: I99d238aa9d9b9606eaf89cd1cf587a2585faee91	2023-06-22 13:21:05 +00:00
Michal Niewrzal	cb9a7bdc71	satellite/repair/repairer: make DialTimeout configurable This change makes the dial timeout configurable and also changes it from the default 20s to 5s. The main motivation is that during repair we often lose lots of time on dials which will eventually fail. The new timeout should still be enough to dial, but we will move forward more quickly to the next node if that one fails. The timeout is also applied directly as a context timeout in case we use noise or tcp fast open one day. Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea	2023-06-16 12:23:25 +00:00
Michal Niewrzal	7c33521ace	satellite/repair/repairer: use placement to select nodes for repair upload We failed to set placement as part of the selection request. This can cause repaired data to be uploaded outside the specified placement. I will provide a test as a separate change. Change-Id: I4efe67f2d5f545a1d70e831e5d297f0977a4eed1	2023-06-10 20:55:39 +02:00
paul cannon	25a5df9752	satellite/repair: don't reuse allNodeIDs We were reusing a slice to save on allocations, but it turns out the function using it was being called in multiple goroutines at the same time. This is definitely a problem with repairer/segments.go. I'm not 100% sure if it also is a problem with checker/observer.go, but I'm making the change there as well to be on the safe side for now. Repair workers only ran with this bug on testing satellites, and it looks like the worst that could have happened was that we repaired pieces off of well-behaved, non-clumped, in-placement nodes by mistake. Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d	2023-06-06 10:28:04 -05:00
Michal Niewrzal	128b0a86e3	satellite/repair/repairer: repair pieces out of placement Segment repairer should take into account segment 'placement' field and remove or repair pieces from nodes that are outside this placement. In case when after considering pieces out of placement we are still above repair threshold we are only updating segment pieces to remove problematic pieces. Otherwise we are doing regular repair. https://github.com/storj/storj/issues/5896 Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59	2023-06-05 14:48:36 +00:00
Michal Niewrzal	eabd9dd994	satellite/orders: remove unused argument Change-Id: I6c5221fc19f97ae6db5627d7239795ff663289e0	2023-05-22 14:35:08 +00:00
paul cannon	de737bdee9	satellite/repair: add flag for de-clumping behavior It seems that the "what pieces are clumped" code does not work right, so this logic is causing repair overload or other repair failures. Hide it behind a flag while we figure out what is going on, so that repair can still work in the meantime. Change-Id: If83ef7895cba870353a67ab13573193d92fff80b	2023-05-18 21:02:36 +00:00
Michal Niewrzal	36e046375c	satellite/repair/checker: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3	2023-05-08 12:19:13 +00:00
paul cannon	915f3952af	satellite/repair: repair pieces on the same last_net We avoid putting more than one piece of a segment on the same /24 network (or /64 for ipv6). However, it is possible for multiple pieces of the same segment to move to the same network over time. Nodes can change addresses, or segments could be uploaded with dev settings, etc. We will call such pieces "clumped", as they are clumped into the same net, and are much more likely to be lost or preserved together. This change teaches the repair checker to recognize segments which have clumped pieces, and put them in the repair queue. It also teaches the repair worker to repair such segments (treating clumped pieces as "retrievable but unhealthy"; i.e., they will be replaced on new nodes if possible). Refs: https://github.com/storj/storj/issues/5391 Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908	2023-04-06 17:34:25 +00:00
Egon Elbre	48256c91b5	storage: move errors to better locations Change-Id: Ia44570949a8f6bb50220dc838c5b6aa21e851a4d	2023-04-06 17:26:29 +03:00
paul cannon	9e6955cc17	satellite/repair: fix flaky TestFailedDataRepair and friends The following tests should be made less flaky by this change: - TestFailedDataRepair - TestOfflineNodeDataRepair - TestUnknownErrorDataRepair - TestMissingPieceDataRepair_Succeed - TestMissingPieceDataRepair - TestCorruptDataRepair_Succeed - TestCorruptDataRepair_Failed This follows on to a change in commit `6bb64796`. Nearly all tests in the repair suite used to rely on events happening in a certain order. After some of our performance work, those things no longer happen in that expected order every time. This caused much flakiness. The fix in `6bb64796` was sufficient for the tests operating directly on an `*ECRepairer` instance, but not for the tests that make use of the repairer by way of the repair queue and the repair worker. These tests needed a different way to indicate the number of expected failures. This change provides that different way. Refs: https://github.com/storj/storj/issues/5736 Refs: https://github.com/storj/storj/issues/5718 Refs: https://github.com/storj/storj/issues/5715 Refs: https://github.com/storj/storj/issues/5609 Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e	2023-04-04 18:08:52 +00:00
JT Olio	5b0cada4b3	repairer: monitor non-nil limit amount Change-Id: I1a7b7a4a6716783449704cd8a7823090109a14de	2023-03-06 20:39:45 +00:00
paul cannon	20bcdeb8b1	satellite/repair: fix flaky test TestECREpairerGetOffline It was possible to get into a situation where successfulPieces = es.RequiredCount(), errorCount < minFailures, and inProgress == 0 (when the succeeding gets all completed before the failures), whereupon the last goroutine in the limiter would sit and wait forever for another goroutine to finish. This change corrects the handling of that situation. As an aside, this is really pretty confusing code and we should think about redoing the whole function. Change-Id: Ifa3d3ad92bc755e563fd06b2aa01ef6147075a69	2023-02-24 09:05:21 -06:00
Michal Niewrzal	16b7901fde	satellite/metabase: add piece size calculation to segment This code is essentially replacement for eestream.CalcPieceSize. To call eestream.CalcPieceSize we need eestream.RedundancyStrategy which is not trivial to get as it requires infectious.FEC. For example infectious.FEC creation is visible on GE loop observer CPU profile because we were doing this for each segment in DB. New method was added to storj.Redundancy and here we are just wiring it with metabase Segment. BenchmarkSegmentPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize-8 5822 189189 ns/op 9776 B/op 8 allocs/op BenchmarkSegmentPieceSize/segment.PieceSize BenchmarkSegmentPieceSize/segment.PieceSize-8 94721329 11.49 ns/op 0 B/op 0 allocs/op Change-Id: I5a8b4237aedd1424c54ed0af448061a236b00295	2023-02-22 11:04:02 +00:00
paul cannon	6bb6479690	satellite/repair: fix flakiness in tests Several tests using `(ECRepairer).Get()` have begun to exhibit flaky results. The tests are expecting to see failures in certain cases, but the failures are not present. It appears that the cause of this is that, sometimes, the fastest good nodes are able to satisfy the repairer (providing RequiredCount pieces) before the repairer is able to identify the problem scenario we have laid out. In this commit, we add an argument to `(ECRepairer).Get()` which specifies how many failure results are expected. In normal/production conditions, this parameter will be 0, meaning Get need not wait for any errors and should only report those that arrived while waiting for RequiredCount pieces (the existing behavior). But in these tests, we can request that Get() wait for enough results to see the errors we are expecting. Refs: https://github.com/storj/storj/issues/5593 Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8	2023-02-16 07:33:47 +00:00
Qweder93	d6a948f59d	satellite/repair : implemented ranged loop observer implemented observer and partial, created new structures to keep mon metrics remain in same way as in segment loop Change-Id: I209c126096c84b94d4717332e56238266f6cd004	2023-01-23 14:23:03 +00:00
paul cannon	1854351da6	satellite/audit: teach Reporter about piecewise audits The Reporter is responsible for processing results from auditing operations, logging the results, disqualifying nodes that reached the maximum reverification count, and passing the results on to the reputation system. In this commit, we extend the Reporter so that it knows how to process the results of piecewise reverification audits. We also change most reporter-related tests so that reverifications happen as piecewise reverification audits, exercising the new code. Note that piecewise reverification audits are not yet being done outside of tests. In a later commit, we will switch from doing segmentwise reverifications to piecewise reverifications, as part of the audit-scaling effort. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9	2022-12-12 11:28:02 +00:00
Moby von Briesen	3501656e98	satellite/repair: Add flag to allow disabling reputation updates Reputation updates during repair currently consumes a lot of database resources. Sometimes increasing the rate of repair is more important than auditing a node based on whether they have or don't have the correct piece during repair. This is the job of the audit service. This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089 This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939 Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f	2022-11-24 08:31:11 -05:00
JT Olio	58a9c55f36	mod: bump dependencies - storj.io/common Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8	2022-10-19 17:01:53 +00:00
Egon Elbre	ff22fc7ddd	all: fix deprecated ioutil commands Change-Id: I59db35116ec7215a1b8e2ae7dbd319fa099adfac	2022-10-11 15:27:29 +00:00
paul cannon	7f1cad6faf	satellite/repair: better handling of piece fetch errors We have an alert on `repair_too_many_nodes_failed` which fires too frequently. Every time so far, it has been because of a network blip of some nature on the satellite side. Satellite operators are expected to have other means in place for alerting on network problems and fixing them, so it's not necessary for the repair framework to act in that way. Instead, in this change, we change the way that `repair_too_many_nodes_failed` works. When a repair fails, we collect piece fetch errors by type and determine from them whether it looks like we are having network problems (most errors are connection failures, possibly also some successful connections which subsequently time out) or whether something else has happened. We will now only emit `repair_too_many_nodes_failed` when the outcome does not look like a network failure. In the network failure case, we will instead emit `repair_suspected_network_problem`. Refs: https://github.com/storj/storj/issues/4669 Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926	2022-09-23 09:35:06 +00:00
paul cannon	7d0885bbaa	satellite/repair: move over audit.Pieces This structure is entirely unused within the audit module, and is only used by repair code. Accordingly, this change moves the structure from audit code to repair code. Also, we take the opportunity here to rename the structure to something less generic. Refs: https://github.com/storj/storj/issues/4669 Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e	2022-09-22 16:43:03 +00:00
Márton Elek	4b1be6bf8e	storagenode/satellite: support different piece hash algorithms Change-Id: I3db321e79f12f3ebaa249e6c32fa37fd9615687e	2022-08-23 18:15:06 +00:00
paul cannon	726c95160b	satellite/repair: avoid retrying GET_REPAIR incorrectly We retry a GET_REPAIR operation in one case, and one case only (as far as I can determine): when we are trying to connect to a node using its last known working IP and port combination rather than its supplied hostname, and we think the operation failed the first time because of a Dial failure. However, logs collected from storage node operators along with logs collected from satellites are strongly indicating that we are retrying GET_REPAIR operations in some cases even when we succeeded in connecting to the node the first time. This results in the node complaining loudly about being given a duplicate order limit (as it should), whereupon the satellite counts that as an unknown error and potentially penalizes the node. See discussion at https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36 . Investigation into this problem has revealed that `!piecestore.CloseError.Has(err)` may not be the best way of determining whether a problem occurred during Dial. In fact, it is probably downright Wrong. Handling of errors on a stream is somewhat complicated, but it would appear that there are several paths by which an RPC error originating on the remote side might show up during the Close() call, and would thus be labeled as a "CloseError". This change creates a new error class, repairer.ErrDialFailed, with which we will now wrap errors that _really definitely_ occurred during a Dial call. We will use this class to determine whether or not to retry a GET_REPAIR operation. The error will still also be wrapped with whatever wrapper classes it used to be wrapped with, so the potential for breakage here should be minimal. Refs: https://github.com/storj/storj/issues/4687 Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb	2022-07-18 05:11:56 +00:00
Erik van Velzen	f23d5eb5a1	satellite/repair: remove superfluous conditional Change-Id: If80ae0a1a4ee436763ed437fc77b0ed26db17a68	2022-06-30 18:09:17 +00:00
paul cannon	fd01c6cc25	satellite/{repair,audit}: simplify reputation reporter Also, make it an interface so that the upcoming write cache can be dropped in to the same place. Change-Id: I2c286743825e647c0cef5b6578245391851fa10c	2022-05-10 14:04:43 +00:00
paul cannon	985ccbe721	satellite/repair: in dns redial, don't retry if CloseError To save load on DNS servers, the repair code first tries to dial the last known good ip and port for a node, and then falls back to a DNS lookup only if we fail to connect to the last known good ip and port. However, it looks like we are seeing errors during the client stream Close() call (probably due to quic-go code), and those are classified the same as errors encountered during Dial. The repairer code sees this error, assumes that we failed to contact the node, and retries- but since we did actually succeed in connecting the first time around, this results in submitting the same order limit (with the same serial number) to the storage node, which (rightfully) rejects it. So together with change I055c186d5fd4e79560f67763175bc3130b9bc7d2 in storj/uplink, this should avoid the double submission and avoid dinging nodes' suspension scores unfairly. See https://github.com/storj/storj/issues/4687. Also, moving the testsuite directory check up above check-monkit in the Jenkins Lint task, so that a non-tidy testsuite/go.mod can be recognized and handled before everything breaks weirdly and seemingly randomly later on. Change-Id: Icb2b05aaff921d0af6aba10e450ac7e0a7bb2655	2022-04-04 17:01:09 +00:00
Fadila Khadar	29fd36a20e	satellite/repairer: handle excluded countries For nodes in excluded areas, we don't necessarily want to remove them from the pointer, but we do want to increase the number of pieces in the segment in case those excluded area nodes go down. To do that, we increase the number of pieces repaired by the number of pieces in excluded areas. Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a	2022-03-14 10:59:36 -04:00
Márton Elek	b3675c14d4	repairer: log piece id in case of a repair error Change-Id: Ia8da2da491a6674f669e62148fa42538278119ba	2022-03-09 17:34:14 +00:00
paul cannon	12b3fb5fb0	cmd/satellite: add fetch-pieces command The "satellite fetch-pieces" command allows a satellite operator to fetch as many pieces of a segment as possible, along with their original order limits and hashes as provided by the storage nodes. The fetched pieces and associated info will be stored on in a specified folder as they are, rather than being RS-decoded or decrypted. It is hoped that this will allow easier debugging of certain one-off problems we've observed in the wild. Change-Id: I42ae0e9ef0023538e42473a9be5a2460a3ac0f3a	2022-02-18 00:13:53 +00:00
Yingrong Zhao	1f8f7ebf06	satellite/{audit, reputation}: fix potential nodes reputation status inconsistency The original design had a flaw which can potentially cause discrepancy for nodes reputation status between reputations table and nodes table. In the event of a failure(network issue, db failure, satellite failure, etc.) happens between update to reputations table and update to nodes table, data can be out of sync. This PR tries to fix above issue by passing through node's reputation from the beginning of an audit/repair(this data is from nodes table) to the next update in reputation service. If the updated reputation status from the service is different from the existing node status, the service will try to update nodes table. In the case of a failure, the service will be able to try update nodes table again since it can see the discrepancy of the data. This will allow both tables to be in-sync eventually. Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122	2022-01-06 21:05:59 +00:00
Yingrong Zhao	336500c04d	satellite/repair: only record audit result if segment can be downloaded If satellite can't find enough nodes to successfully download a segment, it probably is not the fault of storage nodes. Change-Id: I681f66056df0bb940da9edb3a7dbb3658c0a56cb	2021-11-17 15:25:43 +00:00
Artur M. Wolff	89639199fe	satellite/repair/repairer: remove unused healthyMap This change removes unused healthyMap from (*SegmentRepairer).Repair. Change-Id: Ie80eefdb5b7125bf70986cb13462eee737af214c	2021-11-16 16:49:18 +01:00
Yingrong Zhao	35e4a87e60	satellite/repair: ignore expired segments at the beginning of the repair work Since we have changed the repair worker to also mark a node as audit failure if they return a not found error, we should ignore expired segments when possible Change-Id: Ie6a677e1d7b234e93965c736d05950440236653c	2021-10-18 18:15:39 +00:00
Yaroslav Vorobiov	4b79f5ea86	satellite/repair: test if audit scores increases during repair Update repair tests to check if audit score increases for nodes that successfully send pieces during successful and failed repairs. Change-Id: Ie6abbde6155ab4697d209366c9fa497e731756e9	2021-10-04 19:39:13 +00:00
Yaroslav Vorobiov	469ae72c19	satellite/repair: update audit records during repair Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0	2021-09-24 00:48:13 +00:00
Cameron Ayer	51fdceafef	satellite/repair: increment repair_too_many_nodes_failed with 0 for redash alerting Change-Id: I990c8df7be30493705278b24954262834a1ed81f	2021-08-27 17:42:11 +00:00
Cameron Ayer	26f839a445	satellite/repair/repairer: if not enough nodes for repair order limits, increment metric and log as irreparable segment Change-Id: I4bd46f28d64278c8d463e885ad221aafb6ce7cf3	2021-08-27 13:42:28 +00:00
Cameron Ayer	dc69e1b16e	satellite/repair: use mutex instead of channel to collect download errors Change-Id: I3f958e9cc95126a25f73ccd105e614b51089edc5	2021-08-10 15:29:39 +00:00
Cameron Ayer	a8f125c671	satellite:{audit,repair}: log additional info when we can't download enough pieces When we can't complete an audit or repair, we need more information about what happened during each individual share/piece download. In audit, add the number of offline, unknown, contained, failed nodes to the error log. In repair, combine the errors from each download and add them to the error log. Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba	2021-08-09 22:57:49 +00:00
Clement Sam	1f353f3231	segment/{metabase,repair}: change segment created_at column to not accept nulls This change adds a NOT NULL constraint to the created_at column in the segment table. All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc) Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52	2021-08-06 08:16:28 +00:00
Clement Sam	f06e7c5f60	segment/{metabase,repair}: add dedicated methods on metabase.Pieces This change adds dedicated methods on metabase.Pieces to be able to add, remove pieces and also to check duplicates. Change-Id: I21aaeff40c017c2ebe1cc85a864ae546754769cc	2021-08-03 15:12:03 +00:00
Yingrong Zhao	f8914ccce0	satellite/{repair, overlay}: use reputation store in repair Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a	2021-07-28 13:22:05 -04:00
Cameron Ayer	449c873681	satellite/repair/repairer: attempt repair GETs using nodes' last IP and port first Sometimes we see timeouts from DNS lookups when trying to do repair GETs. Solution: try using node's last IP and port first. If we can't connect, retry with DNS lookup. Change-Id: I59e223aebb436118779fb18378f6e09d072f12be	2021-07-21 13:13:06 +00:00
Cameron Ayer	373ba8fd27	satellite/repair/repairer: metrics for repair bytes uploaded and downloaded Change-Id: Icb0850692ecc155f6c8169edf1b045b2b546ff48	2021-07-21 09:23:19 +00:00
Michał Niewrzał	d53aacc058	satellite/repair: migrate to new repair_queue table We want to use StreamID/Position to identify injured segment. As it is hard to alter existing injuredsegments table we are adding a new table that will replace existing one. Old table will be dropped later. Change-Id: I0d3b06522645013178b6678c19378ebafe485c49	2021-06-30 17:12:24 +02:00
Michał Niewrzał	a93e47514a	satellite: remove irreparabledb This is part of the metaloop refactoring. We planned to remove irreparable at some point but there was no time for it. Now, instead of refactoring it for segmentloop, it's just easier to drop it. Later we still need to drop the table with a migration step. Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4	2021-06-17 07:20:15 +00:00
JT Olio	da9ca0c650	testplanet/satellite: reduce the number of places default values need to be configured Satellites set their configuration values to default values using cfgstruct, however, it turns out our tests don't test these values at all! Instead, they have a completely separate definition system that is easy to forget about. As is to be expected, these values have drifted, and it appears in a few cases test planet is testing unreasonable values that we won't see in production, or perhaps worse, features enabled in production were missed and weren't enabled in testplanet. This change makes it so all values are configured the same, systematic way, so it's easy to see when test values are different than dev values or release values, and it's less hard to forget to enable features in testplanet. In terms of reviewing, this change should be actually fairly easy to review, considering private/testplanet/satellite.go keeps the current config system and the new one and confirms that they result in identical configurations, so you can be certain that nothing was missed and the config is all correct. You can also check the config lock to see what actual config values changed. Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e	2021-06-01 22:14:17 +00:00

130 Commits