storj

Author	SHA1	Message	Date
paul cannon	915f3952af	satellite/repair: repair pieces on the same last_net We avoid putting more than one piece of a segment on the same /24 network (or /64 for ipv6). However, it is possible for multiple pieces of the same segment to move to the same network over time. Nodes can change addresses, or segments could be uploaded with dev settings, etc. We will call such pieces "clumped", as they are clumped into the same net, and are much more likely to be lost or preserved together. This change teaches the repair checker to recognize segments which have clumped pieces, and put them in the repair queue. It also teaches the repair worker to repair such segments (treating clumped pieces as "retrievable but unhealthy"; i.e., they will be replaced on new nodes if possible). Refs: https://github.com/storj/storj/issues/5391 Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908	2023-04-06 17:34:25 +00:00
Egon Elbre	48256c91b5	storage: move errors to better locations Change-Id: Ia44570949a8f6bb50220dc838c5b6aa21e851a4d	2023-04-06 17:26:29 +03:00
Egon Elbre	f5020de57c	storagenode/blobstore: move blob store logic The blobstore implementation is entirely related to storagenode, so the rightful place is together with the storagenode implementation. Fixes https://github.com/storj/storj/issues/5754 Change-Id: Ie6637b0262cf37af6c3e558556c7604d9dc3613d	2023-04-05 18:06:20 +00:00
paul cannon	9e6955cc17	satellite/repair: fix flaky TestFailedDataRepair and friends The following tests should be made less flaky by this change: - TestFailedDataRepair - TestOfflineNodeDataRepair - TestUnknownErrorDataRepair - TestMissingPieceDataRepair_Succeed - TestMissingPieceDataRepair - TestCorruptDataRepair_Succeed - TestCorruptDataRepair_Failed This follows on to a change in commit `6bb64796`. Nearly all tests in the repair suite used to rely on events happening in a certain order. After some of our performance work, those things no longer happen in that expected order every time. This caused much flakiness. The fix in `6bb64796` was sufficient for the tests operating directly on an `*ECRepairer` instance, but not for the tests that make use of the repairer by way of the repair queue and the repair worker. These tests needed a different way to indicate the number of expected failures. This change provides that different way. Refs: https://github.com/storj/storj/issues/5736 Refs: https://github.com/storj/storj/issues/5718 Refs: https://github.com/storj/storj/issues/5715 Refs: https://github.com/storj/storj/issues/5609 Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e	2023-04-04 18:08:52 +00:00
Márton Elek	ffaf15a3b0	satellite/overlay: remove unused mail service from overlay It was surprising that `satellite auditor` complained about SMTP mail settings, even though it is not supposed to send any mail. It looks like we can remove the mail service dependency, as it's not a hard requirement for overlay.Service. Change-Id: I29a52eeff3f967ddb2d74a09458dc0ee2f051bd7	2023-03-09 12:17:35 +00:00
JT Olio	5b0cada4b3	repairer: monitor non-nil limit amount Change-Id: I1a7b7a4a6716783449704cd8a7823090109a14de	2023-03-06 20:39:45 +00:00
paul cannon	20bcdeb8b1	satellite/repair: fix flaky test TestECREpairerGetOffline It was possible to get into a situation where successfulPieces = es.RequiredCount(), errorCount < minFailures, and inProgress == 0 (when the succeeding gets all completed before the failures), whereupon the last goroutine in the limiter would sit and wait forever for another goroutine to finish. This change corrects the handling of that situation. As an aside, this is really pretty confusing code and we should think about redoing the whole function. Change-Id: Ifa3d3ad92bc755e563fd06b2aa01ef6147075a69	2023-02-24 09:05:21 -06:00
Michal Niewrzal	16b7901fde	satellite/metabase: add piece size calculation to segment This code is essentially replacement for eestream.CalcPieceSize. To call eestream.CalcPieceSize we need eestream.RedundancyStrategy which is not trivial to get as it requires infectious.FEC. For example infectious.FEC creation is visible on GE loop observer CPU profile because we were doing this for each segment in DB. New method was added to storj.Redundancy and here we are just wiring it with metabase Segment. BenchmarkSegmentPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize BenchmarkSegmentPieceSize/eestream.CalcPieceSize-8 5822 189189 ns/op 9776 B/op 8 allocs/op BenchmarkSegmentPieceSize/segment.PieceSize BenchmarkSegmentPieceSize/segment.PieceSize-8 94721329 11.49 ns/op 0 B/op 0 allocs/op Change-Id: I5a8b4237aedd1424c54ed0af448061a236b00295	2023-02-22 11:04:02 +00:00
Egon Elbre	0cdef95d55	all: fix math/rand deprecations Change-Id: I4b966375697c0d409ce24cc7604f806973f8f22a	2023-02-17 15:05:54 +02:00
Michal Niewrzal	aba2f14595	satellite/metabase/rangedloop: few additions for monitoring Additional elements added: * monkit metric for observers methods like Start/Fork/Join/Finish to be able to check how much time those methods are taking * few more logs e.g. entries with processed range * segmentsProcessed metric to be able to check loop progress Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec	2023-02-17 08:46:00 +00:00
paul cannon	6bb6479690	satellite/repair: fix flakiness in tests Several tests using `(ECRepairer).Get()` have begun to exhibit flaky results. The tests are expecting to see failures in certain cases, but the failures are not present. It appears that the cause of this is that, sometimes, the fastest good nodes are able to satisfy the repairer (providing RequiredCount pieces) before the repairer is able to identify the problem scenario we have laid out. In this commit, we add an argument to `(ECRepairer).Get()` which specifies how many failure results are expected. In normal/production conditions, this parameter will be 0, meaning Get need not wait for any errors and should only report those that arrived while waiting for RequiredCount pieces (the existing behavior). But in these tests, we can request that Get() wait for enough results to see the errors we are expecting. Refs: https://github.com/storj/storj/issues/5593 Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8	2023-02-16 07:33:47 +00:00
Qweder93	d6a948f59d	satellite/repair: implemented ranged loop observer Implemented the observer and partial, and created new structures so that mon metrics remain the same as in the segment loop. Change-Id: I209c126096c84b94d4717332e56238266f6cd004	2023-01-23 14:23:03 +00:00
paul cannon	1854351da6	satellite/audit: teach Reporter about piecewise audits The Reporter is responsible for processing results from auditing operations, logging the results, disqualifying nodes that reached the maximum reverification count, and passing the results on to the reputation system. In this commit, we extend the Reporter so that it knows how to process the results of piecewise reverification audits. We also change most reporter-related tests so that reverifications happen as piecewise reverification audits, exercising the new code. Note that piecewise reverification audits are not yet being done outside of tests. In a later commit, we will switch from doing segmentwise reverifications to piecewise reverifications, as part of the audit-scaling effort. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9	2022-12-12 11:28:02 +00:00
Moby von Briesen	3501656e98	satellite/repair: Add flag to allow disabling reputation updates Reputation updates during repair currently consumes a lot of database resources. Sometimes increasing the rate of repair is more important than auditing a node based on whether they have or don't have the correct piece during repair. This is the job of the audit service. This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089 This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939 Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f	2022-11-24 08:31:11 -05:00
paul cannon	8b494f3740	satellite/audit: use db for auditor queue As part of the effort of splitting out the auditor workers to their own process, we are transitioning the communication between the auditor chore and the verification workers to a queue implemented in the database, rather than the sequence of in-memory queues we used to use. This logical database is safely partitionable from the rest of satelliteDB. Refs: https://github.com/storj/storj/issues/5251 Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc	2022-11-22 14:04:00 +00:00
Cameron	74ddfab810	satellite/overlay: insert DQ event into node events in overlay.DisqualifyNode Also, return node email from overlaycache db DisqualifyNode to be used in node events insertion Change-Id: I41534cf01351c1690c3966a8055c5fe6fcf0d6a6	2022-11-04 15:18:31 +00:00
Cameron	f06da25c3d	satellite/overlay: add nodeevents.DB to satellite overlay service Add nodeevents.DB to satellite overlay service so we can insert node events into the nodeevents DB. Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156	2022-11-02 15:56:37 +00:00
JT Olio	58a9c55f36	mod: bump dependencies - storj.io/common Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8	2022-10-19 17:01:53 +00:00
Cameron	a52f766273	satellite/overlay: add email-sending functionality to overlay service We want to send emails to SNOs. Node status changes go through the overlay service, so it's a good place to add the mail service. Add the mailservice.Service, satellite address, and satellite name to overlay service. Also add feature flag --overlay.send-node-emails Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24	2022-10-13 18:01:05 +00:00
Egon Elbre	8b70f969b6	all: fix nolint directives Change-Id: I261c8b12e4961e6401cc4024fa5abc35b1a5efa6	2022-10-11 18:31:20 +00:00
Egon Elbre	ff22fc7ddd	all: fix deprecated ioutil commands Change-Id: I59db35116ec7215a1b8e2ae7dbd319fa099adfac	2022-10-11 15:27:29 +00:00
Michal Niewrzal	5dc5f076c9	satellite/repair/checker: remove monitoring from fast methods It looks like monkit monitoring can add high CPU overhead to the segments loop observer. With this code we are changing how monitoring is initialized for observer methods. This optimization mainly affects the path where a segment is healthy and doesn't require repair. A benchmark is also added to show the difference between the old and new approach. Benchmark against 'main': name old time/op new time/op delta RemoteSegment/Cockroach/healthy_segment-8 8.55µs ± 4% 1.37µs ± 6% -84.03% (p=0.008 n=5+5) name old alloc/op new alloc/op delta RemoteSegment/Cockroach/healthy_segment-8 2.63kB ± 0% 0.17kB ± 0% -93.62% (p=0.008 n=5+5) name old allocs/op new allocs/op delta RemoteSegment/Cockroach/healthy_segment-8 54.0 ± 0% 8.0 ± 0% -85.19% (p=0.008 n=5+5) Change-Id: Ie138eab0d59e436395b13f57bdfb11f9871d4c18	2022-10-03 12:15:03 +00:00
Michal Niewrzal	1aecca1e76	satellite/repair/checker: tiny cleanup * unused slice removed * variable moved closer to place of use Change-Id: I86126b8337225d4b31cabf89bc9640add7409398	2022-09-26 11:20:10 +00:00
paul cannon	7f1cad6faf	satellite/repair: better handling of piece fetch errors We have an alert on `repair_too_many_nodes_failed` which fires too frequently. Every time so far, it has been because of a network blip of some nature on the satellite side. Satellite operators are expected to have other means in place for alerting on network problems and fixing them, so it's not necessary for the repair framework to act in that way. Instead, in this change, we change the way that `repair_too_many_nodes_failed` works. When a repair fails, we collect piece fetch errors by type and determine from them whether it looks like we are having network problems (most errors are connection failures, possibly also some successful connections which subsequently time out) or whether something else has happened. We will now only emit `repair_too_many_nodes_failed` when the outcome does not look like a network failure. In the network failure case, we will instead emit `repair_suspected_network_problem`. Refs: https://github.com/storj/storj/issues/4669 Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926	2022-09-23 09:35:06 +00:00
paul cannon	7d0885bbaa	satellite/repair: move over audit.Pieces This structure is entirely unused within the audit module, and is only used by repair code. Accordingly, this change moves the structure from audit code to repair code. Also, we take the opportunity here to rename the structure to something less generic. Refs: https://github.com/storj/storj/issues/4669 Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e	2022-09-22 16:43:03 +00:00
Márton Elek	4b1be6bf8e	storagenode/satellite: support different piece hash algorithms Change-Id: I3db321e79f12f3ebaa249e6c32fa37fd9615687e	2022-08-23 18:15:06 +00:00
paul cannon	0dcc0a9ee0	satellite/reputation: reconfigure lambda and alpha This is in response to community feedback that our existing reputation calculation is too likely to disqualify storage nodes unfairly with extreme swings up and down. For details and analysis, please see the data_loss_vs_dq_chance_sim.py tool, the "tuning reputation further.ipynb" Jupyter notebook in the storj/datascience repository, and the discussion at https://forum.storj.io/t/tuning-audit-scoring/14084 In brief: changing the lambda and initial-alpha parameters in this way causes the swings in reputation to be smaller and less likely to put a node past the disqualification threshold unfairly. Note: this change will cause a one-time reset of all (non-disqualified) node reputations, because the new initial alpha value of 1000 is dramatically different, and the disqualification threshold is going to be much higher. Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3	2022-08-17 18:52:53 +00:00
paul cannon	37a4edbaff	all: reformat comments as required by gofmt 1.19 I don't know why the go people thought this was a good idea, because this automatic reformatting is bound to do the wrong thing sometimes, which is very annoying. But I don't see a way to turn it off, so best to get this change out of the way. Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3	2022-08-10 18:24:55 +00:00
Michal Niewrzal	6cc2052f47	satellite: fix segment loop observers metrics We made an optimization for segment loop observers to avoid heavy monkit initialization on each call. It was applied to very frequently executed methods. Unfortunately we used the wrong monkit method to track function times. Instead of mon.Task we used mon.Func(). https://github.com/spacemonkeygo/monkit#how-it-works Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8	2022-08-10 14:13:16 +00:00
paul cannon	726c95160b	satellite/repair: avoid retrying GET_REPAIR incorrectly We retry a GET_REPAIR operation in one case, and one case only (as far as I can determine): when we are trying to connect to a node using its last known working IP and port combination rather than its supplied hostname, and we think the operation failed the first time because of a Dial failure. However, logs collected from storage node operators along with logs collected from satellites are strongly indicating that we are retrying GET_REPAIR operations in some cases even when we succeeded in connecting to the node the first time. This results in the node complaining loudly about being given a duplicate order limit (as it should), whereupon the satellite counts that as an unknown error and potentially penalizes the node. See discussion at https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36 . Investigation into this problem has revealed that `!piecestore.CloseError.Has(err)` may not be the best way of determining whether a problem occurred during Dial. In fact, it is probably downright Wrong. Handling of errors on a stream is somewhat complicated, but it would appear that there are several paths by which an RPC error originating on the remote side might show up during the Close() call, and would thus be labeled as a "CloseError". This change creates a new error class, repairer.ErrDialFailed, with which we will now wrap errors that _really definitely_ occurred during a Dial call. We will use this class to determine whether or not to retry a GET_REPAIR operation. The error will still also be wrapped with whatever wrapper classes it used to be wrapped with, so the potential for breakage here should be minimal. Refs: https://github.com/storj/storj/issues/4687 Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb	2022-07-18 05:11:56 +00:00
paul cannon	2f20bbf4d8	satellite/reputation: add a reputation write cache This should lower the amount of database load coming from reputation updates. Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949	2022-07-14 21:40:16 +00:00
Egon Elbre	48b0a65fbd	satellite/overlay: use ReadCache in Download/UploadSelectionCache sync2.ReadCache implements preemptive refreshing preventing stalling while it's being updated. Change-Id: Iee9ef36049b986f0e426c14a139b2bc9ac17fb53	2022-07-12 13:52:48 +03:00
Erik van Velzen	f23d5eb5a1	satellite/repair: remove superfluous conditional Change-Id: If80ae0a1a4ee436763ed437fc77b0ed26db17a68	2022-06-30 18:09:17 +00:00
Michał Niewrzał	7a2d2a36ca	satellite: use more optimal monkit call for loop observers methods Recently we applied this optimization to the metrics observer, and the time used by its method dropped from 12m to 3m for us1 (220m segments). It looks like it makes sense to apply the same code to all observers. Change-Id: I05898aaacbd9bcdf21babc7be9955da1db57bdf2	2022-05-20 11:03:41 +00:00
Erik van Velzen	db1cc8ca95	satellite/repair/checker: buffer repair queue Integrate previous changes. Speed up the segment loop by batch inserting into repair queue. Change-Id: Ib9f4962d91960d21bad298f7771345b0dd270276	2022-05-12 16:28:05 +00:00
Erik van Velzen	928375a67c	satellite/repair/queue: buffer batch insert Implement a buffer for inserting repair items into the queue in a batch. Part of https://github.com/storj/storj/issues/4727 Change-Id: I718472b2f2b1f4993c3d6f15c44923776407155a	2022-05-11 09:02:20 +00:00
paul cannon	fd01c6cc25	satellite/{repair,audit}: simplify reputation reporter Also, make it an interface so that the upcoming write cache can be dropped in to the same place. Change-Id: I2c286743825e647c0cef5b6578245391851fa10c	2022-05-10 14:04:43 +00:00
Erik van Velzen	26f495f717	satellite/repair: implementation of batch insert Part of https://github.com/storj/storj/issues/4727 Change-Id: I44990a7614af26f8ee0be9c7aed496a1dd9e5df7	2022-05-09 12:41:22 +00:00
Erik van Velzen	10d71a8a3c	satellite/satellitedb: outline for batch insert Part of https://github.com/storj/storj/issues/4727 Change-Id: I1a9ad3b009f363e37f5e68e810074eecb7448db3	2022-05-09 11:39:52 +00:00
Fadila Khadar	af1f0aa943	satellite/repair: lighten tests covering excluded countries TestSegmentInExcludedCountriesRepair and TestSegmentInExcludedCountriesRepairIrreparable are using 20 storage nodes. This change makes them use 7 by adjusting the test redundancy scheme. Change-Id: I1a44aa8b997d6edcc9a3305fdd0dac57e4d525b5	2022-05-04 07:51:59 +00:00
Yaroslav Vorobiov	3f47d19aa6	satellite/overlay: add disqualification reason Add disqualification reason to NodeDossier. Extend DB.DisqualifyNode with disqualification reason. Extend reputation Service.TestDisqualifyNode with disqualification reason. Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d	2022-04-20 13:29:31 +00:00
paul cannon	985ccbe721	satellite/repair: in dns redial, don't retry if CloseError To save load on DNS servers, the repair code first tries to dial the last known good ip and port for a node, and then falls back to a DNS lookup only if we fail to connect to the last known good ip and port. However, it looks like we are seeing errors during the client stream Close() call (probably due to quic-go code), and those are classified the same as errors encountered during Dial. The repairer code sees this error, assumes that we failed to contact the node, and retries- but since we did actually succeed in connecting the first time around, this results in submitting the same order limit (with the same serial number) to the storage node, which (rightfully) rejects it. So together with change I055c186d5fd4e79560f67763175bc3130b9bc7d2 in storj/uplink, this should avoid the double submission and avoid dinging nodes' suspension scores unfairly. See https://github.com/storj/storj/issues/4687. Also, moving the testsuite directory check up above check-monkit in the Jenkins Lint task, so that a non-tidy testsuite/go.mod can be recognized and handled before everything breaks weirdly and seemingly randomly later on. Change-Id: Icb2b05aaff921d0af6aba10e450ac7e0a7bb2655	2022-04-04 17:01:09 +00:00
Egon Elbre	566fc8ee25	satellite/repair: test inmemory/disk difference only once We don't need to have every single test for both, only one for each should be sufficient. For all other tests it doesn't matter which one we use. Change-Id: I9962206a4ee025d367332c29ea3e6bc9f0f9a1de	2022-03-29 14:08:13 +03:00
Qweder93	8b0988708a	satellite/repair: add test that confirms that repairer is ignoring copied segments Resolves https://github.com/storj/storj/issues/4485 Change-Id: Ic772643520124fe3f7eacf8b3bfbbb38982d4769	2022-03-16 09:00:34 +00:00
Fadila Khadar	29fd36a20e	satellite/repairer: handle excluded countries For nodes in excluded areas, we don't necessarily want to remove them from the pointer, but we do want to increase the number of pieces in the segment in case those excluded area nodes go down. To do that, we increase the number of pieces repaired by the number of pieces in excluded areas. Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a	2022-03-14 10:59:36 -04:00
Márton Elek	b3675c14d4	repairer: log piece id in case of a repair error Change-Id: Ia8da2da491a6674f669e62148fa42538278119ba	2022-03-09 17:34:14 +00:00
Fadila Khadar	e776c65172	satellite/checker: pieces in excluded countries are not healthy Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection. Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy. With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer. Another change will handle the logic at the repairer level. Fixes https://github.com/storj/team-metainfo/issues/95 Change-Id: I9231b32de117a116488de055a3e94efcabb46e81	2022-03-02 09:59:09 +00:00
paul cannon	12b3fb5fb0	cmd/satellite: add fetch-pieces command The "satellite fetch-pieces" command allows a satellite operator to fetch as many pieces of a segment as possible, along with their original order limits and hashes as provided by the storage nodes. The fetched pieces and associated info will be stored in a specified folder as they are, rather than being RS-decoded or decrypted. It is hoped that this will allow easier debugging of certain one-off problems we've observed in the wild. Change-Id: I42ae0e9ef0023538e42473a9be5a2460a3ac0f3a	2022-02-18 00:13:53 +00:00
Michał Niewrzał	bc161794fc	satellite/metabase: drop DeleteObjectLatestVersion method This method was never used, except tests. Change-Id: Idc1e69b2e2971995b5c4e6cf78a2b5fc69f39ad2	2022-02-02 14:33:48 +00:00
Yingrong Zhao	1f8f7ebf06	satellite/{audit, reputation}: fix potential nodes reputation status inconsistency The original design had a flaw which could cause a discrepancy in a node's reputation status between the reputations table and the nodes table. If a failure (network issue, db failure, satellite failure, etc.) happens between the update to the reputations table and the update to the nodes table, the data can be out of sync. This PR tries to fix the above issue by passing the node's reputation through from the beginning of an audit/repair (this data is from the nodes table) to the next update in the reputation service. If the updated reputation status from the service is different from the existing node status, the service will try to update the nodes table. In the case of a failure, the service will be able to try updating the nodes table again, since it can see the discrepancy in the data. This will allow both tables to be in sync eventually. Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122	2022-01-06 21:05:59 +00:00

254 Commits