storj

Author	SHA1	Message	Date
Michal Niewrzal	1d62dc63f5	satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation Currently, we have issue were while counting unhealthy pieces we are counting twice piece which is in excluded country and is outside segment placement. This can cause unnecessary repair. This change is also doing another step to move RepairExcludedCountryCodes from overlay config into repair package. Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f	2023-07-10 12:01:19 +02:00
Michal Niewrzal	cb9a7bdc71	satellite/repair/repairer: make DialTimeout configurable This change makes dial timeout configurable and change it also from defatul 20s to 5s. Main motivation is that during repair we often loose lots of time to dial which eventually will fail. New timeout should be still enough to dial but we will move forward quicker to next node if that one will fail. Timeout is also applied directly as context timeout in case we will use noise of tcp fast open one day. Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea	2023-06-16 12:23:25 +00:00
Michal Niewrzal	128b0a86e3	satellite/repair/repairer: repair pieces out of placement Segment repairer should take into account segment 'placement' field and remove or repair pieces from nodes that are outside this placement. In case when after considering pieces out of placement we are still above repair threshold we are only updating segment pieces to remove problematic pieces. Otherwise we are doing regular repair. https://github.com/storj/storj/issues/5896 Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59	2023-06-05 14:48:36 +00:00
paul cannon	de737bdee9	satellite/repair: add flag for de-clumping behavior It seems that the "what pieces are clumped" code does not work right, so this logic is causing repair overload or other repair failures. Hide it behind a flag while we figure out what is going on, so that repair can still work in the meantime. Change-Id: If83ef7895cba870353a67ab13573193d92fff80b	2023-05-18 21:02:36 +00:00
Michal Niewrzal	36e046375c	satellite/repair/checker: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3	2023-05-08 12:19:13 +00:00
Egon Elbre	48256c91b5	storage: move errors to better locations Change-Id: Ia44570949a8f6bb50220dc838c5b6aa21e851a4d	2023-04-06 17:26:29 +03:00
paul cannon	9e6955cc17	satellite/repair: fix flaky TestFailedDataRepair and friends The following tests should be made less flaky by this change: - TestFailedDataRepair - TestOfflineNodeDataRepair - TestUnknownErrorDataRepair - TestMissingPieceDataRepair_Succeed - TestMissingPieceDataRepair - TestCorruptDataRepair_Succeed - TestCorruptDataRepair_Failed This follows on to a change in commit `6bb64796`. Nearly all tests in the repair suite used to rely on events happening in a certain order. After some of our performance work, those things no longer happen in that expected order every time. This caused much flakiness. The fix in `6bb64796` was sufficient for the tests operating directly on an `*ECRepairer` instance, but not for the tests that make use of the repairer by way of the repair queue and the repair worker. These tests needed a different way to indicate the number of expected failures. This change provides that different way. Refs: https://github.com/storj/storj/issues/5736 Refs: https://github.com/storj/storj/issues/5718 Refs: https://github.com/storj/storj/issues/5715 Refs: https://github.com/storj/storj/issues/5609 Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e	2023-04-04 18:08:52 +00:00
Qweder93	d6a948f59d	satellite/repair : implemented ranged loop observer implemented observer and partial, created new structures to keep mon metrics remain in same way as in segment loop Change-Id: I209c126096c84b94d4717332e56238266f6cd004	2023-01-23 14:23:03 +00:00
Moby von Briesen	3501656e98	satellite/repair: Add flag to allow disabling reputation updates Reputation updates during repair currently consumes a lot of database resources. Sometimes increasing the rate of repair is more important than auditing a node based on whether they have or don't have the correct piece during repair. This is the job of the audit service. This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089 This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939 Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f	2022-11-24 08:31:11 -05:00
Erik van Velzen	f23d5eb5a1	satellite/repair: remove superfluous conditional Change-Id: If80ae0a1a4ee436763ed437fc77b0ed26db17a68	2022-06-30 18:09:17 +00:00
Michał Niewrzał	d53aacc058	satellite/repair: migrate to new repair_queue table We want to use StreamID/Position to identify injured segment. As it is hard to alter existing injuredsegments table we are adding a new table that will replace existing one. Old table will be dropped later. Change-Id: I0d3b06522645013178b6678c19378ebafe485c49	2021-06-30 17:12:24 +02:00
Michał Niewrzał	a93e47514a	satellite: remove irreparabledb This is part of metaloop refactoring. We plan to remove irreparable at some point but there was not time for it. Now instead refatoring it for segmentloop its just easier to drop it. Later we still need to drop table with migration step. Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4	2021-06-17 07:20:15 +00:00
JT Olio	da9ca0c650	testplanet/satellite: reduce the number of places default values need to be configured Satellites set their configuration values to default values using cfgstruct, however, it turns out our tests don't test these values at all! Instead, they have a completely separate definition system that is easy to forget about. As is to be expected, these values have drifted, and it appears in a few cases test planet is testing unreasonable values that we won't see in production, or perhaps worse, features enabled in production were missed and weren't enabled in testplanet. This change makes it so all values are configured the same, systematic way, so it's easy to see when test values are different than dev values or release values, and it's less hard to forget to enable features in testplanet. In terms of reviewing, this change should be actually fairly easy to review, considering private/testplanet/satellite.go keeps the current config system and the new one and confirms that they result in identical configurations, so you can be certain that nothing was missed and the config is all correct. You can also check the config lock to see what actual config values changed. Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e	2021-06-01 22:14:17 +00:00
Egon Elbre	10372afbe4	ci: fix lint errors Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf	2021-05-17 13:37:31 +00:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Michal Niewrzal	f7a31308db	satellite/repair: enable TestRemoveExpiredSegmentFromQueue test Change adds ability to set `now` time during test for repair. Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f	2020-12-18 10:58:05 +00:00
Michal Niewrzal	8d3ea9c251	satellite/repair/repairer: implement SegmentRepairer with metabase Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c	2020-12-17 10:47:21 +00:00
Stefan Benten	494bd5db81	all: golangci-lint v1.33.0 fixes (#3985 )	2020-12-05 17:01:42 +01:00
Egon Elbre	7ce372c686	satellite/internalpb: add inspectors Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a	2020-10-30 13:28:17 +02:00
Egon Elbre	004e610d0f	satellite/internalpb: move datarepair.pb to internal Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d	2020-10-30 13:28:14 +02:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
littleskunk	048ca4558f	satellite/repair: clean up logging (#3833 ) Co-authored-by: Michal Niewrzal <michal@storj.io>	2020-03-30 11:59:56 +02:00
Moby von Briesen	a933bcc99a	satellite/repair/repairer/ec.go: add option for downloading pieces onto disk instead of in memory during repair Add flag to satellite repairer, "InMemoryRepair" that allows the satellite to decide whether to download the entire segment being repaired into memory (this is what the satellite already does), or to download it into temporary files on disk that will be read from in the upload phase of repair. This should help with handling high repair traffic on satellites that cannot afford to spend 64mb of memory per repair worker. Updates tests to test repair for both in memory and to disk. Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194	2020-03-27 16:41:00 +00:00
paul cannon	79553059cb	satellite/repair: put irreparable segments in irreparableDB Previously, we were simply discarding rows from the repair queue when they couldn't be repaired (either because the overlay said too many nodes were down, or because we failed to download enough pieces). Now, such segments will be put into the irreparableDB for further and (hopefully) more focused attention. This change also better differentiates some error cases from Repair() for monitoring purposes. Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c	2020-03-09 21:45:16 +00:00
paul cannon	92d86fa044	satellite/repair: fix repair concurrency This new repair timeout (configured as TotalTimeout) will include both the time to download pieces and the time to upload pieces, as well as the time to pop the segment from the repair queue. This is a move from Github PR #3645. Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8	2020-02-24 19:57:09 +00:00
Jeff Wendling	7999d24f81	all: use monkit v3 this commit updates our monkit dependency to the v3 version where it outputs in an influx style. this makes discovery much easier as many tools are built to look at it this way. graphite and rothko will suffer some due to no longer being a tree based on dots. hopefully time will exist to update rothko to index based on the new metric format. it adds an influx output for the statreceiver so that we can write to influxdb v1 or v2 directly. Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff	2020-02-05 23:53:17 +00:00
Moby von Briesen	006a2824ba	satellite/repair: lock monkit stats in checker and repairer Change-Id: Ia10fc8da0177389a500359ce51d21a5806f3f7b1	2020-01-30 14:09:56 +00:00
Egon Elbre	8dea4f52db	satellite: add control panel Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b	2020-01-30 08:06:43 -05:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
littleskunk	71b58edb2c	satellite/repair: decrease repair interval Change-Id: Id9efdbfaa82521c35dc41e7a52b700522c197e77	2019-12-10 00:36:00 +00:00
littleskunk	c52c7275ad	satellite/repair: reduce upload timeout (#3597 )	2019-11-18 18:52:56 +01:00
Egon Elbre	ee6c1cac8a	private: rename internal to private (#3573 )	2019-11-14 21:46:15 +02:00
Yingrong Zhao	bfa6699e2c	satellite/repair: add timeout for repair download from a single node(#3418 )	2019-10-30 16:31:08 -04:00
littleskunk	2a5526fcc4	satellite/repair: reduce timeout (#3302 )	2019-10-18 13:43:24 +02:00
littleskunk	6e7607239c	satellite/repair: improve logging (#3287 ) * satellite/repair: improve logging * use Stringer wherever possible	2019-10-16 17:28:56 +02:00
Maximillian von Briesen	289cfe8ff2	satellite/repair: do not log "retrieved segment" if repair queue empty (#2995 )	2019-09-11 16:06:36 +03:00
Egon Elbre	a801fab66a	all: add archview annotations (#2964 )	2019-09-10 16:24:16 +03:00
Maximillian von Briesen	fb10815229	Repair with hashes (#2925 ) * add outline for ECRepairer * add description of process in TODO comments * begin download/getting hash for a single piece * verify piece hash and order limit during download * fix download piece * begin filling out ESREpair. Get * wip move ecclient.Repair to ecrepairer.Repair * pass satellite signee into repairer * reconstruct original stripe from pieces * move rebuildStripe() * calculate piece size differently, increment successful count * fix shares slices initialization * rename stripeData to segment * do not pad reader in Repair() * temp debug * create unsafeRSScheme * use decode reader * rename file name to be all lowercase * make repair downloader async * declare condition variable inside Get method * set downloadAndVerifyPiece's in-memory buffer to be share size * update unusedLimits var * address comments * remove unnecessary comments * move initialization of segmentRepaire to be outside of repairer service * use ReadAll during download * remove dots and move hashing to after validating for order limit signature * wip test * make sure files exactly at min threshold are repaired * remove unused code * use corrput data and write back to storagenode * only create corrupted node and piece ids once * add comment * address nat's comment * fix linting and checker_test * update comment * add comments * remove "copied from ecclient" comments * add clarification comments in ec.Repair	2019-09-06 15:20:36 -04:00
Egon Elbre	c8edeb0257	satellite/overlay: rename overlay.Cache to overlay.Service (#2717 )	2019-08-06 19:35:59 +03:00
Bill Thorp	fcbc9d71da	satellite/repair: add shouldDelete (#2702 ) * add shouldDelete to repair	2019-08-05 11:09:16 -04:00
Alexander Leitner	4632ab0a67	Delete irreparable segments (#2642 ) * Delete irreparable segments	2019-07-30 11:38:25 -04:00
Egon Elbre	e75813d094	satellite/repair: move segment repairer to satellite and simplify (#2651 )	2019-07-29 13:24:56 +02:00
Egon Elbre	5d0816430f	rename all the things (#2531 ) * rename pkg/linksharing to linksharing * rename pkg/httpserver to linksharing/httpserver * rename pkg/eestream to uplink/eestream * rename pkg/stream to uplink/stream * rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo * rename pkg/auth/signing to pkg/signing * rename pkg/storage to uplink/storage * rename pkg/accounting to satellite/accounting * rename pkg/audit to satellite/audit * rename pkg/certdb to satellite/certdb * rename pkg/discovery to satellite/discovery * rename pkg/overlay to satellite/overlay * rename pkg/datarepair to satellite/repair	2019-07-28 08:55:36 +03:00

44 Commits