Commit Graph

47 Commits

Author SHA1 Message Date
Márton Elek
58b98bc335 satellite/repair: repair is configurable to work only on included/excluded placements
This patch finishes the placement aware repair.

We already introduced the parameters to select only the jobs for specific placements. The remaining part is to configure the exclude/include rules, plus a full e2e unit test.

Change-Id: I223ba84e8ab7481a53e5a444596c7a5ae51573c5
2023-09-27 14:54:06 +00:00
Márton Elek
c44e3d78d8 satellite/satellitedb: repairqueue.Select uses placement constraints
Change-Id: I59739926f8f6c5eaca3199369d4c5d88a9c08be8
2023-09-25 10:14:25 +00:00
Michal Niewrzal
47a4d4986d satellite/repair: enable declumping by default
This feature flag was disabled by default so it could be tested slowly. It has
been enabled for some time on one production satellite and on test satellites
without any issue. We can enable it by default in code.

Change-Id: If9c36895bbbea12bd4aefa30cb4df912e1729e4c
2023-07-17 15:02:35 +00:00
Michal Niewrzal
1d62dc63f5 satellite/repair/repairer: fix NumHealthyInExcludedCountries calculation
Currently, we have an issue where, while counting unhealthy pieces, we
count twice a piece which is in an excluded country and is outside the segment
placement. This can cause unnecessary repair.

This change also takes another step toward moving RepairExcludedCountryCodes
from the overlay config into the repair package.

Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
2023-07-10 12:01:19 +02:00
Michal Niewrzal
cb9a7bdc71 satellite/repair/repairer: make DialTimeout configurable
This change makes the dial timeout configurable and also changes its
default from 20s to 5s. The main motivation is that during repair we often lose
a lot of time on dials which eventually fail. The new timeout should
still be long enough to dial, but we will move on to the next node more quickly
if a dial fails.

The timeout is also applied directly as a context timeout in case we
use noise or TCP fast open one day.

Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea
2023-06-16 12:23:25 +00:00
Michal Niewrzal
128b0a86e3 satellite/repair/repairer: repair pieces out of placement
The segment repairer should take the segment's 'placement' field into account
and remove or repair pieces on nodes that are outside this placement.

If, after discounting pieces that are out of placement, we are still above the
repair threshold, we only update the segment pieces to remove the
problematic pieces. Otherwise we do a regular repair.

https://github.com/storj/storj/issues/5896

Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59
2023-06-05 14:48:36 +00:00
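The decision described in 128b0a86e3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the segment repairer's implementation: the `action` type, the `decide` function, and the piece counts in `main` are all hypothetical.

```go
package main

import "fmt"

// action is a hypothetical label for what happens to a segment.
type action string

const (
	actionNone         action = "none"
	actionUpdatePieces action = "update-pieces"
	actionRepair       action = "repair"
)

// decide is an illustrative sketch only.
// healthy is the number of retrievable pieces, including those out of placement.
// outOfPlacement is how many of those sit on nodes outside the segment placement.
// repairThreshold is the piece count at or below which a regular repair runs
// (an assumption about how the threshold is compared).
func decide(healthy, outOfPlacement, repairThreshold int) action {
	remaining := healthy - outOfPlacement
	if remaining <= repairThreshold {
		// Not above the threshold once bad pieces are discounted: regular repair.
		return actionRepair
	}
	if outOfPlacement > 0 {
		// Still above the threshold: only drop the problematic pieces.
		return actionUpdatePieces
	}
	return actionNone
}

func main() {
	fmt.Println(decide(60, 5, 52)) // 55 remaining, above threshold
	fmt.Println(decide(55, 5, 52)) // 50 remaining, not above threshold
	fmt.Println(decide(60, 0, 52)) // nothing out of placement
}
```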
paul cannon
de737bdee9 satellite/repair: add flag for de-clumping behavior
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.

Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.

Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
2023-05-18 21:02:36 +00:00
Michal Niewrzal
36e046375c satellite/repair/checker: remove segments loop parts
We are switching completely to ranged loop.

https://github.com/storj/storj/issues/5368

Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3
2023-05-08 12:19:13 +00:00
Egon Elbre
48256c91b5 storage: move errors to better locations
Change-Id: Ia44570949a8f6bb50220dc838c5b6aa21e851a4d
2023-04-06 17:26:29 +03:00
paul cannon
9e6955cc17 satellite/repair: fix flaky TestFailedDataRepair and friends
The following tests should be made less flaky by this change:

- TestFailedDataRepair
- TestOfflineNodeDataRepair
- TestUnknownErrorDataRepair
- TestMissingPieceDataRepair_Succeed
- TestMissingPieceDataRepair
- TestCorruptDataRepair_Succeed
- TestCorruptDataRepair_Failed

This follows on to a change in commit 6bb64796. Nearly all tests in the
repair suite used to rely on events happening in a certain order. After
some of our performance work, those things no longer happen in that
expected order every time. This caused much flakiness.

The fix in 6bb64796 was sufficient for the tests operating directly on
an `*ECRepairer` instance, but not for the tests that make use of the
repairer by way of the repair queue and the repair worker. These tests
needed a different way to indicate the number of expected failures. This
change provides that different way.

Refs: https://github.com/storj/storj/issues/5736
Refs: https://github.com/storj/storj/issues/5718
Refs: https://github.com/storj/storj/issues/5715
Refs: https://github.com/storj/storj/issues/5609
Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e
2023-04-04 18:08:52 +00:00
Qweder93
d6a948f59d satellite/repair : implemented ranged loop observer
Implemented the observer and partial, and created new structures so that mon
metrics stay the same as in the segment loop.

Change-Id: I209c126096c84b94d4717332e56238266f6cd004
2023-01-23 14:23:03 +00:00
Moby von Briesen
3501656e98 satellite/repair: Add flag to allow disabling reputation updates
Reputation updates during repair currently consume a lot of database
resources. Sometimes increasing the rate of repair is more important
than auditing a node based on whether they have or don't have the
correct piece during repair. This is the job of the audit service.

This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089
This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939

Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f
2022-11-24 08:31:11 -05:00
Erik van Velzen
f23d5eb5a1 satellite/repair: remove superfluous conditional
Change-Id: If80ae0a1a4ee436763ed437fc77b0ed26db17a68
2022-06-30 18:09:17 +00:00
Michał Niewrzał
d53aacc058 satellite/repair: migrate to new repair_queue table
We want to use StreamID/Position to identify an injured
segment. As it is hard to alter the existing injuredsegments
table, we are adding a new table that will replace the existing
one. The old table will be dropped later.

Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
2021-06-30 17:12:24 +02:00
Michał Niewrzał
a93e47514a satellite: remove irreparabledb
This is part of the metaloop refactoring. We planned to remove
irreparable at some point but there was no time for it.
Now, instead of refactoring it for segmentloop, it's just easier
to drop it.

Later we still need to drop table with migration step.

Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
2021-06-17 07:20:15 +00:00
JT Olio
da9ca0c650 testplanet/satellite: reduce the number of places default values need to be configured
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.

As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.

This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's harder to forget
to enable features in testplanet.

In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.

Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
2021-06-01 22:14:17 +00:00
Egon Elbre
10372afbe4 ci: fix lint errors
Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf
2021-05-17 13:37:31 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
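The stutter described in 961e841bd7 can be reproduced with a small sketch. It uses the standard library's `fmt.Errorf` wrapping as a stand-in and does not use the errs.Class API; the `wrap`, `stuttering`, and `concise` helpers are hypothetical, while the class names are taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// wrap prefixes err with a class name, roughly the way an error class would.
// It is a stdlib stand-in for illustration only.
func wrap(class string, err error) error {
	return fmt.Errorf("%s: %w", class, err)
}

// stuttering builds the error chain with "error" in every class name.
func stuttering() error {
	err := errors.New("no rows")
	err = wrap("node stats database error", err)
	err = wrap("satellitedbs error", err)
	return wrap("node stats service error", err)
}

// concise builds the same chain with short class names.
func concise() error {
	err := errors.New("no rows")
	err = wrap("nodestatsdb", err)
	err = wrap("satellitedbs", err)
	return wrap("nodestats service", err)
}

func main() {
	fmt.Println("ERROR", stuttering())
	fmt.Println("ERROR", concise())
}
```

Both chains carry the same three layers of context; only the class names differ.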
Michal Niewrzal
f7a31308db satellite/repair: enable TestRemoveExpiredSegmentFromQueue test
Change adds ability to set `now` time during test for repair.

Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f
2020-12-18 10:58:05 +00:00
Michal Niewrzal
8d3ea9c251 satellite/repair/repairer: implement SegmentRepairer with metabase
Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c
2020-12-17 10:47:21 +00:00
Stefan Benten
494bd5db81 all: golangci-lint v1.33.0 fixes (#3985)
2020-12-05 17:01:42 +01:00
Egon Elbre
7ce372c686 satellite/internalpb: add inspectors
Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a
2020-10-30 13:28:17 +02:00
Egon Elbre
004e610d0f satellite/internalpb: move datarepair.pb to internal
Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d
2020-10-30 13:28:14 +02:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
littleskunk
048ca4558f satellite/repair: clean up logging (#3833)
Co-authored-by: Michal Niewrzal <michal@storj.io>
2020-03-30 11:59:56 +02:00
Moby von Briesen
a933bcc99a satellite/repair/repairer/ec.go: add option for downloading pieces onto disk instead of in memory during repair
Add flag to satellite repairer, "InMemoryRepair" that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read from in the
upload phase of repair.

This should help with handling high repair traffic on satellites that
cannot afford to spend 64 MB of memory per repair worker.

Updates tests to test repair for both in memory and to disk.

Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194
2020-03-27 16:41:00 +00:00
paul cannon
79553059cb satellite/repair: put irreparable segments in irreparableDB
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).

Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.

This change also better differentiates some error cases from Repair()
for monitoring purposes.

Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
2020-03-09 21:45:16 +00:00
paul cannon
92d86fa044 satellite/repair: fix repair concurrency
This new repair timeout (configured as TotalTimeout) will include both
the time to download pieces and the time to upload pieces, as well as
the time to pop the segment from the repair queue.

This is a move from Github PR #3645.

Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8
2020-02-24 19:57:09 +00:00
Jeff Wendling
7999d24f81 all: use monkit v3
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.

graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.

it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.

Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
2020-02-05 23:53:17 +00:00
Moby von Briesen
006a2824ba satellite/repair: lock monkit stats in checker and repairer
Change-Id: Ia10fc8da0177389a500359ce51d21a5806f3f7b1
2020-01-30 14:09:56 +00:00
Egon Elbre
8dea4f52db satellite: add control panel
Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b
2020-01-30 08:06:43 -05:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
littleskunk
71b58edb2c satellite/repair: decrease repair interval
Change-Id: Id9efdbfaa82521c35dc41e7a52b700522c197e77
2019-12-10 00:36:00 +00:00
littleskunk
c52c7275ad satellite/repair: reduce upload timeout (#3597)
2019-11-18 18:52:56 +01:00
Egon Elbre
ee6c1cac8a private: rename internal to private (#3573)
2019-11-14 21:46:15 +02:00
Yingrong Zhao
bfa6699e2c satellite/repair: add timeout for repair download from a single node (#3418)
2019-10-30 16:31:08 -04:00
littleskunk
2a5526fcc4 satellite/repair: reduce timeout (#3302)
2019-10-18 13:43:24 +02:00
littleskunk
6e7607239c satellite/repair: improve logging (#3287)
* satellite/repair: improve logging

* use Stringer wherever possible
2019-10-16 17:28:56 +02:00
Maximillian von Briesen
289cfe8ff2 satellite/repair: do not log "retrieved segment" if repair queue empty (#2995)
2019-09-11 16:06:36 +03:00
Egon Elbre
a801fab66a all: add archview annotations (#2964)
2019-09-10 16:24:16 +03:00
Maximillian von Briesen
fb10815229 Repair with hashes (#2925)
* add outline for ECRepairer

* add description of process in TODO comments

* begin download/getting hash for a single piece

* verify piece hash and order limit during download

* fix download piece

* begin filling out ESREpair. Get

* wip move ecclient.Repair to ecrepairer.Repair

* pass satellite signee into repairer

* reconstruct original stripe from pieces

* move rebuildStripe()

* calculate piece size differently, increment successful count

* fix shares slices initialization

* rename stripeData to segment

* do not pad reader in Repair()

* temp debug

* create unsafeRSScheme

* use decode reader

* rename file name to be all lowercase

* make repair downloader async

* declare condition variable inside Get method

* set downloadAndVerifyPiece's in-memory buffer to be share size

* update unusedLimits var

* address comments

* remove unnecessary comments

* move initialization of segmentRepaire to be outside of repairer service

* use ReadAll during download

* remove dots and move hashing to after validating for order limit signature

* wip test

* make sure files exactly at min threshold are repaired

* remove unused code

* use corrupt data and write back to storagenode

* only create corrupted node and piece ids once

* add comment

* address nat's comment

* fix linting and checker_test

* update comment

* add comments

* remove "copied from ecclient" comments

* add clarification comments in ec.Repair
2019-09-06 15:20:36 -04:00
Egon Elbre
c8edeb0257 satellite/overlay: rename overlay.Cache to overlay.Service (#2717)
2019-08-06 19:35:59 +03:00
Bill Thorp
fcbc9d71da satellite/repair: add shouldDelete (#2702)
* add shouldDelete to repair
2019-08-05 11:09:16 -04:00
Alexander Leitner
4632ab0a67 Delete irreparable segments (#2642)
* Delete irreparable segments
2019-07-30 11:38:25 -04:00
Egon Elbre
e75813d094 satellite/repair: move segment repairer to satellite and simplify (#2651)
2019-07-29 13:24:56 +02:00
Egon Elbre
5d0816430f rename all the things (#2531)
* rename pkg/linksharing to linksharing
* rename pkg/httpserver to linksharing/httpserver
* rename pkg/eestream to uplink/eestream
* rename pkg/stream to uplink/stream
* rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo
* rename pkg/auth/signing to pkg/signing
* rename pkg/storage to uplink/storage
* rename pkg/accounting to satellite/accounting
* rename pkg/audit to satellite/audit
* rename pkg/certdb to satellite/certdb
* rename pkg/discovery to satellite/discovery
* rename pkg/overlay to satellite/overlay
* rename pkg/datarepair to satellite/repair
2019-07-28 08:55:36 +03:00