Commit Graph

87 Commits

paul cannon
25a5df9752 satellite/repair: don't reuse allNodeIDs
We were reusing a slice to save on allocations, but it turns out the
function using it was being called in multiple goroutines at the same
time.

This is definitely a problem with repairer/segments.go. I'm not 100%
sure if it also is a problem with checker/observer.go, but I'm making
the change there as well to be on the safe side for now.

Repair workers only ran with this bug on testing satellites, and it
looks like the worst that could have happened was that we repaired
pieces off of well-behaved, non-clumped, in-placement nodes by mistake.

Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d
2023-06-06 10:28:04 -05:00
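
To illustrate the race described above: a shared, reused slice is not safe to append into from multiple goroutines, while a per-call allocation is. A minimal Go sketch (the types and names here are illustrative, not the satellite's actual code):

    package main

    import (
        "fmt"
        "sync"
    )

    // NodeID stands in for storj.NodeID.
    type NodeID int

    // sharedIDs is the kind of reused buffer the commit removes: concurrent
    // callers appending into it would race and see each other's entries.
    var sharedIDs []NodeID

    // collectNodeIDs allocates a fresh slice per call, so it is safe to call
    // from multiple goroutines at the same time.
    func collectNodeIDs(pieces []NodeID) []NodeID {
        ids := make([]NodeID, 0, len(pieces))
        return append(ids, pieces...)
    }

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            pieces := []NodeID{NodeID(i), NodeID(i + 100)}
            wg.Add(1)
            go func(pieces []NodeID) {
                defer wg.Done()
                fmt.Println(collectNodeIDs(pieces))
            }(pieces)
        }
        wg.Wait()
    }
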
Michal Niewrzal
128b0a86e3 satellite/repair/repairer: repair pieces out of placement
The segment repairer should take the segment 'placement' field into account
and remove or repair pieces stored on nodes that are outside this placement.

If, after discounting pieces that are out of placement, we are still above
the repair threshold, we only update the segment's pieces to remove the
problematic ones. Otherwise we do a regular repair.

https://github.com/storj/storj/issues/5896

Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59
2023-06-05 14:48:36 +00:00
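
A rough sketch of the decision described above, using simplified stand-in types rather than the repairer's actual ones: if enough in-placement pieces remain, only the out-of-placement pieces are removed; otherwise a regular repair is triggered.

    package main

    import "fmt"

    // Piece and Segment are simplified stand-ins for the metabase types.
    type Piece struct {
        NodeID      int
        InPlacement bool
    }

    type Segment struct {
        Pieces          []Piece
        RepairThreshold int
    }

    // planPlacementRepair decides whether out-of-placement pieces can simply
    // be dropped, or whether a full repair is needed.
    func planPlacementRepair(seg Segment) (drop []Piece, needsRepair bool) {
        inPlacement := 0
        for _, p := range seg.Pieces {
            if p.InPlacement {
                inPlacement++
            } else {
                drop = append(drop, p)
            }
        }
        // Enough in-placement pieces remain: just remove the problematic
        // pieces from the segment. Otherwise, do a regular repair.
        if inPlacement >= seg.RepairThreshold {
            return drop, false
        }
        return nil, true
    }

    func main() {
        seg := Segment{
            Pieces:          []Piece{{1, true}, {2, true}, {3, false}, {4, true}},
            RepairThreshold: 3,
        }
        drop, repair := planPlacementRepair(seg)
        fmt.Println("drop:", drop, "needs repair:", repair)
    }
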
Michal Niewrzal
eabd9dd994 satellite/orders: remove unused argument
Change-Id: I6c5221fc19f97ae6db5627d7239795ff663289e0
2023-05-22 14:35:08 +00:00
paul cannon
de737bdee9 satellite/repair: add flag for de-clumping behavior
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.

Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.

Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
2023-05-18 21:02:36 +00:00
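
The pattern is simply gating the behavior behind a boolean option; a minimal sketch with an illustrative flag name (not necessarily the satellite's actual configuration key):

    package main

    import (
        "flag"
        "fmt"
    )

    // doDeclumping gates the "repair clumped pieces" behavior so it can be
    // disabled while the clumping detection is investigated.
    var doDeclumping = flag.Bool("repairer.do-declumping", false,
        "repair pieces that are clumped on the same network to other nodes")

    func main() {
        flag.Parse()
        if *doDeclumping {
            fmt.Println("checking for clumped pieces during repair")
        } else {
            fmt.Println("de-clumping disabled")
        }
    }
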
paul cannon
915f3952af satellite/repair: repair pieces on the same last_net
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.

This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).

Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
2023-04-06 17:34:25 +00:00
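
Detecting clumped pieces amounts to grouping a segment's pieces by their nodes' last_net and flagging any network that appears more than once. A simplified sketch, with stand-in types:

    package main

    import "fmt"

    // piece associates a piece number with the /24 (or /64) network of its node.
    type piece struct {
        Num     int
        LastNet string
    }

    // findClumped returns pieces that share a last_net with an earlier piece,
    // i.e. pieces that are "clumped" onto the same network.
    func findClumped(pieces []piece) []piece {
        seen := make(map[string]bool)
        var clumped []piece
        for _, p := range pieces {
            if seen[p.LastNet] {
                clumped = append(clumped, p)
                continue
            }
            seen[p.LastNet] = true
        }
        return clumped
    }

    func main() {
        pieces := []piece{
            {0, "10.1.2.0"},
            {1, "10.1.3.0"},
            {2, "10.1.2.0"}, // same /24 as piece 0 -> clumped
        }
        fmt.Println(findClumped(pieces))
    }
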
paul cannon
9e6955cc17 satellite/repair: fix flaky TestFailedDataRepair and friends
The following tests should be made less flaky by this change:

- TestFailedDataRepair
- TestOfflineNodeDataRepair
- TestUnknownErrorDataRepair
- TestMissingPieceDataRepair_Succeed
- TestMissingPieceDataRepair
- TestCorruptDataRepair_Succeed
- TestCorruptDataRepair_Failed

This follows on from a change in commit 6bb64796. Nearly all tests in the
repair suite used to rely on events happening in a certain order. After
some of our performance work, those things no longer happen in that
expected order every time. This caused much flakiness.

The fix in 6bb64796 was sufficient for the tests operating directly on
an `*ECRepairer` instance, but not for the tests that make use of the
repairer by way of the repair queue and the repair worker. These tests
needed a different way to indicate the number of expected failures. This
change provides that different way.

Refs: https://github.com/storj/storj/issues/5736
Refs: https://github.com/storj/storj/issues/5718
Refs: https://github.com/storj/storj/issues/5715
Refs: https://github.com/storj/storj/issues/5609
Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e
2023-04-04 18:08:52 +00:00
Michal Niewrzal
16b7901fde satellite/metabase: add piece size calculation to segment
This code is essentially a replacement for eestream.CalcPieceSize. To call
eestream.CalcPieceSize we need an eestream.RedundancyStrategy, which is not
trivial to obtain because it requires an infectious.FEC. For example,
infectious.FEC creation is visible on the GE loop observer CPU profile
because we were doing this for each segment in the DB.

A new method was added to storj.Redundancy, and here we just wire it up
with the metabase Segment.

BenchmarkSegmentPieceSize
BenchmarkSegmentPieceSize/eestream.CalcPieceSize
BenchmarkSegmentPieceSize/eestream.CalcPieceSize-8         	    5822	    189189 ns/op	    9776 B/op	       8 allocs/op
BenchmarkSegmentPieceSize/segment.PieceSize
BenchmarkSegmentPieceSize/segment.PieceSize-8              	94721329	        11.49 ns/op	       0 B/op	       0 allocs/op

Change-Id: I5a8b4237aedd1424c54ed0af448061a236b00295
2023-02-22 11:04:02 +00:00
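
For context, the piece-size arithmetic works roughly as follows (assuming the usual layout where a stripe is shareSize × requiredShares bytes and each piece stores one share per stripe); this is a sketch of the formula, not the new storj.Redundancy method itself:

    package main

    import "fmt"

    // pieceSize computes the size of a single erasure-coded piece for a
    // segment: the segment is split into stripes of (shareSize * requiredShares)
    // bytes, and each piece stores one share per stripe.
    func pieceSize(segmentSize int64, shareSize, requiredShares int32) int64 {
        stripeSize := int64(shareSize) * int64(requiredShares)
        stripes := (segmentSize + stripeSize - 1) / stripeSize // ceiling division
        return stripes * int64(shareSize)
    }

    func main() {
        // Example: 64 MiB segment, 256 B erasure shares, 29 required shares.
        fmt.Println(pieceSize(64<<20, 256, 29))
    }
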
paul cannon
6bb6479690 satellite/repair: fix flakiness in tests
Several tests using `(*ECRepairer).Get()` have begun to exhibit flaky
results. The tests are expecting to see failures in certain cases, but
the failures are not present. It appears that the cause of this is that,
sometimes, the fastest good nodes are able to satisfy the repairer
(providing RequiredCount pieces) before the repairer is able to identify
the problem scenario we have laid out.

In this commit, we add an argument to `(*ECRepairer).Get()` which
specifies how many failure results are expected. In normal/production
conditions, this parameter will be 0, meaning Get need not wait for
any errors and should only report those that arrived while waiting for
RequiredCount pieces (the existing behavior). But in these tests, we can
request that Get() wait for enough results to see the errors we are
expecting.

Refs: https://github.com/storj/storj/issues/5593
Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8
2023-02-16 07:33:47 +00:00
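
The idea can be sketched as follows: keep collecting piece results until both the required number of successes and the expected number of failures have arrived. This is only an illustration of the waiting logic, not ECRepairer's real signature:

    package main

    import (
        "errors"
        "fmt"
    )

    // get simulates collecting piece download results until we have enough
    // successes AND at least minFailures errors. In production minFailures
    // would be 0 (the pre-existing behavior); tests that expect specific
    // failures pass a positive value so those errors have time to arrive.
    func get(results <-chan error, requiredCount, minFailures int) (successes int, failures []error) {
        for err := range results {
            if err != nil {
                failures = append(failures, err)
            } else {
                successes++
            }
            if successes >= requiredCount && len(failures) >= minFailures {
                break
            }
        }
        return successes, failures
    }

    func main() {
        results := make(chan error, 4)
        results <- nil
        results <- errors.New("piece corrupted")
        results <- nil
        results <- nil
        close(results)

        ok, errs := get(results, 2, 1)
        fmt.Println("successes:", ok, "failures:", errs)
    }
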
paul cannon
1854351da6 satellite/audit: teach Reporter about piecewise audits
The Reporter is responsible for processing results from auditing
operations, logging the results, disqualifying nodes that reached
the maximum reverification count, and passing the results on to
the reputation system.

In this commit, we extend the Reporter so that it knows how to process
the results of piecewise reverification audits.

We also change most reporter-related tests so that reverifications
happen as piecewise reverification audits, exercising the new code.

Note that piecewise reverification audits are not yet being done outside
of tests. In a later commit, we will switch from doing segmentwise
reverifications to piecewise reverifications, as part of the
audit-scaling effort.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9
2022-12-12 11:28:02 +00:00
Moby von Briesen
3501656e98 satellite/repair: Add flag to allow disabling reputation updates
Reputation updates during repair currently consume a lot of database
resources. Sometimes increasing the rate of repair is more important
than auditing a node based on whether or not it has the correct piece
during repair; that is the job of the audit service.

This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089
This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939

Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f
2022-11-24 08:31:11 -05:00
paul cannon
7f1cad6faf satellite/repair: better handling of piece fetch errors
We have an alert on `repair_too_many_nodes_failed` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the repair framework to act in that way.

Instead, in this change, we change the way that
`repair_too_many_nodes_failed` works. When a repair fails, we collect
piece fetch errors by type and determine from them whether it looks like
we are having network problems (most errors are connection failures,
possibly also some successful connections which subsequently time out)
or whether something else has happened.

We will now only emit `repair_too_many_nodes_failed` when the outcome
does not look like a network failure. In the network failure case, we
will instead emit `repair_suspected_network_problem`.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926
2022-09-23 09:35:06 +00:00
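
A simplified sketch of the classification idea: tally the fetch errors and treat the repair outcome as a suspected network problem when most of them are connection failures or timeouts. The error types and the majority rule here are illustrative assumptions, not the repairer's exact logic:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "os"
        "syscall"
    )

    // looksLikeNetworkProblem reports whether the bulk of the piece fetch
    // errors are connection failures or timeouts, suggesting a satellite-side
    // network blip rather than misbehaving nodes.
    func looksLikeNetworkProblem(fetchErrs []error) bool {
        if len(fetchErrs) == 0 {
            return false
        }
        network := 0
        for _, err := range fetchErrs {
            if errors.Is(err, syscall.ECONNREFUSED) ||
                errors.Is(err, context.DeadlineExceeded) ||
                errors.Is(err, os.ErrDeadlineExceeded) {
                network++
            }
        }
        return network*2 > len(fetchErrs) // majority of failures are network-ish
    }

    func main() {
        fetchErrs := []error{syscall.ECONNREFUSED, context.DeadlineExceeded, errors.New("piece hash mismatch")}
        if looksLikeNetworkProblem(fetchErrs) {
            fmt.Println("emit repair_suspected_network_problem")
        } else {
            fmt.Println("emit repair_too_many_nodes_failed")
        }
    }
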
paul cannon
7d0885bbaa satellite/repair: move over audit.Pieces
This structure is entirely unused within the audit module, and is only
used by repair code. Accordingly, this change moves the structure from
audit code to repair code.

Also, we take the opportunity here to rename the structure to something
less generic.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e
2022-09-22 16:43:03 +00:00
paul cannon
fd01c6cc25 satellite/{repair,audit}: simplify reputation reporter
Also, make it an interface so that the upcoming write cache can be
dropped in to the same place.

Change-Id: I2c286743825e647c0cef5b6578245391851fa10c
2022-05-10 14:04:43 +00:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
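
In other words, the repair target is bumped by the count of pieces held in excluded areas. A tiny sketch with illustrative names, not the repairer's actual API:

    package main

    import "fmt"

    // piecesToRepair bumps the normal repair target by the number of pieces
    // held in excluded areas, so the segment keeps enough healthy pieces even
    // if those nodes go away.
    func piecesToRepair(optimal, healthy, inExcludedAreas int) int {
        base := optimal - healthy
        if base < 0 {
            base = 0
        }
        return base + inExcludedAreas
    }

    func main() {
        fmt.Println(piecesToRepair(80, 60, 5)) // repair 20 pieces plus 5 extra
    }
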
paul cannon
12b3fb5fb0 cmd/satellite: add fetch-pieces command
The "satellite fetch-pieces" command allows a satellite operator to
fetch as many pieces of a segment as possible, along with their
original order limits and hashes as provided by the storage nodes. The
fetched pieces and associated info will be stored as-is in a specified
folder, rather than being RS-decoded or decrypted.

It is hoped that this will allow easier debugging of certain one-off
problems we've observed in the wild.

Change-Id: I42ae0e9ef0023538e42473a9be5a2460a3ac0f3a
2022-02-18 00:13:53 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status inconsistency

The original design had a flaw which can potentially cause a discrepancy
in a node's reputation status between the reputations table and the nodes
table. If a failure (network issue, DB failure, satellite failure, etc.)
happens between the update to the reputations table and the update to the
nodes table, the data can get out of sync.

This PR tries to fix the above issue by passing the node's reputation, as
read from the nodes table at the beginning of an audit/repair, through to
the next update in the reputation service. If the updated reputation
status from the service differs from the existing node status, the service
will try to update the nodes table. In the case of a failure, the service
will be able to try updating the nodes table again, since it can still see
the discrepancy in the data. This allows both tables to become consistent
eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
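
A loose sketch of the pass-through-and-compare idea, with made-up types; the real logic lives in the reputation service:

    package main

    import "fmt"

    // nodeStatus is a simplified stand-in for the reputation fields that are
    // mirrored in the nodes table.
    type nodeStatus struct {
        Disqualified bool
    }

    // applyAudit takes the status read from the nodes table at the start of
    // the audit/repair, computes the updated status (stand-in logic), and
    // syncs the nodes table whenever the two differ. If a previous sync
    // failed, the difference is still visible on the next audit, so the two
    // tables converge eventually.
    func applyAudit(fromNodesTable nodeStatus, auditFailed bool, syncNodesTable func(nodeStatus) error) nodeStatus {
        updated := fromNodesTable
        if auditFailed {
            updated.Disqualified = true // placeholder for the real reputation math
        }
        if updated != fromNodesTable {
            if err := syncNodesTable(updated); err != nil {
                fmt.Println("nodes table update failed; will retry on next audit:", err)
            }
        }
        return updated
    }

    func main() {
        start := nodeStatus{}
        applyAudit(start, true, func(s nodeStatus) error {
            fmt.Println("updating nodes table:", s)
            return nil
        })
    }
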
Yingrong Zhao
336500c04d satellite/repair: only record audit result if segment can be downloaded
If the satellite can't find enough nodes to successfully download a segment,
it is probably not the fault of the storage nodes.

Change-Id: I681f66056df0bb940da9edb3a7dbb3658c0a56cb
2021-11-17 15:25:43 +00:00
Artur M. Wolff
89639199fe satellite/repair/repairer: remove unused healthyMap
This change removes unused healthyMap from (*SegmentRepairer).Repair.

Change-Id: Ie80eefdb5b7125bf70986cb13462eee737af214c
2021-11-16 16:49:18 +01:00
Yingrong Zhao
35e4a87e60 satellite/repair: ignore expired segments at the beginning of the repair work

Since we have changed the repair worker to also mark a node as an audit
failure if it returns a not-found error, we should ignore expired
segments when possible.

Change-Id: Ie6a677e1d7b234e93965c736d05950440236653c
2021-10-18 18:15:39 +00:00
Yaroslav Vorobiov
4b79f5ea86 satellite/repair: test if audit scores increases during repair
Update repair tests to check that audit scores increase for nodes
that successfully send pieces during successful and failed repairs.

Change-Id: Ie6abbde6155ab4697d209366c9fa497e731756e9
2021-10-04 19:39:13 +00:00
Yaroslav Vorobiov
469ae72c19 satellite/repair: update audit records during repair
Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0
2021-09-24 00:48:13 +00:00
Cameron Ayer
51fdceafef satellite/repair: increment repair_too_many_nodes_failed with 0 for redash alerting
Change-Id: I990c8df7be30493705278b24954262834a1ed81f
2021-08-27 17:42:11 +00:00
Cameron Ayer
26f839a445 satellite/repair/repairer: if not enough nodes for repair order limits, increment metric and log as irreparable segment
Change-Id: I4bd46f28d64278c8d463e885ad221aafb6ce7cf3
2021-08-27 13:42:28 +00:00
Cameron Ayer
a8f125c671 satellite:{audit,repair}: log additional info when we can't download enough pieces
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.

In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.

Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
2021-08-09 22:57:49 +00:00
Clement Sam
1f353f3231 segment/{metabase,repair}: change segment created_at column to not accept nulls
This change adds a NOT NULL constraint to the created_at column in the segment table.
All occurrences of CreatedAt as a pointer are changed to the non-pointer version (metabase, segment loop, etc.).

Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52
2021-08-06 08:16:28 +00:00
Clement Sam
f06e7c5f60 segment/{metabase,repair}: add dedicated methods on metabase.Pieces
This change adds dedicated methods on metabase.Pieces to add and remove pieces, and to check for duplicates.

Change-Id: I21aaeff40c017c2ebe1cc85a864ae546754769cc
2021-08-03 15:12:03 +00:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Cameron Ayer
449c873681 satellite/repair/repairer: attempt repair GETs using nodes' last IP and port first
Sometimes we see timeouts from DNS lookups when trying to do
repair GETs. Solution: try using the node's last known IP and port first.
If we can't connect, retry with a DNS lookup.

Change-Id: I59e223aebb436118779fb18378f6e09d072f12be
2021-07-21 13:13:06 +00:00
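
A minimal sketch of the dialing order, using the standard net package; the addresses and helper name are placeholders for what the overlay would supply:

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"
    )

    // dialPreferLastIP tries the node's cached last IP:port first, avoiding a
    // DNS lookup; if that fails, it falls back to dialing the DNS address.
    func dialPreferLastIP(ctx context.Context, lastIPPort, dnsAddr string) (net.Conn, error) {
        d := net.Dialer{Timeout: 5 * time.Second}
        if lastIPPort != "" {
            if conn, err := d.DialContext(ctx, "tcp", lastIPPort); err == nil {
                return conn, nil
            }
        }
        return d.DialContext(ctx, "tcp", dnsAddr)
    }

    func main() {
        conn, err := dialPreferLastIP(context.Background(), "203.0.113.7:28967", "node.example.com:28967")
        if err != nil {
            fmt.Println("dial failed:", err)
            return
        }
        defer conn.Close()
        fmt.Println("connected to", conn.RemoteAddr())
    }
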
Cameron Ayer
373ba8fd27 satellite/repair/repairer: metrics for repair bytes uploaded and downloaded
Change-Id: Icb0850692ecc155f6c8169edf1b045b2b546ff48
2021-07-21 09:23:19 +00:00
Michał Niewrzał
d53aacc058 satellite/repair: migrate to new repair_queue table
We want to use StreamID/Position to identify an injured
segment. As it is hard to alter the existing injuredsegments
table, we are adding a new table that will replace the existing
one. The old table will be dropped later.

Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
2021-06-30 17:12:24 +02:00
Michał Niewrzał
a93e47514a satellite: remove irreparabledb
This is part of the metaloop refactoring. We planned to remove
irreparable at some point, but there was no time for it.
Now, instead of refactoring it for segmentloop, it's just easier
to drop it.

Later we still need to drop the table with a migration step.

Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
2021-06-17 07:20:15 +00:00
Egon Elbre
10372afbe4 ci: fix lint errors
Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf
2021-05-17 13:37:31 +00:00
Egon Elbre
910eec8eee satellite/metainfo: remove MetabaseDB interface
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.

Unfortunately, this requires adding a cleanup callback for tests; there
might be a better solution to this problem.

Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
2021-05-13 13:22:14 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
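
Using the zeebo/errs package that the codebase relies on, the difference looks roughly like this (the output strings in the comments are approximate):

    package main

    import (
        "fmt"

        "github.com/zeebo/errs"
    )

    // Before: the class name itself contains "error", so wrapped errors stutter.
    var oldClass = errs.Class("node stats service error")

    // After: a terse class name keeps the log line readable.
    var Error = errs.Class("nodestats")

    func main() {
        base := fmt.Errorf("no rows")
        fmt.Println(oldClass.Wrap(base)) // roughly "node stats service error: no rows"
        fmt.Println(Error.Wrap(base))    // roughly "nodestats: no rows"
    }
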
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Michał Niewrzał
141444f6d6 satellite/repair/repairer: fix segmentAge metric
Change-Id: I146b3163aa1bfab5ee060298e6bf9822ca6820a0
2021-03-29 12:29:47 +00:00
Michał Niewrzał
27ae0d1f15 satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method
At some point we might try to change a segment's original RS values and set Pieces according to the new values. This change adds a NewRedundancy parameter to the UpdateSegmentPieces method to give the ability to do that. As part of the change, NewPieces are validated against NewRedundancy.

Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576
2021-03-22 08:12:56 +00:00
Michał Niewrzał
908a96ae30 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I075aaff42ca3f5dc538356cedfccd5939c75e791
2021-02-11 11:48:23 +01:00
Cameron Ayer
4a797baa73 satellite/repair/repairer: a new set of rs_scheme tagged metrics
Change-Id: Ibecd9265da881247eeb85ba185ee8877a7243777
2021-02-09 14:19:22 +00:00
Kaloyan Raev
038bd0a4da satellite/repair/repairer: fix repair for pending objects
https://storjlabs.atlassian.net/browse/PG-160

Change-Id: Ice7a0dcfc591bcde85a355cf95fff1eb3411f508
2021-02-02 19:50:10 +02:00
Kaloyan Raev
6f3d0c4ad5 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/repair/repair_test.go
	satellite/repair/repairer/segments.go

Change-Id: Ie51a56878bee84ad9f2d31135f984881a882e906
2021-02-02 19:19:04 +02:00
Kaloyan Raev
339d1212cd satellite/repair: don't remove expired segments from repair queue
It's impossible to time this check correctly. The segment may expire
just at the time we upload the repaired pieces to new storage nodes.
They will reject the pieces as expired and the repair will fail.

Also, we penalize storage nodes with an audit failure only if they fail
piece hash verification, i.e. return incorrect data, but not if they
have already deleted the piece.

So, it would be best if the repair service does not care about object
expiration at all. This is a responsibility of another service.

Removing this check will also simplify how we migrate this code
correctly to the metabase.

Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2
2021-02-02 16:13:01 +00:00
Kaloyan Raev
bafc6af992 ci: remove workaround for failing tests
Change-Id: I3eb673fae6c81bee17d7437cb870d5f5ba6978d5
2020-12-21 18:07:40 +02:00
Michal Niewrzal
f7a31308db satellite/repair: enable TestRemoveExpiredSegmentFromQueue test
This change adds the ability to set the `now` time during repair tests.

Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f
2020-12-18 10:58:05 +00:00
Michal Niewrzal
8d3ea9c251 satellite/repair/repairer: implement SegmentRepairer with metabase
Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c
2020-12-17 10:47:21 +00:00
Moby von Briesen
575f50df84 satellite/repair: Update repair override config to support multiple RS schemes.
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.

The new format for RS override values is
"k/o/n-override,k/o/n-override..."

Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
2020-11-23 18:01:15 +00:00
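
A small sketch of parsing that format into per-scheme overrides; this is illustrative only, not the satellite's actual config parser:

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parseOverrides parses override values in the "k/o/n-override,..." format
    // into a map keyed by the RS scheme string.
    func parseOverrides(s string) (map[string]int, error) {
        out := make(map[string]int)
        for _, entry := range strings.Split(s, ",") {
            scheme, value, ok := strings.Cut(entry, "-")
            if !ok {
                return nil, fmt.Errorf("invalid override entry %q", entry)
            }
            override, err := strconv.Atoi(value)
            if err != nil {
                return nil, err
            }
            out[scheme] = override
        }
        return out, nil
    }

    func main() {
        overrides, err := parseOverrides("29/80/110-52,16/37/50-30")
        if err != nil {
            panic(err)
        }
        fmt.Println(overrides)
    }
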
paul cannon
2b59640f18 cmd/satellite: ignore Canceled in exit from repair worker
Firstly, this changes the repair functionality to return Canceled errors
when a repair is canceled during the Get phase. Previously, because we
do not track individual errors per piece, this would just show up as a
failure to download enough pieces to repair the segment, which would
cause the segment to be added to the IrreparableDB, which is entirely
unhelpful.

Then, ignore Canceled errors in the return value of the repair worker.
Apparently, when the worker returns an error, that makes Cobra exit the
program with a nonzero exit code, which causes some piece of our
deployment automation to freak out and page people. And when we ask the
repair worker to shut down, "canceled" errors are what we _expect_, not
an error case.

Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435
2020-11-17 21:37:59 +00:00
Cameron Ayer
da9f1f0611 satellite/repair: add monkit counter for segments below minimum required
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.

Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
2020-11-11 12:48:23 +00:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Egon Elbre
2268cc1df3 all: fix linter complaints
Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95
2020-10-13 15:59:01 +03:00