Commit Graph

97 Commits

Author SHA1 Message Date
Michal Niewrzal
21c1e66a85 satellite/overlay: refactor ReliabilityCache to keep more data
ReliabilityCache will be now using refactored overlay Reliable method.
This method will provide more info about nodes (e.g. country code) and
with this we are able to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder

With those new method we will fix issue where offline but reliable node
won't be checked for clumped pieces and off placement pieces.

https://github.com/storj/storj/issues/5998

Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
2023-07-05 11:19:10 +02:00
Michal Niewrzal
cb9a7bdc71 satellite/repair/repairer: make DialTimeout configurable
This change makes dial timeout configurable and change it also from
defatul 20s to 5s. Main motivation is that during repair we often loose
lots of time to dial which eventually will fail. New timeout should be
still enough to dial but we will move forward quicker to next node if
that one will fail.

Timeout is also applied directly as context timeout in case we will
use noise of tcp fast open one day.

Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea
2023-06-16 12:23:25 +00:00
Michal Niewrzal
eabd9dd994 satellite/orders: remove unsed argument
Change-Id: I6c5221fc19f97ae6db5627d7239795ff663289e0
2023-05-22 14:35:08 +00:00
paul cannon
c856d45cc0 satellite/overlay: fix GetNodesNetworkInOrder
We were using the UploadSelectionCache previously, which does _not_ have
all nodes, or even all online nodes, in it. So all nodes with less than
MinimumVersion, or with less than MinimumDiskSpace, or nodes suspended
for unknown audit errors, or nodes that have started graceful exit, were
all missing, and ended up having empty last_nets. Even with all that,
I'm kind of surprised how many nodes this involved, but using the upload
selection cache was definitely wrong.

This change uses the download selection cache instead, which excludes
nodes only when they are disqualified, gracefully exited (completely),
or offline.

Change-Id: Iaa07c988aa29c1eb05796ac48a6f19d69f5826c1
2023-05-19 08:08:08 +00:00
paul cannon
de737bdee9 satellite/repair: add flag for de-clumping behavior
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.

Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.

Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
2023-05-18 21:02:36 +00:00
paul cannon
958d8676d0 satellite/overlay: remove unnecessary test helper
Change-Id: I8439eec4ed440f60353fc620ca906a917a03613c
2023-05-17 17:04:54 +00:00
paul cannon
75d10fe4fa satellite/overlay: use UploadSelectionCache for GetNodesNetworkInOrder
The query for GetNodesNetworkInOrder is causing far too much load on the
database. Since it is not critical that the repair checker have
perfectly up-to-date node network information, we can use a cache
instead.

Change-Id: I07ad45bfdeb46529da093941a06c2da8a00ce878
2023-05-16 17:32:09 +00:00
igor gaidaienko
bc30deee11 satellite/repair: update repair test to use blake3 algo
Update repair test to also use blake3 hashing algorithm
https://github.com/storj/storj/issues/5649

Change-Id: Id8299576f8be4cfd84ddf9a6b852e653628ada72
2023-05-11 08:45:48 +00:00
Michal Niewrzal
36e046375c satellite/repair/checker: remove segments loop parts
We are switching completely to ranged loop.

https://github.com/storj/storj/issues/5368

Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3
2023-05-08 12:19:13 +00:00
Michal Niewrzal
1aa24b9f0d satellite/audit: remove segments loop parts
We are switching completely to ranged loop.

https://github.com/storj/storj/issues/5368

Change-Id: I9cec0ac454f40f19d52c078a8b1870c4d192bd7a
2023-04-24 15:52:11 +00:00
paul cannon
915f3952af satellite/repair: repair pieces on the same last_net
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.

This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).

Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
2023-04-06 17:34:25 +00:00
Egon Elbre
f5020de57c storagenode/blobstore: move blob store logic
The blobstore implementation is entirely related to storagenode, so the
rightful place is together with the storagenode implementation.

Fixes https://github.com/storj/storj/issues/5754

Change-Id: Ie6637b0262cf37af6c3e558556c7604d9dc3613d
2023-04-05 18:06:20 +00:00
paul cannon
9e6955cc17 satellite/repair: fix flaky TestFailedDataRepair and friends
The following tests should be made less flaky by this change:

- TestFailedDataRepair
- TestOfflineNodeDataRepair
- TestUnknownErrorDataRepair
- TestMissingPieceDataRepair_Succeed
- TestMissingPieceDataRepair
- TestCorruptDataRepair_Succeed
- TestCorruptDataRepair_Failed

This follows on to a change in commit 6bb64796. Nearly all tests in the
repair suite used to rely on events happening in a certain order. After
some of our performance work, those things no longer happen in that
expected order every time. This caused much flakiness.

The fix in 6bb64796 was sufficient for the tests operating directly on
an `*ECRepairer` instance, but not for the tests that make use of the
repairer by way of the repair queue and the repair worker. These tests
needed a different way to indicate the number of expected failures. This
change provides that different way.

Refs: https://github.com/storj/storj/issues/5736
Refs: https://github.com/storj/storj/issues/5718
Refs: https://github.com/storj/storj/issues/5715
Refs: https://github.com/storj/storj/issues/5609
Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e
2023-04-04 18:08:52 +00:00
paul cannon
6bb6479690 satellite/repair: fix flakiness in tests
Several tests using `(*ECRepairer).Get()` have begun to exhibit flaky
results. The tests are expecting to see failures in certain cases, but
the failures are not present. It appears that the cause of this is that,
sometimes, the fastest good nodes are able to satisfy the repairer
(providing RequiredCount pieces) before the repairer is able to identify
the problem scenario we have laid out.

In this commit, we add an argument to `(*ECRepairer).Get()` which
specifies how many failure results are expected. In normal/production
conditions, this parameter will be 0, meaning Get need not wait for
any errors and should only report those that arrived while waiting for
RequiredCount pieces (the existing behavior). But in these tests, we can
request that Get() wait for enough results to see the errors we are
expecting.

Refs: https://github.com/storj/storj/issues/5593
Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8
2023-02-16 07:33:47 +00:00
Moby von Briesen
3501656e98 satellite/repair: Add flag to allow disabling reputation updates
Reputation updates during repair currently consumes a lot of database
resources. Sometimes increasing the rate of repair is more important
than auditing a node based on whether they have or don't have the
correct piece during repair. This is the job of the audit service.

This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089
This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939

Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f
2022-11-24 08:31:11 -05:00
paul cannon
8b494f3740 satellite/audit: use db for auditor queue
As part of the effort of splitting out the auditor workers to their own
process, we are transitioning the communication between the auditor
chore and the verification workers to a queue implemented in the
database, rather than the sequence of in-memory queues we used to use.

This logical database is safely partitionable from the rest of
satelliteDB.

Refs: https://github.com/storj/storj/issues/5251

Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc
2022-11-22 14:04:00 +00:00
Cameron
74ddfab810 satellite/overlay: insert DQ event into node events in overlay.DisqualifyNode
Also, return node email from overlaycache db DisqualifyNode to be used
in node events insertion

Change-Id: I41534cf01351c1690c3966a8055c5fe6fcf0d6a6
2022-11-04 15:18:31 +00:00
Egon Elbre
8b70f969b6 all: fix nolint directives
Change-Id: I261c8b12e4961e6401cc4024fa5abc35b1a5efa6
2022-10-11 18:31:20 +00:00
paul cannon
7f1cad6faf satellite/repair: better handling of piece fetch errors
We have an alert on `repair_too_many_nodes_failed` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the repair framework to act in that way.

Instead, in this change, we change the way that
`repair_too_many_nodes_failed` works. When a repair fails, we collect
piece fetch errors by type and determine from them whether it looks like
we are having network problems (most errors are connection failures,
possibly also some successful connections which subsequently time out)
or whether something else has happened.

We will now only emit `repair_too_many_nodes_failed` when the outcome
does not look like a network failure. In the network failure case, we
will instead emit `repair_suspected_network_problem`.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926
2022-09-23 09:35:06 +00:00
paul cannon
7d0885bbaa satellite/repair: move over audit.Pieces
This structure is entirely unused within the audit module, and is only
used by repair code. Accordingly, this change moves the structure from
audit code to repair code.

Also, we take the opportunity here to rename the structure to something
less generic.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e
2022-09-22 16:43:03 +00:00
Márton Elek
4b1be6bf8e storagenode/satellite: support different piece hash algorithms
Change-Id: I3db321e79f12f3ebaa249e6c32fa37fd9615687e
2022-08-23 18:15:06 +00:00
paul cannon
0dcc0a9ee0 satellite/reputation: reconfigure lambda and alpha
This is in response to community feedback that our existing reputation
calculation is too likely to disqualify storage nodes unfairly with
extreme swings up and down.

For details and analysis, please see the data_loss_vs_dq_chance_sim.py
tool, the "tuning reputation further.ipynb" Jupyter notebook in the
storj/datascience repository, and the discussion at

    https://forum.storj.io/t/tuning-audit-scoring/14084

In brief: changing the lambda and initial-alpha parameters in this way
causes the swings in reputation to be smaller and less likely to put a
node past the disqualification threshold unfairly.

Note: this change will cause a one-time reset of all (non-disqualified)
node reputations, because the new initial alpha value of 1000 is
dramatically different, and the disqualification threshold is going to
be much higher.

Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3
2022-08-17 18:52:53 +00:00
paul cannon
37a4edbaff all: reformat comments as required by gofmt 1.19
I don't know why the go people thought this was a good idea, because
this automatic reformatting is bound to do the wrong thing sometimes,
which is very annoying. But I don't see a way to turn it off, so best to
get this change out of the way.

Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3
2022-08-10 18:24:55 +00:00
paul cannon
2f20bbf4d8 satellite/reputation: add a reputation write cache
This should lower the amount of database load coming from
reputation updates.

Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949
2022-07-14 21:40:16 +00:00
Fadila Khadar
af1f0aa943 satellite/repair: lighten tests covering excluded countries
TestSegmentInExcludedCountriesRepair and TestSegmentInExcludedCountriesRepairIrreparable are using 20 storage nodes.
This change make them use 7 by adjusting the test redundancy scheme.

Change-Id: I1a44aa8b997d6edcc9a3305fdd0dac57e4d525b5
2022-05-04 07:51:59 +00:00
Yaroslav Vorobiov
3f47d19aa6 satellite/overlay: add disqualification reason
Add disqualification reason to NodeDossier.
Extend DB.DisqualifyNode with disqualification reason.
Extend reputation Service.TestDisqualifyNode with disqualification reason.

Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d
2022-04-20 13:29:31 +00:00
Egon Elbre
566fc8ee25 satellite/repair: test inmemory/disk difference only once
We don't need to have every single test for both, only one for
each should be sufficient. For all other tests it doesn't matter
which one we use.

Change-Id: I9962206a4ee025d367332c29ea3e6bc9f0f9a1de
2022-03-29 14:08:13 +03:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status
inconsistency

The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
Yingrong Zhao
336500c04d satellite/repair: only record audit result if segment can be downloaded
If satellite can't find enough nodes to successfully download a segment,
it probably is not the fault of storage nodes.

Change-Id: I681f66056df0bb940da9edb3a7dbb3658c0a56cb
2021-11-17 15:25:43 +00:00
Yingrong Zhao
35e4a87e60 satellite/repair: ignore expired segments at the beginning of the repair
work

Since we have changed the repair worker to also mark a node as audit
failure if they return a not found error, we should ignore expired
segments when possible

Change-Id: Ie6a677e1d7b234e93965c736d05950440236653c
2021-10-18 18:15:39 +00:00
Yaroslav Vorobiov
4b79f5ea86 satellite/repair: test if audit scores increases during repair
Update repair tests to check if audit score increases for nodes
that successfully send pieces during successfull and failed repairs.

Change-Id: Ie6abbde6155ab4697d209366c9fa497e731756e9
2021-10-04 19:39:13 +00:00
Yaroslav Vorobiov
469ae72c19 satellite/repair: update audit records during repair
Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0
2021-09-24 00:48:13 +00:00
Michał Niewrzał
c258f4bbac private/testplanet: move Metabase outside Metainfo for satellite
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.

Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
2021-09-09 07:15:51 +00:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Cameron Ayer
449c873681 satellite/repair/repairer: attempt repair GETs using nodes' last IP and port first
Sometimes we see timeouts from DNS lookups when trying to do
repair GETs. Solution: try using node's last IP and port first.
If we can't connect, retry with DNS lookup.

Change-Id: I59e223aebb436118779fb18378f6e09d072f12be
2021-07-21 13:13:06 +00:00
Michał Niewrzał
d53aacc058 satellite/repair: migrate to new repair_queue table
We want to use StreamID/Position to identify injured
segment. As it is hard to alter existing injuredsegments
table we are adding a new table that will replace existing
one. Old table will be dropped later.

Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
2021-06-30 17:12:24 +02:00
Michał Niewrzał
a93e47514a satellite: remove irreparabledb
This is part of metaloop refactoring. We plan to remove
irreparable at some point but there was not time for it.
Now instead refatoring it for segmentloop its just easier
to drop it.

Later we still need to drop table with migration step.

Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
2021-06-17 07:20:15 +00:00
Egon Elbre
10372afbe4 ci: fix lint errors
Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf
2021-05-17 13:37:31 +00:00
Michał Niewrzał
7944df20d6 storj: use multipart API
Change-Id: I10b401434e3e77468d12ecd225b41689568fd197
2021-04-26 13:15:09 +00:00
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Kaloyan Raev
035c393da0 satellite: update tests to pass etag.Reader to multipart.PutObjectPart
Change-Id: Ibe99357945ae7a91f5b5d4f87b83d425c9fa84a5
2021-03-29 13:18:11 +00:00
Michał Niewrzał
9a60011774 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: Ia90f29be432e207c4125f7f955c912978eabe59a
2021-02-04 09:38:08 +01:00
Kaloyan Raev
8d25c47897 satellite/repair: fix comment in TestRepairExpiredSegment
Change-Id: Ib91e81f6ba0a7f65daed157b78f7a1a108984930
2021-02-03 10:09:49 +02:00
Kaloyan Raev
038bd0a4da satellite/repair/repairer: fix repair for pending objects
https://storjlabs.atlassian.net/browse/PG-160

Change-Id: Ice7a0dcfc591bcde85a355cf95fff1eb3411f508
2021-02-02 19:50:10 +02:00
Kaloyan Raev
6f3d0c4ad5 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/repair/repair_test.go
	satellite/repair/repairer/segments.go

Change-Id: Ie51a56878bee84ad9f2d31135f984881a882e906
2021-02-02 19:19:04 +02:00
Kaloyan Raev
339d1212cd satellite/repair: don't remove expired segments from repair queue
It's impossible to time correctly this check. The segment may expire
just at the time we upload the repaired pieces to new storage nodes.
They will reject this as expired and the repair will fail.

Also, we penalize storage nodes with audit failure only if they fail
piece hash verification, i.e. return incorrect data, but only if they
have already deleted the piece.

So, it would be best if the repair service does not care about object
expiration at all. This is a responsibility of another service.

Removing this check will also simplify how we migrate this code
correctly to the metabase.

Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2
2021-02-02 16:13:01 +00:00
Michal Niewrzal
f7a31308db satellite/repair: enable TestRemoveExpiredSegmentFromQueue test
Change adds ability to set `now` time during test for repair.

Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f
2020-12-18 10:58:05 +00:00
Michal Niewrzal
70ba4deea9 satellite/repair/checker: adjust irreparable part of repair checker
Change-Id: I0732104a97ba18a5359de3966cd692677a0ff790
2020-12-17 14:11:22 +00:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00