Commit Graph

50 Commits

Author SHA1 Message Date
Michal Niewrzal
cb9a7bdc71 satellite/repair/repairer: make DialTimeout configurable
This change makes the dial timeout configurable and also changes the
default from 20s to 5s. The main motivation is that during repair we often
lose a lot of time on dials that will eventually fail. The new timeout
should still be long enough to dial, but we will move on to the next node
more quickly when a dial is going to fail.

The timeout is also applied directly as a context timeout, in case we
use Noise or TCP fast open one day.

Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea
2023-06-16 12:23:25 +00:00
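A minimal Go sketch of the idea in the commit above, assuming a config struct with a DialTimeout field (the names are illustrative, not the actual repairer configuration): the configured value is applied directly as a context timeout around the dial, so it bounds the attempt regardless of the underlying transport.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Config is a hypothetical stand-in for the repairer configuration;
// the real option name and default may differ.
type Config struct {
	DialTimeout time.Duration // e.g. 5s instead of the old 20s
}

// dialNode applies the configured timeout directly as a context timeout.
func dialNode(ctx context.Context, cfg Config, nodeURL string) error {
	ctx, cancel := context.WithTimeout(ctx, cfg.DialTimeout)
	defer cancel()

	// Placeholder for the real dial; here we only honor cancellation.
	select {
	case <-time.After(10 * time.Millisecond): // pretend the dial succeeded
		return nil
	case <-ctx.Done():
		return fmt.Errorf("dial %s: %w", nodeURL, ctx.Err())
	}
}

func main() {
	err := dialNode(context.Background(), Config{DialTimeout: 5 * time.Second}, "node-url")
	fmt.Println(err)
}
```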
paul cannon
9e6955cc17 satellite/repair: fix flaky TestFailedDataRepair and friends
The following tests should be made less flaky by this change:

- TestFailedDataRepair
- TestOfflineNodeDataRepair
- TestUnknownErrorDataRepair
- TestMissingPieceDataRepair_Succeed
- TestMissingPieceDataRepair
- TestCorruptDataRepair_Succeed
- TestCorruptDataRepair_Failed

This follows on to a change in commit 6bb64796. Nearly all tests in the
repair suite used to rely on events happening in a certain order. After
some of our performance work, those things no longer happen in that
expected order every time. This caused much flakiness.

The fix in 6bb64796 was sufficient for the tests operating directly on
an `*ECRepairer` instance, but not for the tests that make use of the
repairer by way of the repair queue and the repair worker. These tests
needed a different way to indicate the number of expected failures. This
change provides that different way.

Refs: https://github.com/storj/storj/issues/5736
Refs: https://github.com/storj/storj/issues/5718
Refs: https://github.com/storj/storj/issues/5715
Refs: https://github.com/storj/storj/issues/5609
Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e
2023-04-04 18:08:52 +00:00
JT Olio
5b0cada4b3 repairer: monitor non-nil limit amount
Change-Id: I1a7b7a4a6716783449704cd8a7823090109a14de
2023-03-06 20:39:45 +00:00
paul cannon
20bcdeb8b1 satellite/repair: fix flaky test TestECREpairerGetOffline
It was possible to get into a situation where successfulPieces ==
es.RequiredCount(), errorCount < minFailures, and inProgress == 0 (when
all the succeeding Gets completed before the failures), whereupon the
last goroutine in the limiter would sit and wait forever for another
goroutine to finish.

This change corrects the handling of that situation.

As an aside, this is really pretty confusing code and we should think
about redoing the whole function.

Change-Id: Ifa3d3ad92bc755e563fd06b2aa01ef6147075a69
2023-02-24 09:05:21 -06:00
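A minimal sketch of the wake-up condition the fix above implies (names are illustrative, not the actual ECRepairer internals): a waiter must also stop once nothing is in flight, not only when both thresholds are met.

```go
package limitersketch

// shouldStopWaiting sketches the corrected stopping condition. Without the
// inProgress == 0 clause, a goroutine could wait forever for failures that
// will never arrive once all downloads have completed.
func shouldStopWaiting(successfulPieces, requiredCount, errorCount, minFailures, inProgress int) bool {
	enoughSuccesses := successfulPieces >= requiredCount
	enoughFailures := errorCount >= minFailures
	return (enoughSuccesses && enoughFailures) || inProgress == 0
}
```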
paul cannon
6bb6479690 satellite/repair: fix flakiness in tests
Several tests using `(*ECRepairer).Get()` have begun to exhibit flaky
results. The tests are expecting to see failures in certain cases, but
the failures are not present. It appears that the cause of this is that,
sometimes, the fastest good nodes are able to satisfy the repairer
(providing RequiredCount pieces) before the repairer is able to identify
the problem scenario we have laid out.

In this commit, we add an argument to `(*ECRepairer).Get()` which
specifies how many failure results are expected. In normal/production
conditions, this parameter will be 0, meaning Get need not wait for
any errors and should only report those that arrived while waiting for
RequiredCount pieces (the existing behavior). But in these tests, we can
request that Get() wait for enough results to see the errors we are
expecting.

Refs: https://github.com/storj/storj/issues/5593
Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8
2023-02-16 07:33:47 +00:00
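A simplified sketch, using a condition variable, of how a minFailures-style parameter changes the stopping condition described in the commit above; the types and method names are hypothetical, not the actual ECRepairer code.

```go
package getsketch

import "sync"

// getResults collects download outcomes; callers wait until enough successes
// and, in tests, enough expected failures have been observed.
type getResults struct {
	mu         sync.Mutex
	cond       *sync.Cond
	successes  int
	failures   int
	inProgress int
}

func newGetResults(inProgress int) *getResults {
	r := &getResults{inProgress: inProgress}
	r.cond = sync.NewCond(&r.mu)
	return r
}

// report is called once per piece download, successful or not.
func (r *getResults) report(success bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if success {
		r.successes++
	} else {
		r.failures++
	}
	r.inProgress--
	r.cond.Broadcast()
}

// wait blocks until enough successes and expected failures have been seen,
// or until no downloads remain in flight. With minFailures == 0 (production),
// it returns as soon as requiredCount pieces have arrived.
func (r *getResults) wait(requiredCount, minFailures int) (successes, failures int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for !(r.successes >= requiredCount && r.failures >= minFailures) && r.inProgress > 0 {
		r.cond.Wait()
	}
	return r.successes, r.failures
}
```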
JT Olio
58a9c55f36 mod: bump dependencies
- storj.io/common

Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8
2022-10-19 17:01:53 +00:00
Egon Elbre
ff22fc7ddd all: fix deprecated ioutil commands
Change-Id: I59db35116ec7215a1b8e2ae7dbd319fa099adfac
2022-10-11 15:27:29 +00:00
paul cannon
7f1cad6faf satellite/repair: better handling of piece fetch errors
We have an alert on `repair_too_many_nodes_failed` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the repair framework to act in that way.

Instead, in this change, we change the way that
`repair_too_many_nodes_failed` works. When a repair fails, we collect
piece fetch errors by type and determine from them whether it looks like
we are having network problems (most errors are connection failures,
possibly also some successful connections which subsequently time out)
or whether something else has happened.

We will now only emit `repair_too_many_nodes_failed` when the outcome
does not look like a network failure. In the network failure case, we
will instead emit `repair_suspected_network_problem`.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926
2022-09-23 09:35:06 +00:00
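A hedged sketch of the classification step the commit above describes, assuming a simple majority heuristic and standard-library error checks; the real repairer's thresholds and error categories likely differ.

```go
package classifysketch

import (
	"errors"
	"net"
	"os"
)

// classifyFetchErrors tallies piece fetch errors and decides whether a failed
// repair looks like a satellite-side network problem or something else, and
// therefore which metric name should be emitted.
func classifyFetchErrors(fetchErrs []error) (metric string) {
	var networkish int
	for _, err := range fetchErrs {
		var netErr net.Error
		switch {
		case errors.Is(err, os.ErrDeadlineExceeded):
			networkish++
		case errors.As(err, &netErr) && netErr.Timeout():
			networkish++
		}
	}
	// If most failures are connection/timeout problems, suspect our own
	// network rather than the storage nodes.
	if len(fetchErrs) > 0 && networkish*2 >= len(fetchErrs) {
		return "repair_suspected_network_problem"
	}
	return "repair_too_many_nodes_failed"
}
```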
paul cannon
7d0885bbaa satellite/repair: move over audit.Pieces
This structure is entirely unused within the audit module, and is only
used by repair code. Accordingly, this change moves the structure from
audit code to repair code.

Also, we take the opportunity here to rename the structure to something
less generic.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e
2022-09-22 16:43:03 +00:00
Márton Elek
4b1be6bf8e storagenode/satellite: support different piece hash algorithms
Change-Id: I3db321e79f12f3ebaa249e6c32fa37fd9615687e
2022-08-23 18:15:06 +00:00
paul cannon
726c95160b satellite/repair: avoid retrying GET_REPAIR incorrectly
We retry a GET_REPAIR operation in one case, and one case only (as far
as I can determine): when we are trying to connect to a node using its
last known working IP and port combination rather than its supplied
hostname, and we think the operation failed the first time because of a
Dial failure.

However, logs collected from storage node operators along with logs
collected from satellites are strongly indicating that we are retrying
GET_REPAIR operations in some cases even when we succeeded in connecting
to the node the first time. This results in the node complaining loudly
about being given a duplicate order limit (as it should), whereupon the
satellite counts that as an unknown error and potentially penalizes the
node.

See the discussion at
https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36

Investigation into this problem has revealed that
`!piecestore.CloseError.Has(err)` may not be the best way of determining
whether a problem occurred during Dial. In fact, it is probably
downright Wrong. Handling of errors on a stream is somewhat complicated,
but it would appear that there are several paths by which an RPC error
originating on the remote side might show up during the Close() call,
and would thus be labeled as a "CloseError".

This change creates a new error class, repairer.ErrDialFailed, with
which we will now wrap errors that _really definitely_ occurred during
a Dial call. We will use this class to determine whether or not to retry
a GET_REPAIR operation. The error will still also be wrapped with
whatever wrapper classes it used to be wrapped with, so the potential
for breakage here should be minimal.

Refs: https://github.com/storj/storj/issues/4687
Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb
2022-07-18 05:11:56 +00:00
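A sketch of the retry gate using a zeebo/errs Class like the ErrDialFailed described in the commit above; the helper functions here are placeholders, not the actual repairer code.

```go
package dialerrsketch

import (
	"errors"

	"github.com/zeebo/errs"
)

// ErrDialFailed wraps errors that really occurred during Dial, so later code
// can decide whether a second attempt via hostname/DNS makes sense.
var ErrDialFailed = errs.Class("dial failed")

// dial and transfer are placeholders standing in for the real piecestore calls.
func dial(addr string) error     { return errors.New("connection refused") }
func transfer(addr string) error { return nil }

// downloadPiece wraps only genuine dial failures; errors after a successful
// connection (e.g. during stream Close) pass through unwrapped.
func downloadPiece(addr string) error {
	if err := dial(addr); err != nil {
		return ErrDialFailed.Wrap(err)
	}
	return transfer(addr)
}

// getWithRetry retries through the hostname only when the first attempt truly
// failed to dial; otherwise retrying would hand the node a duplicate order limit.
func getWithRetry(lastKnownAddr, hostname string) error {
	err := downloadPiece(lastKnownAddr)
	if err == nil || !ErrDialFailed.Has(err) {
		return err
	}
	return downloadPiece(hostname)
}
```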
paul cannon
985ccbe721 satellite/repair: in dns redial, don't retry if CloseError
To save load on DNS servers, the repair code first tries to dial the
last known good ip and port for a node, and then falls back to a DNS
lookup only if we fail to connect to the last known good ip and port.

However, it looks like we are seeing errors during the client stream
Close() call (probably due to quic-go code), and those are classified
the same as errors encountered during Dial. The repairer code sees this
error, assumes that we failed to contact the node, and retries- but
since we did actually succeed in connecting the first time around, this
results in submitting the same order limit (with the same serial number)
to the storage node, which (rightfully) rejects it.

So together with change I055c186d5fd4e79560f67763175bc3130b9bc7d2 in
storj/uplink, this should avoid the double submission and avoid dinging
nodes' suspension scores unfairly.

See https://github.com/storj/storj/issues/4687.

Also, moving the testsuite directory check up above check-monkit in the
Jenkins Lint task, so that a non-tidy testsuite/go.mod can be recognized
and handled before everything breaks weirdly and seemingly randomly
later on.

Change-Id: Icb2b05aaff921d0af6aba10e450ac7e0a7bb2655
2022-04-04 17:01:09 +00:00
Márton Elek
b3675c14d4 repairer: log piece id in case of a repair error
Change-Id: Ia8da2da491a6674f669e62148fa42538278119ba
2022-03-09 17:34:14 +00:00
paul cannon
12b3fb5fb0 cmd/satellite: add fetch-pieces command
The "satellite fetch-pieces" command allows a satellite operator to
fetch as many pieces of a segment as possible, along with their
original order limits and hashes as provided by the storage nodes. The
fetched pieces and associated info will be stored in a specified
folder as they are, rather than being RS-decoded or decrypted.

It is hoped that this will allow easier debugging of certain one-off
problems we've observed in the wild.

Change-Id: I42ae0e9ef0023538e42473a9be5a2460a3ac0f3a
2022-02-18 00:13:53 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status inconsistency

The original design had a flaw which can potentially cause a discrepancy
between a node's reputation status in the reputations table and in the
nodes table. If a failure (network issue, db failure, satellite failure, etc.)
happens between the update to the reputations table and the update to the
nodes table, the data can be out of sync.
This PR tries to fix the above issue by passing the node's reputation
(read from the nodes table) from the beginning of an audit/repair through
to the next update in the reputation service. If the updated reputation status
from the service is different from the existing node status, the service will
try to update the nodes table. In the case of a failure, the service will be
able to try the nodes table update again, since it can still see the
discrepancy in the data. This will allow both tables to be in sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
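A sketch of the pass-through flow described above, under the assumption of a simplified status struct and a hypothetical DB interface; the real reputation service carries more fields and uses different method names.

```go
package reputationsketch

import "context"

// ReputationStatus is a simplified stand-in for the node status columns that
// must stay consistent between the reputations and nodes tables.
type ReputationStatus struct {
	Disqualified bool
	Suspended    bool
}

// DB is a hypothetical interface over the two tables involved.
type DB interface {
	UpdateReputation(ctx context.Context, nodeID string, success bool) (ReputationStatus, error)
	UpdateNodeStatus(ctx context.Context, nodeID string, status ReputationStatus) error
}

// ApplyAuditResult carries the status read from the nodes table at the start
// of the audit/repair and only writes the nodes table when the freshly
// computed status differs. If that write fails, the next audit still sees the
// discrepancy and retries it, so the two tables converge eventually.
func ApplyAuditResult(ctx context.Context, db DB, nodeID string, statusFromNodesTable ReputationStatus, auditSuccess bool) error {
	updated, err := db.UpdateReputation(ctx, nodeID, auditSuccess)
	if err != nil {
		return err
	}
	if updated != statusFromNodesTable {
		return db.UpdateNodeStatus(ctx, nodeID, updated)
	}
	return nil
}
```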
Yaroslav Vorobiov
469ae72c19 satellite/repair: update audit records during repair
Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0
2021-09-24 00:48:13 +00:00
Cameron Ayer
dc69e1b16e satellite/repair: use mutex instead of channel to collect download errors
Change-Id: I3f958e9cc95126a25f73ccd105e614b51089edc5
2021-08-10 15:29:39 +00:00
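A minimal sketch of the mutex-based collection pattern this commit switches to; the type is illustrative, not the repairer's actual struct.

```go
package errcollect

import "sync"

// errorCollector lets download goroutines append their failures under a
// mutex instead of sending them over a channel that someone must drain.
type errorCollector struct {
	mu   sync.Mutex
	errs []error
}

func (c *errorCollector) add(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, err)
}

// all returns a copy of the collected errors for logging.
func (c *errorCollector) all() []error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]error(nil), c.errs...)
}
```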
Cameron Ayer
a8f125c671 satellite:{audit,repair}: log additional info when we can't download enough pieces
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.

In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.

Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
2021-08-09 22:57:49 +00:00
Cameron Ayer
449c873681 satellite/repair/repairer: attempt repair GETs using nodes' last IP and port first
Sometimes we see timeouts from DNS lookups when trying to do
repair GETs. Solution: try using node's last IP and port first.
If we can't connect, retry with DNS lookup.

Change-Id: I59e223aebb436118779fb18378f6e09d072f12be
2021-07-21 13:13:06 +00:00
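A small sketch of the dial order described above, using a plain net.Dialer purely for illustration (the real code goes through the rpc dialer): last known IP:port first, and only the fallback attempt triggers a DNS lookup.

```go
package dialordersketch

import (
	"context"
	"net"
	"time"
)

// connect dials the node's last known IP:port first (no DNS involved) and
// falls back to the hostname, and therefore a DNS lookup, only when that
// first connection attempt fails.
func connect(ctx context.Context, lastIPPort, hostPort string) (net.Conn, error) {
	d := net.Dialer{Timeout: 5 * time.Second}
	conn, err := d.DialContext(ctx, "tcp", lastIPPort)
	if err == nil {
		return conn, nil
	}
	return d.DialContext(ctx, "tcp", hostPort)
}
```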
Cameron Ayer
373ba8fd27 satellite/repair/repairer: metrics for repair bytes uploaded and downloaded
Change-Id: Icb0850692ecc155f6c8169edf1b045b2b546ff48
2021-07-21 09:23:19 +00:00
Michał Niewrzał
d53aacc058 satellite/repair: migrate to new repair_queue table
We want to use StreamID/Position to identify an injured
segment. As it is hard to alter the existing injuredsegments
table, we are adding a new table that will replace the existing
one. The old table will be dropped later.

Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
2021-06-30 17:12:24 +02:00
Cameron Ayer
3ea7aa2c7a satellite/repair/repairer: log piece hash verification failures
Piece hash verification failures during repair download are considered
audit failures, but we are not logging these occurrences. Now we log
them.

Change-Id: If456cebcfda6af7a659be3d1fc74448e681fb653
2021-05-14 15:03:15 +00:00
Egon Elbre
69b149a66f mod: bump uplink
uplink stopped using zap, hence some of the private methods needed to be
changed.

Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5
2021-05-06 14:48:36 +00:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
JT Olio
b872fe52a1 satellite/repair: switch to piecestore.UploadReader
Change-Id: Ia99ad2cf5422e6ba1d98b32946740f9cadba7b6d
2020-09-01 09:26:54 -06:00
Egon Elbre
d8dcae3075 all: fix error checking
Change-Id: Ia0da1bbd6ce695139922f94096c2419281905e32
2020-07-16 19:13:14 +03:00
Moby von Briesen
e7e69f383a satellite/repair/repairer/ec.go: Add monkit tracing for ec repairer
Adds monkit tracing for ecrepairer.downloadAndVerifyPiece and
ecrepairer.putPiece so we can get more accurate estimates of node
performance during repair.

Change-Id: Ic05025bf3c493bb3d6f5d325d090c5b7c9e5465d
2020-05-29 14:00:45 +00:00
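For reference, a sketch of the usual monkit tracing idiom this kind of change adds; the real functions take more arguments, but the deferred Task call is what records timing and success/failure per invocation.

```go
package ecsketch

import (
	"context"

	"github.com/spacemonkeygo/monkit/v3"
)

var mon = monkit.Package()

// downloadAndVerifyPiece illustrates the tracing pattern: the deferred Task
// observes how long the call took and whether it returned an error, which is
// what gives per-function performance estimates during repair.
func downloadAndVerifyPiece(ctx context.Context) (err error) {
	defer mon.Task()(&ctx)(&err)

	// ... download the piece and verify its hash here ...
	return nil
}
```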
Moby von Briesen
acf8b72cd0 satellite/repair/repairer: cut off long tail when minimum number of required uploads is met
This will speed up the Put step of repair by not waiting to time out for
a handful of slow nodes, at the expense of a slightly less durable
pointer. It will still repair to the optimal threshold, but not every
node that is selected will end up in the pointer.

Change-Id: I02a0658e3fe6fc0383f26af0f50a065b8b11a651
2020-05-28 16:25:28 -04:00
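A sketch of long-tail cancellation under the assumption of a simple success counter; the real repairer works through its piece-putting machinery rather than bare goroutines, but the shape of the cutoff is the same.

```go
package uploadsketch

import (
	"context"
	"sync"
	"sync/atomic"
)

// uploadWithLongTailCutoff starts uploads to all selected nodes and, once
// "enough" of them succeed, cancels whatever is still running instead of
// waiting for slow nodes to finish or time out.
func uploadWithLongTailCutoff(ctx context.Context, uploads []func(context.Context) error, enough int) int {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var successes int64
	var wg sync.WaitGroup
	for _, upload := range uploads {
		wg.Add(1)
		go func(upload func(context.Context) error) {
			defer wg.Done()
			if err := upload(ctx); err == nil {
				if atomic.AddInt64(&successes, 1) >= int64(enough) {
					cancel() // cut off the long tail
				}
			}
		}(upload)
	}
	wg.Wait()
	return int(atomic.LoadInt64(&successes))
}
```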
Egon Elbre
ed627144ed all: use DialNodeURL throughout the codebase
Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c
2020-05-20 10:36:30 +00:00
Jess G
75b9a5971e
satellite: update log levels (#3851)
* satellite: update log levels

Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f

* log version on start up for every service

Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de

* Use monkit for tracking failed ip resolutions

Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2

* fix compile errors

Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf

* add request limit value to storage node rpc err

Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5

* we cant track storage node ids in monkit metrics so lets use logging to track that for expired orders

Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887

* fix build errs

Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47
2020-04-15 12:32:22 -07:00
littleskunk
048ca4558f
satellite/repair: clean up logging (#3833)
Co-authored-by: Michal Niewrzal <michal@storj.io>
2020-03-30 11:59:56 +02:00
Egon Elbre
480ea1e4b5 satellite/repair/repairer: fix temporary file handling
Change-Id: Ice1a467510737b3375c018ae37b16431c7dffe9e
2020-03-27 21:36:23 +02:00
Moby von Briesen
a933bcc99a satellite/repair/repairer/ec.go: add option for downloading pieces onto disk instead of in memory during repair
Add a flag to the satellite repairer, "InMemoryRepair", that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read back in the
upload phase of repair.

This should help with handling high repair traffic on satellites that
cannot afford to spend 64mb of memory per repair worker.

Updates tests to test repair for both in memory and to disk.

Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194
2020-03-27 16:41:00 +00:00
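A sketch of the buffering choice the InMemoryRepair flag controls, assuming a simple helper that returns either an in-memory buffer or a temp file; the real code also rewinds the file before the upload phase reads it back.

```go
package buffersketch

import (
	"bytes"
	"io"
	"os"
)

// pieceBuffer returns where the downloaded segment should be written: an
// in-memory buffer (fast, but roughly a full segment of RAM per worker) or a
// temporary file on disk (cheaper on memory), plus a cleanup function.
func pieceBuffer(inMemoryRepair bool) (io.ReadWriter, func(), error) {
	if inMemoryRepair {
		return new(bytes.Buffer), func() {}, nil
	}
	tmp, err := os.CreateTemp("", "repair-segment-*")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() {
		tmp.Close()
		os.Remove(tmp.Name())
	}
	return tmp, cleanup, nil
}
```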
paul cannon
79553059cb satellite/repair: put irreparable segments in irreparableDB
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).

Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.

This change also better differentiates some error cases from Repair()
for monitoring purposes.

Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
2020-03-09 21:45:16 +00:00
Egon Elbre
5342dd9fe6 go.mod: update uplink
Change-Id: I867a6a1eef8aa5d60bb676e5112b98c4192ce811
2020-02-21 16:08:12 +02:00
Moby von Briesen
006a2824ba satellite/repair: lock monkit stats in checker and repairer
Change-Id: Ia10fc8da0177389a500359ce51d21a5806f3f7b1
2020-01-30 14:09:56 +00:00
Egon Elbre
082ec81714
uplink: move to storj.io/uplink (#3746) 2020-01-08 15:40:19 +02:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
Yingrong Zhao
7af42e3c10
satellite/metainfo, satellite/repair, uplink/eestream: add metric for download failed due to not enough pieces available (#3665) 2019-12-04 16:24:36 -05:00
Egon Elbre
ee6c1cac8a
private: rename internal to private (#3573) 2019-11-14 21:46:15 +02:00
littleskunk
7eb6724c92
logging: unify logging around satellite ID, node ID and piece ID (#3491)
* logging: unify logging around satellite ID, node ID and piece ID

* unify segment index
2019-11-05 22:04:07 +01:00
Yingrong Zhao
bfa6699e2c
satellite/repair: add timeout for repair download from a single node(#3418) 2019-10-30 16:31:08 -04:00
Natalie Villasana
5453886231 satellite/repair, uplink/ecclient: remove unused expiration arg from ec.Repair and ec.putPiece (#3416) 2019-10-30 11:35:00 -04:00
littleskunk
2a5526fcc4
satellite/repair: reduce timeout (#3302) 2019-10-18 13:43:24 +02:00
littleskunk
6e7607239c
satellite/repair: improve logging (#3287)
* satellite/repair: improve logging

* use Stringer wherever possible
2019-10-16 17:28:56 +02:00
Cameron
eb5413ae5e
defer close piecestore in downloadAndVerifyPiece (#3192) 2019-10-06 13:41:53 -04:00
Jeff Wendling
098cbc9c67 all: use pkg/rpc instead of pkg/transport
all of the packages and tests work with both grpc and
drpc. we'll probably need to do some jenkins pipelines
to run the tests with drpc as well.

most of the changes are really due to a bit of cleanup
of the pkg/transport.Client api into an rpc.Dialer in
the spirit of a net.Dialer. now that we don't need
observers, we can pass around stateless configuration
to everything rather than stateful things that issue
observations. it also adds a DialAddressID for the
case where we don't have a pb.Node, but we do have an
address and want to assert some ID. this happened
pretty frequently, and now there's no more weird
contortions creating custom tls options, etc.

a lot of the other changes are being consistent/using
the abstractions in the rpc package to do rpc style
things like finding peer information, or checking
status codes.

Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412
2019-09-25 15:37:06 -06:00
Yingrong Zhao
b37ea864b1
satellite/repair: delete pieces that failed piece hashes verification from pointer (#3051)
* add test

* add implementation

* remove todo comments

* modify comment

* fix linting

* typo oops
2019-09-16 13:13:24 -04:00
Yingrong Zhao
95aa33c964
satellite/repair/repairer: update audit status as failed after failing piece hash verification (#2997)
* update audit status as failed for nodes that failed piece hash verification

* remove comment

* fix lint error

* add test

* fix format

* use named return value for Get

* add comments

* add a better comment

* format
2019-09-13 12:21:20 -04:00
Maximillian von Briesen
fb10815229 Repair with hashes (#2925)
* add outline for ECRepairer

* add description of process in TODO comments

* begin download/getting hash for a single piece

* verify piece hash and order limit during download

* fix download piece

* begin filling out ECRepairer.Get

* wip move ecclient.Repair to ecrepairer.Repair

* pass satellite signee into repairer

* reconstruct original stripe from pieces

* move rebuildStripe()

* calculate piece size differently, increment successful count

* fix shares slices initialization

* rename stripeData to segment

* do not pad reader in Repair()

* temp debug

* create unsafeRSScheme

* use decode reader

* rename file name to be all lowercase

* make repair downloader async

* declare condition variable inside Get method

* set downloadAndVerifyPiece's in-memory buffer to be share size

* update unusedLimits var

* address comments

* remove unnecessary comments

* move initialization of segmentRepairer to be outside of repairer service

* use ReadAll during download

* remove dots and move hashing to after validating for order limit signature

* wip test

* make sure files exactly at min threshold are repaired

* remove unused code

* use corrupt data and write back to storagenode

* only create corrupted node and piece ids once

* add comment

* address nat's comment

* fix linting and checker_test

* update comment

* add comments

* remove "copied from ecclient" comments

* add clarification comments in ec.Repair
2019-09-06 15:20:36 -04:00
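A minimal sketch of the hash verification at the heart of this change, assuming the node's claimed hash is the SHA-256 of the piece data; the real downloadAndVerifyPiece also validates the signatures on the piece hash and order limit before the piece is used to reconstruct the segment.

```go
package verifysketch

import (
	"bytes"
	"crypto/sha256"
	"errors"
)

// verifyPieceHash hashes the downloaded piece and compares it against the
// hash the storage node returned; mismatching pieces are rejected (and, per
// later commits, counted as audit failures).
func verifyPieceHash(pieceData, claimedHash []byte) error {
	sum := sha256.Sum256(pieceData)
	if !bytes.Equal(sum[:], claimedHash) {
		return errors.New("piece hash verification failed")
	}
	return nil
}
```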