storj/satellite/repair
paul cannon 726c95160b satellite/repair: avoid retrying GET_REPAIR incorrectly
We retry a GET_REPAIR operation in one case, and one case only (as far
as I can determine): when we are trying to connect to a node using its
last known working IP and port combination rather than its supplied
hostname, and we think the operation failed the first time because of a
Dial failure.

However, logs collected from storage node operators, together with logs
collected from satellites, strongly indicate that we are retrying
GET_REPAIR operations in some cases even when we succeeded in connecting
to the node the first time. The node then complains loudly about being
given a duplicate order limit (as it should), whereupon the satellite
counts that as an unknown error and potentially penalizes the node.

See discussion at
https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36

Investigation into this problem has revealed that
`!piecestore.CloseError.Has(err)` may not be the best way of determining
whether a problem occurred during Dial. In fact, it is probably
downright Wrong. Handling of errors on a stream is somewhat complicated,
but it would appear that there are several paths by which an RPC error
originating on the remote side might show up during the Close() call,
and would thus be labeled as a "CloseError".

This change creates a new error class, repairer.ErrDialFailed, with
which we will now wrap errors that _really definitely_ occurred during
a Dial call. We will use this class to determine whether or not to retry
a GET_REPAIR operation. The error will still also be wrapped with
whatever wrapper classes it used to be wrapped with, so the potential
for breakage here should be minimal.

Refs: https://github.com/storj/storj/issues/4687
Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb
2022-07-18 05:11:56 +00:00
checker           satellite/overlay: use ReadCache in Download/UploadSelectionCache  2022-07-12 13:52:48 +03:00
queue             satellite/repair/checker: buffer repair queue                      2022-05-12 16:28:05 +00:00
repairer          satellite/repair: avoid retrying GET_REPAIR incorrectly            2022-07-18 05:11:56 +00:00
priority_test.go  satellite/repair: test inmemory/disk difference only once          2022-03-29 14:08:13 +03:00
priority.go       satellite/repair: clamp totalNodes to 100 or higher                2020-12-30 10:39:14 -06:00
repair_test.go    satellite/reputation: add a reputation write cache                 2022-07-14 21:40:16 +00:00
repair.go         satellite/repair: move test files (#2649)                          2019-07-28 12:15:34 +03:00