satellite/repair: avoid retrying GET_REPAIR incorrectly

We retry a GET_REPAIR operation in one case, and one case only (as far
as I can determine): when we are trying to connect to a node using its
last known working IP and port combination rather than its supplied
hostname, and we think the operation failed the first time because of a
Dial failure.

However, logs collected from storage node operators along with logs
collected from satellites are strongly indicating that we are retrying
GET_REPAIR operations in some cases even when we succeeded in connecting
to the node the first time. This results in the node complaining loudly
about being given a duplicate order limit (as it should), whereupon the
satellite counts that as an unknown error and potentially penalizes the
node.

See discussion at
https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36
.

Investigation into this problem has revealed that
`!piecestore.CloseError.Has(err)` may not be the best way of determining
whether a problem occurred during Dial. In fact, it is probably
downright Wrong. Handling of errors on a stream is somewhat complicated,
but it would appear that there are several paths by which an RPC error
originating on the remote side might show up during the Close() call,
and would thus be labeled as a "CloseError".

This change creates a new error class, repairer.ErrDialFailed, with
which we will now wrap errors that _really definitely_ occurred during
a Dial call. We will use this class to determine whether or not to retry
a GET_REPAIR operation. The error will still also be wrapped with
whatever wrapper classes it used to be wrapped with, so the potential
for breakage here should be minimal.

Refs: https://github.com/storj/storj/issues/4687
Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb
This commit is contained in:
paul cannon 2022-07-05 22:02:49 -05:00 committed by Storj Robot
parent 2f20bbf4d8
commit 726c95160b

View File

@ -34,8 +34,13 @@ import (
"storj.io/uplink/private/piecestore"
)
// ErrPieceHashVerifyFailed is the errs class when a piece hash downloaded from storagenode fails to match the original hash.
var ErrPieceHashVerifyFailed = errs.Class("piece hashes don't match")
var (
// ErrPieceHashVerifyFailed is the errs class when a piece hash downloaded from storagenode fails to match the original hash.
ErrPieceHashVerifyFailed = errs.Class("piece hashes don't match")
// ErrDialFailed is the errs class when a failure happens during Dial.
ErrDialFailed = errs.Class("dial failure")
)
// ECRepairer allows the repairer to download, verify, and upload pieces from storagenodes.
type ECRepairer struct {
@ -58,7 +63,8 @@ func NewECRepairer(log *zap.Logger, dialer rpc.Dialer, satelliteSignee signing.S
}
func (ec *ECRepairer) dialPiecestore(ctx context.Context, n storj.NodeURL) (*piecestore.Client, error) {
return piecestore.Dial(ctx, ec.dialer, n, piecestore.DefaultConfig)
client, err := piecestore.Dial(ctx, ec.dialer, n, piecestore.DefaultConfig)
return client, ErrDialFailed.Wrap(err)
}
// Get downloads pieces from storagenodes using the provided order limits, and decodes those pieces into a segment.
@ -134,7 +140,7 @@ func (ec *ECRepairer) Get(ctx context.Context, limits []*pb.AddressedOrderLimit,
pieceReadCloser, _, _, err := ec.downloadAndVerifyPiece(ctx, limit, address, privateKey, "", pieceSize)
// if piecestore dial with last ip:port failed try again with node address
if triedLastIPPort && piecestore.Error.Has(err) && !piecestore.CloseError.Has(err) {
if triedLastIPPort && ErrDialFailed.Has(err) {
if pieceReadCloser != nil {
_ = pieceReadCloser.Close()
}