satellite/repair: avoid retrying GET_REPAIR incorrectly
We retry a GET_REPAIR operation in one case, and one case only (as far as I can determine): when we are trying to connect to a node using its last known working IP and port combination rather than its supplied hostname, and we think the operation failed the first time because of a Dial failure. However, logs collected from storage node operators along with logs collected from satellites are strongly indicating that we are retrying GET_REPAIR operations in some cases even when we succeeded in connecting to the node the first time. This results in the node complaining loudly about being given a duplicate order limit (as it should), whereupon the satellite counts that as an unknown error and potentially penalizes the node. See discussion at https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36 . Investigation into this problem has revealed that `!piecestore.CloseError.Has(err)` may not be the best way of determining whether a problem occurred during Dial. In fact, it is probably downright Wrong. Handling of errors on a stream is somewhat complicated, but it would appear that there are several paths by which an RPC error originating on the remote side might show up during the Close() call, and would thus be labeled as a "CloseError". This change creates a new error class, repairer.ErrDialFailed, with which we will now wrap errors that _really definitely_ occurred during a Dial call. We will use this class to determine whether or not to retry a GET_REPAIR operation. The error will still also be wrapped with whatever wrapper classes it used to be wrapped with, so the potential for breakage here should be minimal. Refs: https://github.com/storj/storj/issues/4687 Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb
This commit is contained in:
parent
2f20bbf4d8
commit
726c95160b
@ -34,8 +34,13 @@ import (
|
||||
"storj.io/uplink/private/piecestore"
|
||||
)
|
||||
|
||||
// ErrPieceHashVerifyFailed is the errs class when a piece hash downloaded from storagenode fails to match the original hash.
|
||||
var ErrPieceHashVerifyFailed = errs.Class("piece hashes don't match")
|
||||
var (
|
||||
// ErrPieceHashVerifyFailed is the errs class when a piece hash downloaded from storagenode fails to match the original hash.
|
||||
ErrPieceHashVerifyFailed = errs.Class("piece hashes don't match")
|
||||
|
||||
// ErrDialFailed is the errs class when a failure happens during Dial.
|
||||
ErrDialFailed = errs.Class("dial failure")
|
||||
)
|
||||
|
||||
// ECRepairer allows the repairer to download, verify, and upload pieces from storagenodes.
|
||||
type ECRepairer struct {
|
||||
@ -58,7 +63,8 @@ func NewECRepairer(log *zap.Logger, dialer rpc.Dialer, satelliteSignee signing.S
|
||||
}
|
||||
|
||||
func (ec *ECRepairer) dialPiecestore(ctx context.Context, n storj.NodeURL) (*piecestore.Client, error) {
|
||||
return piecestore.Dial(ctx, ec.dialer, n, piecestore.DefaultConfig)
|
||||
client, err := piecestore.Dial(ctx, ec.dialer, n, piecestore.DefaultConfig)
|
||||
return client, ErrDialFailed.Wrap(err)
|
||||
}
|
||||
|
||||
// Get downloads pieces from storagenodes using the provided order limits, and decodes those pieces into a segment.
|
||||
@ -134,7 +140,7 @@ func (ec *ECRepairer) Get(ctx context.Context, limits []*pb.AddressedOrderLimit,
|
||||
|
||||
pieceReadCloser, _, _, err := ec.downloadAndVerifyPiece(ctx, limit, address, privateKey, "", pieceSize)
|
||||
// if piecestore dial with last ip:port failed try again with node address
|
||||
if triedLastIPPort && piecestore.Error.Has(err) && !piecestore.CloseError.Has(err) {
|
||||
if triedLastIPPort && ErrDialFailed.Has(err) {
|
||||
if pieceReadCloser != nil {
|
||||
_ = pieceReadCloser.Close()
|
||||
}
|
||||
|
Loading…
Reference in New Issue
Block a user