storj/satellite/gracefulexit/common.go
paul cannon 72189330fd satellite/gracefulexit: revamp graceful exit
Currently, graceful exit is a complicated subsystem that keeps a queue
of all pieces expected to be on a node, and asks the node to transfer
those pieces to other nodes one by one. The complexity of the system
has, unfortunately, led to numerous bugs and unexpected behaviors.

We have decided to remove this entire subsystem and restructure graceful
exit as follows:

* Nodes will signal their intent to exit gracefully
* The satellite will not send any new pieces to gracefully exiting nodes
* Pieces on gracefully exiting nodes will be considered by the repair
  subsystem as "retrievable but unhealthy". They will be repaired off of
  the exiting node as needed.
* After one month (with an appropriately high online score), the node
  will be considered exited, and held amounts for the node will be
  released. The repair worker will continue to fetch pieces from the
  node as long as the node stays online.
* If, at the end of the month, a node's online score is below a certain
  threshold, its graceful exit will fail.

Refs: https://github.com/storj/storj/issues/6042
Change-Id: I52d4e07a4198e9cb2adf5e6cee2cb64d6f9f426b
2023-09-27 08:40:01 +00:00

// Copyright (C) 2019 Storj Labs, Inc.
// See LICENSE for copying information.

package gracefulexit

import (
	"time"

	"github.com/spacemonkeygo/monkit/v3"
	"github.com/zeebo/errs"
)

var (
	// Error is the default error class for the graceful exit package.
	Error = errs.Class("gracefulexit")

	// ErrNodeNotFound is returned if a graceful exit entry for a node does not exist in the database.
	ErrNodeNotFound = errs.Class("graceful exit node not found")

	// ErrAboveOptimalThreshold is returned if a segment has more pieces than required.
	ErrAboveOptimalThreshold = errs.Class("segment has more pieces than required")

	mon = monkit.Package()
)

// Config holds the graceful exit settings, including those for the chore.
type Config struct {
	Enabled bool `help:"whether or not graceful exit is enabled on the satellite side." default:"true"`
	TimeBased bool `help:"whether graceful exit will be determined by a period of time, rather than by instructing nodes to transfer one piece at a time" default:"false"`

	NodeMinAgeInMonths int `help:"minimum age for a node on the network in order to initiate graceful exit" default:"6" testDefault:"0"`

	// these items only apply when TimeBased=false:

	ChoreBatchSize int `help:"size of the buffer used to batch inserts into the transfer queue." default:"500" testDefault:"10"`
	ChoreInterval time.Duration `help:"how often to run the transfer queue chore." releaseDefault:"30s" devDefault:"10s" testDefault:"$TESTINTERVAL"`
	UseRangedLoop bool `help:"whether to use the GE observer with the ranged loop." default:"true"`

	EndpointBatchSize int `help:"size of the buffer used to batch transfer queue reads and sends to the storage node." default:"300" testDefault:"100"`

	MaxFailuresPerPiece int `help:"maximum number of transfer failures per piece." default:"5"`
	OverallMaxFailuresPercentage int `help:"maximum percentage of transfer failures per node." default:"10"`
	MaxInactiveTimeFrame time.Duration `help:"maximum inactive time frame of transfer activities per node." default:"168h" testDefault:"10s"`
	RecvTimeout time.Duration `help:"the minimum duration for receiving a stream from a storage node before timing out" default:"2h" testDefault:"1m"`
	MaxOrderLimitSendCount int `help:"maximum number of order limits a satellite sends to a node before marking piece transfer failed" default:"10" testDefault:"3"`

	AsOfSystemTimeInterval time.Duration `help:"interval for AS OF SYSTEM TIME clause (crdb specific) to read from db at a specific time in the past" default:"-10s" testDefault:"-1µs"`
	TransferQueueBatchSize int `help:"batch size (crdb specific) for deleting and adding items to the transfer queue" default:"1000"`

	// these items only apply when TimeBased=true:

	GracefulExitDurationInDays int `help:"number of days it takes to execute a passive graceful exit" default:"30" testDefault:"1"`
	OfflineCheckInterval time.Duration `help:"how frequently to check uptime ratio of gracefully-exiting nodes" default:"30m" testDefault:"10s"`
	MinimumOnlineScore float64 `help:"a gracefully exiting node will fail GE if it falls below this online score (compare AuditHistoryConfig.OfflineThreshold)" default:"0.8"`
}