docs/blueprints: design doc for changes needed to allow for scaling audit workers
Change-Id: I3e214c900a02ba15c2fb47558f0bc75310db2a7c
# Audit Scaling

## Goal:

When a node joins the network, it goes through a vetting process. One aspect of vetting is auditing the node for pieces it should be storing: a node must successfully complete a certain number of audits to pass the vetting process. The satellite is limited in how many total audits it can perform, so as we onboard new nodes, vetting takes increasingly longer for each individual node. We need to be able to scale auditing depending on how many new nodes recently joined. At the moment, each satellite has a default of 2 concurrent audit workers.

However, as more data is uploaded, new nodes are more likely to receive data, and therefore more audits. A node is considered vetted when it receives a total of 100 audits. Unvetted/new nodes get 5% of data from new uploads, which means less than 5% of all audits go to these new nodes. We don't want to increase the percentage of uploads that go to new nodes, nor do we want to decrease the number of audits it takes to vet a node, because we don't want to risk lowering the overall quality of nodes on the network and affecting network durability.

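To make the scaling pressure concrete, below is a rough back-of-the-envelope sketch in Go. The only numbers taken from this document are the 100-audit vetting threshold and the 5% share of new uploads; the total audit throughput and unvetted node counts are made-up placeholders.

```go
package main

import "fmt"

// estimateVettingDays roughly estimates how long vetting takes when a fixed
// total audit throughput is shared by all unvetted nodes. Only the 100-audit
// threshold and the 5% share come from this document; everything else is a
// hypothetical placeholder.
func estimateVettingDays(totalAuditsPerDay float64, unvettedNodes int) float64 {
	const auditsToVet = 100.0 // audits a node needs to pass vetting
	const newNodeShare = 0.05 // at most 5% of audits hit unvetted nodes

	auditsPerNewNodePerDay := totalAuditsPerDay * newNodeShare / float64(unvettedNodes)
	return auditsToVet / auditsPerNewNodePerDay
}

func main() {
	// With the same total throughput, doubling the number of unvetted nodes
	// doubles the expected vetting time, hence the need to scale workers.
	fmt.Println(estimateVettingDays(100000, 1000)) // 20 days
	fmt.Println(estimateVettingDays(100000, 2000)) // 40 days
}
```
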
See this [dashboard](https://redash.datasci.storj.io/dashboard/vetting?p_FILTER=created_at&p_PERIOD=year) to compare vetting times.
Here is a screenshot in case you can't access the dashboard. It shows percentiles for how long nodes took to get vetted, grouped by the month they were vetted.
![vetted-times-grouped-by-month](./images/audit-scaling-month-vetted.png)

In terms of how we're going to scale the number of audit workers (both for regular audits and for reverification audits), we plan to manually adjust them based on metrics. It currently takes on average between 2 and 3 months for a node to get vetted. We may want to lower this to 1 month.

## Problem:

We can't safely scale the number of audit workers with the current implementation of the containment system. The core issue is that there can only be one piece pending audit for a node at a time. The function [IncrementPending](https://github.com/storj/storj/blob/270204f352af0a7e3cb16c19dc988aa6a90563e0/satellite/satellitedb/containment.go#L66) effectively dismisses any new pending audit for a node if it is not for the same piece as the one already recorded.

```go
if !bytes.Equal(existingAudit.ExpectedShareHash, pendingAudit.ExpectedShareHash) {
	containment.db.log.Info("pending audit already exists",
		zap.String("node id", pendingAudit.NodeID.String()),
		zap.String("segment streamid", pendingAudit.StreamID.String()),
		zap.Uint64("segment position", pendingAudit.Position.Encode()),
	)
	return nil
}
```

A pending audit is an audit for a piece that needs to be re-verified because the connection expired before the online node responded to the request. If a node has a pending audit, it is said to be contained. We will re-verify, or check for the piece again, a certain number of times before disqualifying the node. If the node passes re-verification before reaching the max retry limit, it is removed from containment mode.

When there are multiple audit workers and more than one audits a node within the same timeframe, the node can cheat the system by causing a pending audit to be created only for a piece it does have, concealing the fact that it is missing other data. The likelihood of tricking the system increases with the number of workers concurrently auditing a node. Currently we have two audit workers, so there is only a small chance of them auditing the same node within the same time period (the timeout window is set to 5 minutes). However, as we increase the number of workers, the likelihood also increases.

Here is an example, using two workers for simplicity. Let's say these two workers, A1 and A2, are auditing pieces P1 and P2 on node N, respectively, and they audit N within the same 5 minute window. N has a correct version of P2 but not P1, so it closes the connection to A2 first (this would require a modified node program). A2 then puts P2 into pending audits and contains N. Once the connection to A1 is closed, A1 will attempt to place P1 in pending audits (in method IncrementPending), but since there is already a pending audit for N, P1 is never tracked in pending audits. P2 will be reverified the next round, and N can return the correct data and remove itself from containment mode. It can continue this process and indefinitely avoid disqualification, as long as multiple workers audit it concurrently.

Additionally, the node has a 'contained' flag that is set when it has pending audits and unset when its audit score is modified. We don't use this flag for anything other than showing a status on the node dashboard, but it can get out of sync with the pending audits, and that will need to be addressed.

## Solution:

All audits should be allowed to add a piece to pending audits, and a successful audit removes only the corresponding entry. The contained flag will remain set to true as long as there are pending audits for the node.

We think the following solution will decouple the logic around regular audits and reverification audits:

- Rather than reverifying nodes with pending audits and skipping them in the regular audit (see satellite/audit/worker.go:work), a separate process will iterate over the pending audits table and spin up workers to audit those particular pieces.
- A regular audit can insert an entry into the pending audits table.
- A pending audit worker will select an entry to process from the pending audits table.
- The result can be any of the audit outcomes; if the node times out again, the reverify count will be incremented.
- The next entry can be selected as the oldest available one (check the last attempted time).
- If a pending audit was attempted and the reverification count was increased, don't try it again for x amount of time. Add a field for an attempted_at timestamp, e.g. `WHERE attempted_at IS NULL OR attempted_at < now() - interval '6 hour'`, as in repairqueue.go:Select for finding items to repair (see the sketch after this list).
- Contained nodes will no longer be selected for new uploads.

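As a rough illustration of the selection described in the last two bullets, here is a sketch of picking the next entry to reverify. It assumes the reverification_audits table and the nullable last_attempt column proposed in Part 1 below (called attempted_at in the example above); the function name, the plain database/sql usage, and the simplified field types are illustrative only, not the satellite's actual DB layer.

```go
package audit

import (
	"context"
	"database/sql"
	"time"
)

// PendingAudit is a simplified stand-in for the satellite's pending audit
// struct; field names and types here are assumptions for this sketch.
type PendingAudit struct {
	NodeID            []byte
	StreamID          []byte
	Position          uint64
	ExpectedShareHash []byte
	ReverifyCount     int32
	LastAttempt       *time.Time
}

// SelectNextPendingAudit returns the oldest entry that has not been attempted
// in the last six hours, so a reverifier worker can process it next.
func SelectNextPendingAudit(ctx context.Context, db *sql.DB) (*PendingAudit, error) {
	row := db.QueryRowContext(ctx, `
		SELECT node_id, stream_id, position, expected_share_hash, reverify_count, last_attempt
		FROM reverification_audits
		WHERE last_attempt IS NULL OR last_attempt < now() - interval '6 hour'
		ORDER BY last_attempt NULLS FIRST
		LIMIT 1
	`)

	pending := &PendingAudit{}
	var lastAttempt sql.NullTime
	err := row.Scan(&pending.NodeID, &pending.StreamID, &pending.Position,
		&pending.ExpectedShareHash, &pending.ReverifyCount, &lastAttempt)
	if err != nil {
		return nil, err
	}
	if lastAttempt.Valid {
		pending.LastAttempt = &lastAttempt.Time
	}
	return pending, nil
}
```
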
### Implementation details:

**Part 1. Implement new pending audit system**

- Create a new db table called reverification_audits, based on segment_pending_audits (see the migration sketch at the end of Part 1)
  - switch the primary key from node_id to the combination (node_id, stream_id, position)
  - we don't need stripe_index, since we want to download the whole piece (client.Download with offset 0)
  - add last_attempt: timestamp (nullable)
  - similar delete and read queries, but using the new primary key
  - migration plan: keep segment_pending_audits and drop it once this project is completed
- create audit/reverifier.go methods
  - implement a reverify method based on the original in audit/verifier.go
  - we can target the specific piece in question, rather than checking all pieces in the segment
- create audit/reverifier.go chore
  - will use workers similar to those in audit/worker.go
  - Example:
```go
type Reverifier struct {
	log *zap.Logger

	verifier *Verifier // includes access to containment db
	reporter Reporter

	Loop    *sync2.Cycle
	limiter *sync2.Limiter
}

func NewReverifier() {}

func (reverifier *Reverifier) Run(ctx context.Context) (err error) {}

func (reverifier *Reverifier) Close() error {}

// process will do the job of the current verifier.go:Reverify method,
// but for a single pending audit entry.
func (reverifier *Reverifier) process(ctx context.Context, pendingAudit *PendingAudit) error {}
```
- create method satellitedb/containment.go:Insert
  - similar to the existing IncrementPending, but remove the query for existing rows and just insert the new pending audit directly into the db
- create method satellitedb/containment.go:Increment
  - increments the reverify count on the pending audit
- add new satellite/audit/containment interface methods
  - Get could follow the same query pattern as the repair queue (satellitedb/repairqueue.go:Select)
```go
type Containment interface {
	Get              // old
	IncrementPending // old
	Delete           // old
	Select(ctx context.Context) (*PendingAudit, error)               // new version of Get
	Insert(ctx context.Context, pendingAudit *PendingAudit) error    // new
	Increment(ctx context.Context, pendingAudit *PendingAudit) error // new version of IncrementPending
	Remove(ctx context.Context, <key>) (bool, error)                 // new version of Delete
}
```

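For reference, here is a minimal sketch of the proposed table, written as a SQL string the way satellite migrations embed schema changes. Only the table name, the (node_id, stream_id, position) primary key, the reverify count, and the nullable last_attempt column come from the plan above; the column types, and the choice to show raw SQL rather than dbx, are assumptions.

```go
package satellitedb

// createReverificationAudits sketches the proposed table. The real definition
// would live in the satellite's dbx schema and migration files; the types and
// defaults here are assumptions.
const createReverificationAudits = `
	CREATE TABLE reverification_audits (
		node_id             bytea   NOT NULL,
		stream_id           bytea   NOT NULL,
		position            bigint  NOT NULL,
		expected_share_hash bytea   NOT NULL,
		reverify_count      integer NOT NULL DEFAULT 0,
		last_attempt        timestamp with time zone, -- nullable: not attempted yet
		PRIMARY KEY ( node_id, stream_id, position )
	);
`
```
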
**Part 2. Updates to existing audit system (decouple logic around regular vs re-verification audits)**

- remove the call to reverify and related logic
  - in audit/worker.go:work() we currently attempt to reverify nodes in containment mode for the segment being audited
  - since we want verify and reverify to be separate processes, we can remove all of the reverify-related logic here
- update audit/verifier.go
  - remove the reference to containment from the Verifier struct
  - delete the existing Reverify method
- remove the satellitedb/containment.go methods that are no longer needed, and switch any that are still needed over to the new versions
- satellite/core: in the audit setup, add Reverifier *audit.Reverifier

**Part 3. Keep node containment status updated**

- Update nodes table
  - add `field contained timestamp ( updatable, nullable )` (or bool)
- Keep this status synced:
  - check for changes in node contained status during the reverifier chore
    - e.g. compare which nodes have pending audits before the chore vs after, then update the nodes table
  - Create a chore on the nodes table to verify the containment status
    - every so often, check whether the nodes that are marked contained still have pending audits, and unmark them if they don't
    - this is to ensure we don't accidentally keep healthy nodes in a contained state
- satellitedb/overlaycache.go:selectAllStorageNodesUpload
  - add `AND NOT contained` to the node selection query to filter contained nodes out of getting more data (see the sketch after this list)
- satellitedb/reputations.go
  - populateUpdateNodeStats (line 508): remove logic around containment
  - updateNodeStats struct (line 641): remove the contained field
- satellite/overlay/service.go (if needed)
  - NodeDossier: make sure to update the contained flag if the pending audits table contains any entries for the node

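To illustrate the two queries above: the upload selection change amounts to adding `AND contained IS NULL` (for a nullable timestamp) or `AND NOT contained` (for a bool) to the node selection WHERE clause, and the consistency chore could run something roughly like the sketch below. Any table or column names beyond the nodes table, the contained column, and the reverification_audits table proposed in this document are assumptions.

```go
package satellitedb

// clearStaleContainment sketches the periodic consistency chore: unmark nodes
// that no longer have any pending audits, so that healthy nodes don't stay
// contained by accident. Column names are assumptions for this sketch.
const clearStaleContainment = `
	UPDATE nodes
	SET contained = NULL
	WHERE contained IS NOT NULL
	  AND NOT EXISTS (
		SELECT 1
		FROM reverification_audits
		WHERE reverification_audits.node_id = nodes.id
	  );
`
```
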
**QA**

- Create a test plan
- Test that the original cheater strategy is no longer viable

**Deployment**

- Configure the number of audit workers for the verifier and reverifier

**Post-deployment**

- Monitor vetting times for new nodes and scale audit workers accordingly

### Future Work

Should we consider auditing new nodes at a different cadence from vetted nodes? This would require significant refactoring.

- This has been suggested by the community too. I don't think we can make it work without changing the entire architecture of the product, but what we _could_ do is use a different cadence for new _segments_ instead of new nodes. Old segments tend to be on old nodes (not always, but there is a correlation), so that would get us at least some of the same benefit.

Auto-scaling workers: scaling of those should depend on the number of unvetted nodes and the average time in vetting. Scaling for normal audits, however, should depend on the number of vetted nodes and the amount of data on them.

For possible efficiency, a pending audit worker could query for one node ID and process all pending audits for that node at once. This could cause artificial clustering that we might want to avoid, but that is a tradeoff we could reconsider for performance. Each worker would essentially have its own queue of pending audits to work from (see the sketch below).

- If each item results in a successful response (not offline or timeout), continue; otherwise, skip it until the next iteration of the chore? This may give the node too much leeway in getting away with missing data by timing out (artificial spreading).
- Query by unique node ID and use that as the queue for querying all pending audits by node ID (index on node_id, or node_id as the first column of the primary key).

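A minimal sketch of how such a per-node queue could be fed, assuming the same reverification_audits table as above; the function name and the plain database/sql usage are illustrative only.

```go
package audit

import (
	"context"
	"database/sql"
)

// nodesWithPendingAudits sketches the outer queue: each distinct node ID
// becomes a unit of work, and a worker then pulls every pending audit for
// the node it picked.
func nodesWithPendingAudits(ctx context.Context, db *sql.DB) ([][]byte, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT DISTINCT node_id FROM reverification_audits`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var nodeIDs [][]byte
	for rows.Next() {
		var id []byte
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		nodeIDs = append(nodeIDs, id)
	}
	return nodeIDs, rows.Err()
}
```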