docs: audit-scaling: clarify process structure

and clarify some related implementation details.

Most notably, this change clarifies that the verification audit workers
and reverification audit workers are meant to live in a process or
processes separate from the satellite core, and outlines an extra queue
that will be used for communication with the core.

It's not entirely clear to me that this is the right approach; we would
save some fairly significant implementation time by leaving both types
of worker in the core. That would make it necessary to reconfigure and
restart the core when we wanted to change the number of verification
and/or reverification workers, and scaling would be limited to the
computational capacity of the core vm, but maybe those are acceptable
conditions.

Another option would be to leave the Verifier workers in the core and have a separate process for
Reverifiers only. That would be sort of a middle way between the two approaches above.

Change-Id: Ida12e423b94ef6088733b13d5cc58bdb78f2e93f

@@ -63,11 +63,20 @@ Additionally, the node has a 'contained' flag that is set when it has pending audit
modified. We don't use this flag for anything other than a status on the node dashboard, but this is still an
inconsistency that will need to be addressed.
Finally, we don't have as much flexibility in scaling audits as we might like, since in the current system all audits are
performed in the core process (because the decisions about what to audit come by way of the metainfo loop). If we had some
sort of interprocess queue for both initial audits and reverification audits, we could break out both of those into
separate processes, which would be scalable independently of each other and without reconfiguring and restarting the
satellite core.
## Solution:
All audits should be allowed to add a piece to pending audits, and a successful audit removes only the corresponding entry.
The contained flag will remain set to true as long as there are pending audits for the node.
New interprocess queues will be created which will communicate audit jobs (both verifications and reverifications) to audit
workers, which will live outside of the satellite core. These queues can be implemented and managed similarly to the existing
repair queue.
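As a rough sketch of the shape such a queue could take (all names and signatures here are hypothetical, modeled loosely on the repair queue, and not part of this design):

```go
// Hypothetical sketch of a database-backed audit queue; not the final API.
package auditqueue

import (
	"context"
	"time"

	"storj.io/common/uuid"
)

// VerificationJob identifies one segment that a verification worker should audit.
type VerificationJob struct {
	StreamID      uuid.UUID
	Position      uint64 // encoded segment position
	ExpiresAt     *time.Time
	EncryptedSize int32
	InsertedAt    time.Time
}

// Queue is backed by a database table (like the repair queue), so any number
// of worker processes can share it without going through the satellite core.
type Queue interface {
	// Push enqueues a batch of segments chosen for audit.
	Push(ctx context.Context, jobs []VerificationJob) error
	// Next dequeues the next job and stamps its last_attempt time, so a job
	// held by a crashed worker becomes visible again later.
	Next(ctx context.Context) (VerificationJob, error)
}
```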
A solution that we think will decouple the logic around regular audits and reverification audits is the following:
- Rather than reverifying nodes with pending audits and skipping them in the regular audit (see satellite/audit/worker.go:work),
there will be a separate process that iterates over the pending audits table and spins up workers to audit those particular pieces (see the sketch after this list).
@@ -80,11 +89,16 @@ A solution that we think will decouple the logic around regular audits and rever
- Contained nodes will no longer be selected for new uploads
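A minimal sketch of how that separate reverification process could drive its workers (hypothetical types and names, assuming an errgroup-based pool; the real loop will differ):

```go
package audit // illustrative sketch only; stand-ins for the real satellite types

import (
	"context"
	"log"

	"golang.org/x/sync/errgroup"
)

// PendingReverification and ReverifyQueue stand in for the real types backed
// by the reverification_audits table.
type PendingReverification struct{}

type ReverifyQueue interface {
	// Next returns the next entry to retry, updating its last_attempt.
	Next(ctx context.Context) (*PendingReverification, error)
}

// Reverifier plays the role of audit/reverifier.go's Reverifier.
type Reverifier struct{}

func (r *Reverifier) process(ctx context.Context, p *PendingReverification) error {
	return nil // the real method re-attempts the pending piece audit
}

// runReverifyWorkers pulls entries from the reverification queue and fans
// them out to a fixed pool of workers.
func runReverifyWorkers(ctx context.Context, queue ReverifyQueue, reverifier *Reverifier, numWorkers int) error {
	jobs := make(chan *PendingReverification)
	group, ctx := errgroup.WithContext(ctx)

	for i := 0; i < numWorkers; i++ {
		group.Go(func() error {
			for job := range jobs {
				// a failed attempt leaves the entry in the table (only
				// last_attempt changes), so it will be retried later
				if err := reverifier.process(ctx, job); err != nil {
					log.Printf("reverification failed: %v", err)
				}
			}
			return nil
		})
	}

	group.Go(func() error {
		defer close(jobs)
		for {
			job, err := queue.Next(ctx)
			if err != nil {
				return err // includes context cancellation
			}
			select {
			case jobs <- job:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	})

	return group.Wait()
}
```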
### Implementation details:
**Part 1. Implement new pending audit system**
- Create a new db table called reverification_audits based on segment_pending_audits
- switch primary key from nodeid to combination of (node_id, stream_id, position)
- we don't need stripe_index since we want to download the whole piece (client.Download with offset 0)
- add last_attempt: timestamp (nullable)
**Part 1. Implement new pending reverifications system**
- Create a new db table called `verification_audits`
- primary key (`stream_id`, `position`)
- additional columns `expires_at`, `encrypted_size`, `inserted_at`, and `last_attempt`
- secondary index on `last_attempt`
- Create a new db table called `reverification_audits` based on `segment_pending_audits`
- switch primary key from nodeid to combination of (`node_id`, `stream_id`, `position`)
- we don't need `stripe_index` since we want to download the whole piece (`client.Download` with offset 0)
- add `last_attempt: timestamp` (nullable)
- secondary index on `last_attempt`
- similar delete and read queries, but using the new primary key
- migration plan: keep `segment_pending_audits` for now and drop it once this project is completed
- create audit/reverifier.go methods
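For illustration, the two tables described above might look roughly like this; the column types are guesses from the bullets, and the real schema will be defined in the satellite's migration/dbx files:

```go
// Hypothetical DDL for the two queue tables; names and types may differ.
const createAuditQueueTables = `
CREATE TABLE verification_audits (
	inserted_at    timestamp NOT NULL DEFAULT current_timestamp,
	stream_id      bytea     NOT NULL,
	position       bigint    NOT NULL,
	expires_at     timestamp,
	encrypted_size integer   NOT NULL,
	last_attempt   timestamp,
	PRIMARY KEY ( stream_id, position )
);
CREATE INDEX verification_audits_last_attempt ON verification_audits ( last_attempt );

CREATE TABLE reverification_audits (
	node_id      bytea     NOT NULL,
	stream_id    bytea     NOT NULL,
	position     bigint    NOT NULL,
	inserted_at  timestamp NOT NULL DEFAULT current_timestamp,
	last_attempt timestamp,
	-- ...plus the columns carried over from segment_pending_audits,
	-- minus stripe_index, which is no longer needed
	PRIMARY KEY ( node_id, stream_id, position )
);
CREATE INDEX reverification_audits_last_attempt ON reverification_audits ( last_attempt );
`
```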
@@ -128,11 +142,18 @@ func (reverifier *Reverifier) process(ctx context.Context, pendingaudit *Pending
- remove call to reverify and related logic
- in audit/worker.go:work(), we currently attempt to reverify any nodes for the segment that are in containment mode.
- Since we want verify and reverify to be in separate processes, we can remove all logic related to reverify here.
- update audit.go/verifier
- change the audit chore to put segments into the `verification_audits` queue from the sampling reservoir
- change the audit worker to get segments from the `verification_audits` queue
- update audit/verifier.go
- remove reference to containment from verifier struct
- delete existing reverify method
- remove the satellitedb/containment.go methods that are no longer needed, and replace any that are still needed with new versions
- satellite/core: audit setup, add Reverifier *audit.Reverifier
- satellite/core.go: remove audit setup (except the Reporter, if it is still needed by existing code)
- satellite/auditor.go:
- create this process (a `Peer` like `satellite.Repairer`)
- add `Reverifier *audit.Reverifier` to the audit setup.
- the number of verifier workers and the number of reverifier workers should both be configurable, with "0"
being an acceptable value for either.
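For example, the new peer's configuration could expose the two counts in the satellite's usual config-struct style (hypothetical flag names and defaults):

```go
// Config is a hypothetical configuration block for the new auditor peer.
type Config struct {
	// 0 is acceptable for either count, so an operator can run
	// verification-only or reverification-only auditor processes.
	VerificationWorkers   int `help:"number of verification audit workers to run in this process" default:"2"`
	ReverificationWorkers int `help:"number of reverification audit workers to run in this process" default:"2"`
}
```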
**Part 3. Keep node containment status updated**
- Update nodes table
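One possible mechanism, purely as an assumption about how this part could be implemented: run a query like the following whenever entries are added to or removed from `reverification_audits`, so the flag stays true exactly as long as the node has pending audits:

```go
// Hypothetical query keeping nodes.contained in sync with the
// reverification_audits table; the real mechanism may differ.
const syncContainedFlag = `
UPDATE nodes
SET contained = EXISTS (
	SELECT 1 FROM reverification_audits
	WHERE reverification_audits.node_id = nodes.id
)
WHERE nodes.id = $1;
`
```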
@@ -156,10 +177,13 @@ func (reverifier *Reverifier) process(ctx context.Context, pendingaudit *Pending
- Test that the original cheater strategy is no longer viable
**Deployment**
- Configure the number of audit workers for verifier and reverifier
- Configure the number of verifier audit workers and reverifier workers
- Set up new audit process to be deployed and scaled as appropriate
- During the transition time, the old system and the new system can safely coexist
**Post-deployment**
- monitor vetting times for new nodes and scale audit workers accordingly
- the old `segment_pending_audits` queue/table and any remaining contents can be dropped
### Future Work
Should we consider new nodes for audits at a different cadence from vetted nodes? This would require significant refactoring.