storj/docs/testplan/scaling-audit-worker.md
nadimhq 63639e862b
docs/testplan: Testplan for Scaling Audit Worker (#5335)
This testplan is going to cover the changes to allow for scaling audit workers. It will go over the scaling audit worker design doc found under blueprints.

Co-authored-by: Antonio Franco (He/Him) <antonio@storj.io>
2023-01-04 18:49:53 +01:00


Scaling Audit Worker Testplan

 

Background

This testplan covers the changes that allow audit workers to scale. It follows the design doc found here: Scaling Audit Worker Design Doc

 

| Test Scenario | Test Case | Description | Comments |
| --- | --- | --- | --- |
| Audit Scaling | General Positive/Negative Test | When a node joins the network, it must pass a certain number of audits. If it passes that number of audits, the node is fully vetted; otherwise it is not. | |
| | Vetted Status Check | A node should be considered vetted only once it has received a total of 100 audits, so nodes below that number should not be vetted. The number of audits a new node requires to be vetted should never fall below that total, so that the quality of nodes on the network does not decrease and negatively affect network durability. | |
| | Audit Scaling | When new nodes are onboarded, the satellite should not be a limiting factor for audit scaling once the audit scaling changes are in place. | |
| | Unvetted Nodes | New (unvetted) nodes should receive a set percentage (5%) of the data from new uploads. | |
| | Vetting Times | According to metrics, the average vetting time should now be 1 month with scaled audit workers. | |
| | Avoiding DQ 1 | If a node receives another pending audit because multiple audit workers are working on the same node, this should be allowed. Otherwise a node could continuously bypass disqualification (DQ), as explained in the blueprint, lines 50-57. | |
| | Avoiding DQ 2 | In the case above, a successful audit should remove only the corresponding entry, not all pending audits. | |
| | Contained Flag | As long as a node has pending audits, its 'contained' flag should be set to true. Once it has no pending audits, the node should NOT be contained. | |
| | Decouple Logic - Pending Audits Reverification | With the new solution, nodes with pending audits should no longer be reverified and then skipped during regular audits. Instead, a separate process should iterate over the pending audits table and spin up workers to audit those particular pieces. In short: regular audits should be able to insert an entry into the pending audits table but should not be used to reverify, and there should be separate workers for regular audits and pending audits. | |
| | Audit outcome | If a node is audited without problems, its audit count should increase by 1. If the audit needs to be reverified (e.g. because of a timeout), the audit worker should place it into the pending audits table. | |
| | Reverify outcome | When a pending audit worker selects an entry from the pending audits table and the audit completes without problems, the audit count should increase by 1. If there is an issue and the process times out, the reverify count should be incremented, and the pending audit should be selected again later by a pending audit worker until the audit count increases by 1. | |
| | Decouple Logic - Next reverification | The handling depends on the state of the pending audit. If it is selected from the pending audits table for the first time, it should just wait for the next available audit worker or be selected at random. If it already has a reverification count, it should wait for x amount of time, and an attempted_at timestamp field should be created. | |
| | Failure Prevention | The retry interval mentioned above should not lead to artificial spreading of failures that can be exploited to avoid disqualification, as explained in the blueprint. With more spread-out failures, node operators could abuse this: even if their nodes suffered significant data loss, the failures would be spread out and nodes that should have been DQ'd would scrape by. | |
| | Verify Updates (Unit Tests etc.) | E.g. reverifier, audit setup... New pending audits should be inserted directly into the db, as explained in the blueprint, lines 82-147. | |
| | Node Disqualification | If a node has a pending audit and could not be reverified after a certain number of attempts, the node should be disqualified. | |
| | New Pending Audit System 1 | All audits should be allowed to add a piece to pending audits, and a successful audit should remove only the corresponding entry. | |
| | New Pending Audit System 2 | On the slim chance that multiple audit workers work on the same node, this should not cause any issues (e.g. bypassing DQ), since all audits are allowed to add a piece to pending audits. | |
| | Contained Nodes | Contained nodes should have their containment status recorded on the nodes table, including the "contained timestamp" field. | |
| | Reverifier Chore Check | The reverifier chore should change a node's contained status where appropriate, e.g. when a node that had pending audits before the chore no longer has any afterwards. | |
| | Containment Status Chore Check | This chore should check the nodes that are marked contained to see whether they still have pending audits. If they do not, the chore should unmark them from containment. | |
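
The "Avoiding DQ", "Contained Flag" and "New Pending Audit System" cases can be illustrated with a small in-memory model. This is only a sketch for reasoning about expected behaviour, not the satellite's implementation: the class names, the `(node_id, piece_id)` key and the method names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PendingAudit:
    # Hypothetical key: one entry per (node, piece) pair.
    node_id: str
    piece_id: str


@dataclass
class PendingAudits:
    """In-memory stand-in for the pending audits table (assumed shape)."""

    entries: set = field(default_factory=set)

    def insert(self, node_id: str, piece_id: str) -> None:
        # Any audit worker may add an entry, even if the node already has one.
        self.entries.add(PendingAudit(node_id, piece_id))

    def resolve(self, node_id: str, piece_id: str) -> None:
        # A successful audit removes only the matching entry.
        self.entries.discard(PendingAudit(node_id, piece_id))

    def is_contained(self, node_id: str) -> bool:
        # A node is contained while at least one pending audit remains.
        return any(e.node_id == node_id for e in self.entries)


table = PendingAudits()
table.insert("node-a", "piece-1")
table.insert("node-a", "piece-2")    # second worker, same node
table.resolve("node-a", "piece-1")   # only this entry is removed
print(table.is_contained("node-a"))  # True
table.resolve("node-a", "piece-2")
print(table.is_contained("node-a"))  # False
```

A test for "Avoiding DQ 2" would assert the first `True`: one successful audit must not clear the node's other pending audits.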
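
The "Reverify outcome", "Next reverification" and "Node Disqualification" cases describe a small state machine, sketched below. The testplan does not give concrete values for the retry interval ("x amount of time") or for the number of failed reverifications that leads to DQ, so `RETRY_INTERVAL` and `MAX_REVERIFY_COUNT` are placeholder assumptions, as are the function names and outcome strings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Placeholder values; the real limits are not stated in this testplan.
MAX_REVERIFY_COUNT = 3
RETRY_INTERVAL = timedelta(hours=4)


@dataclass
class PendingEntry:
    node_id: str
    piece_id: str
    reverify_count: int = 0
    attempted_at: Optional[datetime] = None  # unset until the first attempt


def is_due(entry: PendingEntry, now: datetime) -> bool:
    # First-time entries are eligible immediately; retried ones must wait.
    if entry.attempted_at is None:
        return True
    return now - entry.attempted_at >= RETRY_INTERVAL


def apply_reverify_outcome(entry: PendingEntry, outcome: str, now: datetime) -> str:
    """Return the follow-up action: 'remove', 'retry' or 'disqualify'."""
    if outcome == "success":
        return "remove"  # caller removes the entry and bumps the audit count
    if outcome != "timeout":
        raise ValueError(f"unexpected outcome: {outcome}")
    entry.reverify_count += 1
    entry.attempted_at = now
    if entry.reverify_count >= MAX_REVERIFY_COUNT:
        return "disqualify"
    return "retry"
```

For the "Failure Prevention" case, a test built on this model would check that the retry interval only delays the next attempt and never resets `reverify_count`, so spreading failures over time does not help a node avoid DQ.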
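
The "Containment Status Chore Check" can be expressed as a reconciliation step between the nodes table and the pending audits table. The sketch below assumes a simplified representation (a dict from node id to contained timestamp, and an iterable of `(node_id, piece_id)` pairs); the function name is hypothetical.

```python
from datetime import datetime


def sync_containment(contained_at: dict, pending_audits) -> list:
    """Unmark contained nodes that no longer have pending audits.

    contained_at maps node id -> contained timestamp (None if not contained)
    and is updated in place. Returns the ids of the nodes that were unmarked.
    """
    nodes_with_pending = {node_id for node_id, _piece_id in pending_audits}
    unmarked = []
    for node_id, timestamp in contained_at.items():
        if timestamp is not None and node_id not in nodes_with_pending:
            contained_at[node_id] = None  # node leaves containment
            unmarked.append(node_id)
    return unmarked


nodes = {
    "node-a": datetime(2023, 1, 4, 12, 0),  # contained, still has a pending audit
    "node-b": datetime(2023, 1, 4, 13, 0),  # contained, nothing pending any more
    "node-c": None,                         # not contained
}
print(sync_containment(nodes, [("node-a", "piece-1")]))  # ['node-b']
```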