docs/testplan: Testplan for Scaling Audit Worker (#5335)

This testplan is going to cover the changes to allow for scaling audit workers. It will go over the scaling audit worker design doc found under blueprints.

Co-authored-by: Antonio Franco (He/Him) <antonio@storj.io>

2023-01-04 18:49:53 +01:00

Scaling Audit Worker Testplan

Background

This testplan covers the changes that allow audit workers to scale. It is based on the design doc: Scaling Audit Worker Design Doc

Test Scenario	Test Case	Description	Comments
Audit Scaling General	Positive/Negative	When a node joins the network, it must pass a certain number of audits. If the node passes that number of audits, it is fully vetted; otherwise it is not.
	Vetted Status Check	A node should only be considered vetted once it has received a total of 100 audits, so nodes below that number should not be vetted. The number of audits a new node requires to be vetted should never fall below that total, so that the quality of nodes on the network does not decrease and negatively affect network durability.
	Audit Scaling	When new nodes are onboarded, audits should not be limited by the satellites once the audit scaling changes are introduced.
	Unvetted Nodes	New (unvetted) nodes should receive a set percentage of the data from new uploads (5%).
	Vetting Times	Once audit workers are scaled, the average vetting time (according to metrics) should be 1 month.
	Avoiding DQ 1	If a node receives another pending audit because multiple audit workers are auditing the same node, adding it should be possible; otherwise a node could continuously bypass DQ.	Explained in blueprint lines 50-57
	Avoiding DQ 2	In the case above, a successful audit should remove only the corresponding entry, not all pending audits.
	Contained Flag	As long as a node has pending audits, its 'contained' flag should be set to true; if it has no pending audits, the node should NOT be contained.
	Decouple Logic - Pending Audits Reverification	With the new solution, nodes with pending audits should no longer be reverified by, or skipped in, regular audits. Instead, a separate process should iterate over the pending audits table and spin up workers to audit those particular pieces. In short: regular audits should be able to insert an entry into the pending audits table but should not be used to reverify, and there should be separate workers for regular audits and pending audits.
	Audit outcome	If a node is audited without problems, its audit count should increase by 1. If the audit needs to be reverified (e.g. because of a timeout), the audit worker should place it into the pending audits table.
	Reverify outcome	When a pending audit worker selects an entry from the pending audits table and the audit completes without problems, the audit count should increase by 1. If there is an issue and the process times out, the reverify count should be incremented, and the pending audit should be selected again later by a pending audit worker until the audit count increases by 1.
	Decouple Logic - Next reverification	The handling depends on the state of the pending audit. If it is selected from the pending audits table for the first time, it should simply wait for the next available audit worker or be selected at random. If it already has a reverification count, it should wait for x amount of time before the next attempt, and a field for the attempted_at timestamp should be created.
	Failure Prevention	The retry interval mentioned previously should not artificially spread out failures in a way that can be exploited to avoid disqualification.	Explained in the blueprint: with more spread-out failures, node operators could abuse this. For example, even if their nodes suffered significant data loss, the failures would be spread out over time and nodes that should have been DQ'd would scrape by.
	Verify Updates (Unit Tests etc.)	E.g. reverifier, audit setup... New pending audits should be inserted directly into the db.	Explained in blueprint, lines 82-147
	Node Disqualification	If a node has a pending audit that could not be reverified after a certain number of attempts, the node should be disqualified.
	New Pending Audit System 1	All audits should be allowed to add a piece to pending audits, and a successful audit removes only the corresponding entry
	New Pending Audit System 2	On the slim chance that multiple audit workers work on the same node, this should not cause any issues (e.g. bypassing DQ), since all audits are allowed to add a piece to pending audits.
	Contained Nodes	Contained nodes should have their containment status stored on the nodes table, which should include the field "contained timestamp".
	Reverifier Chore Check	The reverifier chore should update the contained status of a node when appropriate, e.g. if a node had pending audits before the chore ran and has none afterwards.
	Containment Status Chore Check	This chore should find nodes that are marked contained and check whether they still have pending audits; if they do not, the chore should unmark them as contained.
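
The pending-audit rows above (Avoiding DQ 1 and 2, Contained Flag, New Pending Audit System 1 and 2) can be illustrated with a minimal Python sketch. It is not the satellite's implementation: the in-memory set stands in for the pending audits table, and the names PendingAudit, PendingAudits, add, remove and is_contained are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PendingAudit:
    """Hypothetical key for one pending audit: a piece held by a node."""
    node_id: str
    piece_id: str


@dataclass
class PendingAudits:
    """In-memory stand-in for the pending audits table (illustration only)."""
    entries: set = field(default_factory=set)

    def add(self, node_id: str, piece_id: str) -> None:
        # Any audit worker may add an entry, even if the node already has one.
        self.entries.add(PendingAudit(node_id, piece_id))

    def remove(self, node_id: str, piece_id: str) -> None:
        # A successful audit removes only the matching entry, not every
        # pending audit for the node.
        self.entries.discard(PendingAudit(node_id, piece_id))

    def is_contained(self, node_id: str) -> bool:
        # A node is contained exactly while it has at least one pending audit.
        return any(entry.node_id == node_id for entry in self.entries)


table = PendingAudits()
table.add("node-a", "piece-1")       # one worker times out on piece-1
table.add("node-a", "piece-2")       # another worker times out on piece-2
table.remove("node-a", "piece-1")    # piece-1 later passes
print(table.is_contained("node-a"))  # True: piece-2 is still pending
table.remove("node-a", "piece-2")    # piece-2 later passes
print(table.is_contained("node-a"))  # False: nothing is pending
```

Keying entries by node and piece is what lets a second worker add an entry for the same node while a success clears only its own entry, which is the behavior the test cases check.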

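The Reverify outcome, Next reverification and Node Disqualification rows can be sketched in the same way. The testplan only speaks of "a certain number" of reverifications and "x amount of time" between attempts, so MAX_REVERIFY_COUNT and RETRY_INTERVAL below are placeholder assumptions, as are the Node and PendingEntry types and the function names. Only the two outcomes named in the table (success and timeout) are modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Placeholder values (assumptions); the testplan does not state them.
MAX_REVERIFY_COUNT = 3
RETRY_INTERVAL = timedelta(hours=4)


@dataclass
class Node:
    audit_count: int = 0
    disqualified: bool = False


@dataclass
class PendingEntry:
    reverify_count: int = 0
    attempted_at: Optional[datetime] = None


def is_due(entry: PendingEntry, now: datetime) -> bool:
    """First-time entries are due at once; retried ones wait RETRY_INTERVAL."""
    if entry.reverify_count == 0 or entry.attempted_at is None:
        return True
    return now - entry.attempted_at >= RETRY_INTERVAL


def reverify(node: Node, entry: PendingEntry, passed: bool, now: datetime) -> bool:
    """Apply one reverification outcome; return True if the entry stays pending."""
    if passed:
        node.audit_count += 1      # success counts as one audit
        return False               # the entry can be removed
    entry.reverify_count += 1      # timeout: record the failed attempt
    entry.attempted_at = now
    if entry.reverify_count >= MAX_REVERIFY_COUNT:
        node.disqualified = True   # too many failed reverifications
        return False
    return True                    # select again after RETRY_INTERVAL


node, entry = Node(), PendingEntry()
start = datetime(2023, 1, 4, 12, 0)
reverify(node, entry, passed=False, now=start)
print(is_due(entry, start + timedelta(hours=1)))   # False: still waiting
print(is_due(entry, start + RETRY_INTERVAL))       # True: interval elapsed
reverify(node, entry, passed=True, now=start + RETRY_INTERVAL)
print(node.audit_count, node.disqualified)         # 1 False
```

A test for the Failure Prevention row would vary RETRY_INTERVAL and confirm that a node with repeated timeouts still reaches disqualification.
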