When a node joins the network, it goes through a vetting process. One aspect of vetting is auditing
the node for pieces it should be storing. A node must successfully complete a certain number of audits to pass the
vetting process. As more nodes join the network, vetting takes longer for each individual node because the satellite is
limited in how many total audits it can perform. We need to be able to scale auditing depending on how many new nodes
recently joined. At the moment, each satellite has a default of 2 concurrent audit workers.
However, the more data is uploaded, the more likely new nodes are to receive data, and therefore audits.
A node is considered vetted when it receives a total of 100 audits. Unvetted/new nodes get 5% of data
from new uploads, which means less than 5% of all audits go to these new nodes. We don't want to increase
the percentage of uploads that go to new nodes, nor do we want to decrease the number of audits it takes
to vet a node, because either change risks lowering the overall quality of nodes on the network, affecting network durability.
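To make the scaling problem concrete, here is a rough back-of-the-envelope estimate. The per-worker audit rate and unvetted node counts below are illustrative assumptions, not measured values; only the worker count, the 5% share, and the 100-audit threshold come from the numbers above.

```
package main

import "fmt"

func main() {
	// Illustrative assumptions, not measured values.
	const (
		auditsPerWorkerPerHour = 50.0  // hypothetical per-worker audit throughput
		workers                = 2.0   // current default concurrent audit workers
		newNodeShare           = 0.05  // fraction of audits landing on unvetted nodes
		auditsToVet            = 100.0 // audits required to pass vetting
	)
	for _, unvettedNodes := range []float64{100, 500, 1000} {
		// Audits per hour that reach one unvetted node, assuming the
		// new-node share is spread evenly across all unvetted nodes.
		perNodePerHour := auditsPerWorkerPerHour * workers * newNodeShare / unvettedNodes
		hours := auditsToVet / perNodePerHour
		fmt.Printf("%5.0f unvetted nodes -> ~%.0f days for one node to reach 100 audits\n", unvettedNodes, hours/24)
	}
}
```

Under these assumptions, vetting time grows linearly with the number of unvetted nodes but shrinks linearly with the number of workers, which is why scaling the workers is the lever we want.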
See this [dashboard](https://redash.datasci.storj.io/dashboard/vetting?p_FILTER=created_at&p_PERIOD=year) to compare vetting times.
Here is a screenshot in case you can't access the dashboard. It shows percentiles for how long nodes took to get vetted, grouped by the month they were vetted.
A pending audit is an audit for a piece that needs to be re-verified because the connection expired before the online node responded to the request.
If a node has a pending audit, it is said to be contained. We will re-verify, or check for the piece again, a certain
number of times before disqualifying the node. If the node passes the reverification before hitting the max retry limit, it is
removed from containment mode.
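A minimal sketch of the containment loop described above. The names and the retry limit are assumptions for illustration, not the satellite's actual implementation:

```
// PendingAudit is a simplified stand-in for the satellite's pending audit record.
type PendingAudit struct {
	NodeID        string
	PieceID       string
	ReverifyCount int
}

const maxReverifyCount = 3 // hypothetical retry limit

type reverifyResult int

const (
	gotPiece reverifyResult = iota // node returned the correct data
	timedOut                       // connection expired again
	badPiece                       // node returned wrong or missing data
)

// handleReverify reports whether the node stays contained or is disqualified
// after one reverification attempt.
func handleReverify(p *PendingAudit, r reverifyResult) (contained, disqualified bool) {
	switch r {
	case gotPiece:
		return false, false // piece verified: released from containment
	case timedOut:
		p.ReverifyCount++
		if p.ReverifyCount >= maxReverifyCount {
			return false, true // exceeded the retry limit: disqualify
		}
		return true, false // stay contained; retry later
	default:
		return false, true // wrong/missing data counts as a failed audit
	}
}
```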
When there are multiple audit workers and more than one audits a node within the same timeframe, the node can cheat the system by creating a pending audit only for a piece it does
have, concealing the fact that it is missing other data. The likelihood of tricking the system increases with the
number of workers concurrently auditing a node. Currently, with two audit workers, there is only a small chance of two audits hitting the
same node within the same time period (the timeout window is set to 5 minutes). However, as we increase the number of
workers, the likelihood also increases.
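As a rough illustration of how the collision odds scale with worker count, here is a birthday-style estimate. The node count and the assumption that each worker picks one node uniformly at random per window are hypothetical simplifications:

```
package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumption: each worker independently audits one node chosen uniformly
	// from n active nodes within the same 5-minute timeout window.
	const n = 10000.0 // hypothetical number of active nodes
	for _, w := range []float64{2, 10, 50, 100} {
		// Probability that at least two of the w concurrent audits land on
		// the same node: ~ 1 - exp(-w(w-1)/(2n)).
		p := 1 - math.Exp(-w*(w-1)/(2*n))
		fmt.Printf("%3.0f workers -> ~%5.2f%% chance of a same-node collision per window\n", w, p*100)
	}
}
```

The point is the quadratic growth: doubling the workers roughly quadruples the chance of two audits overlapping on one node.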
Here is an example, using two workers for simplicity. Say these two workers, A1 and A2, are auditing pieces
P1 and P2 on node N, respectively, and both audit N within the same 5 minute window. N has a correct version of P2 but not P1,
so it closes the connection to A2 first (this would have to be a modified node program).
A2 then puts P2 into pending audits and contains N. Once the connection
to A1 is closed, A1 will attempt to place P1 in pending audits (in method IncrementPending), but since there is already a
pending audit for N, P1 is not tracked in pending audits. P2 will be reverified the next round; N can return the
correct data and remove itself from containment mode. It can repeat this process and indefinitely avoid disqualification,
as long as multiple workers audit it concurrently.
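A simplified sketch of why the second pending audit is dropped, reusing the PendingAudit type from the sketch above. This is a paraphrase of the behavior described in the example, with assumed names; the real IncrementPending in satellite/audit differs in its details:

```
// containment tracks at most one pending audit per node, which is exactly
// the property the cheating node exploits.
type containment struct {
	pending map[string]*PendingAudit // keyed by node ID
}

func (c *containment) IncrementPending(p *PendingAudit) {
	if _, exists := c.pending[p.NodeID]; exists {
		// A pending audit for this node already exists (P2 in the example),
		// so the new one (P1, the piece the node is missing) is never recorded.
		return
	}
	c.pending[p.NodeID] = p
}
```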
Additionally, the node has a 'contained' flag that is set when it has pending audits, and unset when its audit score is
modified. We don't use this flag for anything other than a status on the node dashboard, but this is still an
inconsistency, since the flag is cleared when the score changes rather than when the pending audits are actually resolved.
A solution that we think will decouple the logic around regular audits and reverification audits is the following:
- Rather than reverifying nodes with pending audits and skipping them in the regular audit (see satellite/audit/worker.go:work),
there will be a separate process that iterates over the pending audits table and spins up workers to audit those particular pieces.
- A regular audit can insert an entry into the pending audits table.
- A pending audit worker will select an entry to process from the pending audits table.
- The result can be any of the audit outcomes; if the request times out again, the reverify count will be incremented.
- The next entry can be selected by oldest available (check the last attempted time).
- If a pending audit was attempted and the reverification count was increased, don't try it again for x amount of time. Add a field for an attempted_at timestamp,
eg `WHERE attempted_at IS NULL OR attempted_at < now() - interval '6 hour'`, similar to how repairqueue.go:Select finds items to repair (see the selection sketch at the end of this section).
- Contained nodes will no longer be selected for new uploads.
```
// this will do the job of the current verifier.go:reverify method
func (reverifier *Reverifier) process(ctx context.Context, pendingaudit *PendingAudit) error {}
```
- Create method satellitedb/containment.go:Insert
  - Similar to the existing IncrementPending, but remove the query for existing rows; just insert the new pending audit directly into the db.
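A hedged sketch of how the pending audit worker might select the next entry, modeled on the repairqueue.go:Select pattern referenced above and reusing the PendingAudit type from the earlier sketch. The table and column names are assumptions for illustration:

```
import (
	"context"
	"database/sql"
)

// SelectNext picks the oldest available pending audit that has not been
// attempted recently. Table and column names are assumptions, not the
// satellite's actual schema.
func SelectNext(ctx context.Context, db *sql.DB) (*PendingAudit, error) {
	const q = `
		SELECT node_id, piece_id, reverify_count
		FROM pending_audits
		WHERE attempted_at IS NULL
		   OR attempted_at < now() - interval '6 hour'
		ORDER BY attempted_at NULLS FIRST
		LIMIT 1`
	p := &PendingAudit{}
	err := db.QueryRowContext(ctx, q).Scan(&p.NodeID, &p.PieceID, &p.ReverifyCount)
	if err != nil {
		return nil, err // sql.ErrNoRows means there is nothing to reverify yet
	}
	return p, nil
}
```

After a timed-out attempt, the worker would set attempted_at to now() and increment reverify_count, so the entry naturally falls to the back of the queue until its backoff window expires.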