# Auditing V2: Random Node Selection

## Abstract

This design document describes auditing based on reservoir sampling of segments per node.

## Background

As our network grows, it will take longer for nodes to get vetted, because every time an upload happens, we send 5% of the uploaded data to unvetted nodes and 95% to vetted nodes.

Currently, we select a random stripe from a random segment for audits. This correlates with auditing per byte, which means we are less likely to audit an unvetted node, because only 5% of data gets uploaded to unvetted nodes. As the network grows, it becomes increasingly unlikely that any given unvetted node will be audited. With a satellite holding one petabyte of data, new nodes take one month to get vetted. With 12 PB of data on the network, vetting would take 12 months, which is much too long. We need a scalable approach.

We want a way to select segments to audit such that every node has an equal likelihood of being audited.

## Design

1. An audit observer iterates over the segments and uses reservoir sampling to pick paths for each node.
2. Once we have iterated over metainfo, we put the segments from the reservoirs, in random order, into the audit queue.
3. Audit workers pick a segment from the queue.
4. The audit worker then picks a random stripe from the segment to audit.

Using reservoir sampling means we have an equal chance to pick a segment for every node. Since every segment also involves 79 other nodes, each audit covers other nodes as well. The chance of a node appearing in a segment pointer is proportional to the amount of data the node actually stores: the more data a node stores, the more likely it is to be audited. For unvetted and vetted nodes, we can use different reservoir sizes to ensure that unvetted nodes get audited faster.

By using a separate queue, we ensure that workers can run in separate processes, and we simplify the selection logic.
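The design steps above can be sketched as follows. This is a hypothetical model, not the satellite implementation: names like `build_queue` and `pick_stripe` are illustrative, and the per-node reservoirs are assumed to be already filled.

```python
import random

# Illustrative sketch of the observer -> queue -> worker flow:
# reservoirs (one per node) are flushed into a single audit queue in
# random order, then a worker takes a segment and picks a random stripe.

def build_queue(reservoirs):
    """Flatten all per-node reservoirs into one audit queue, randomly ordered."""
    segments = [seg for res in reservoirs.values() for seg in res]
    random.shuffle(segments)
    return segments

def pick_stripe(segment, stripe_count):
    """An audit worker picks one random stripe index from a segment."""
    return random.randrange(stripe_count)

# Example: two nodes whose reservoirs already hold sampled segments.
reservoirs = {
    "n000": ["seg-a", "seg-b"],
    "n001": ["seg-c"],
}
queue = build_queue(reservoirs)
assert sorted(queue) == ["seg-a", "seg-b", "seg-c"]
```

Note that a segment may appear in several nodes' reservoirs; deduplicating the queue is a possible refinement, since one audit already covers every node holding a piece of that segment.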
When we finish a new reservoir set, we override the previous queue rather than adding to it, since the new data is more up to date and there is no downside to clearing the queue. To make selection less predictable, we add the whole reservoir set to the queue in random node order, one segment at a time.

Audit workers audit as before:

1. Pick a segment from the queue.
2. Pick a random stripe.
3. Download all erasure shares.
4. Use the Berlekamp-Welch algorithm to verify their correctness.

This is a simplified version that doesn't describe [containment mode](audit-containment.md). The chance of selecting the same stripe twice is small, and doing so wouldn't cause any significant harm.

To estimate appropriate settings for reservoir sampling, we need to run a simulation.

### Selection via Reservoir Sampling

Reservoir sampling is a family of algorithms for randomly choosing a sample of items with uniform probability from a stream of data of unknown size.

The audit observer uses the metainfo loop to iterate through the metainfo, creating one reservoir per node. Reservoirs are filled with segments. To increase audits for unvetted nodes, we can create larger reservoirs for them. Reservoir sampling takes two configuration values: the reservoir size for unvetted nodes, and the reservoir size for vetted nodes.

E.g. if nodes `n000`, `n002`, and `n003` are vetted, they will have fewer reservoir slots than unvetted nodes `n001` and `n004`:

```
n000 + + + +
n001 + + + + + + +
n002 + + + +
n003 + + + +
n004 + + + + + + +
```

Unvetted nodes should get 25,000 pieces per month. On a good day, 1000 pieces will be added to an unvetted node, which should quickly fill the reservoir sample.

Algorithm:

+ We have a reservoir of `k` items and a `stream` of `n` items, where `n` is an unknown number.
+ Fill the reservoir from `[0...k-1]` with the first `k` items of the `stream`.
+ For every item in the `stream` at index `i=k..n-1`, pick a random number `j=rand(0..i)`, and if `j < k`, replace the reservoir's `j`-th item with the `stream`'s `i`-th item.
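A minimal sketch of the reservoir sampling algorithm outlined above (the classic Algorithm R). The function name and the use of Python's `random` module are illustrative; the real implementation would keep one such reservoir per node and feed it segments from the metainfo loop.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick j uniformly from [0, i]; the new item lands in the
            # reservoir with probability k/(i+1), preserving uniformity.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 7 segments from a stream of 1000.
sample = reservoir_sample(range(1000), 7)
assert len(sample) == 7
```

Because each incoming item displaces an existing one with probability `k/(i+1)`, every item in the stream ends up in the reservoir with equal probability `k/n`, without ever needing to know `n` in advance.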