[Disqualification blueprint](disqualification.md) describes how storage nodes get disqualified based on the reputation scores described in [node selection blueprint](node-selection.md).
The current uptime-based disqualification disqualified several nodes without clear and fair evidence. Those disqualifications had to be reverted and uptime-based disqualification disabled. Before we can resume handling disqualifications, we need to track the offline status of nodes more reliably.
[Kademlia removal blueprint](kademlia-removal.md) describes that each node will need to contact each satellite every hour. The following design relies on this.
An _uptime check_ referenced in this section is a connection initiated by the satellite to a storage node, in the same way as described in the [network refreshing section of the Kademlia removal blueprint](kademlia-removal.md#network-refreshing).
Per the [Kademlia removal blueprint](https://github.com/storj/storj/blob/master/docs/design/kademlia-removal.md#network-refreshing), every storage node has to ping the satellite every hour. Storage nodes that have not pinged need to be contacted directly.
* On an uptime check failure, the chore calculates the number of offline seconds.
We know that storage nodes must contact the satellite every hour, hence we can estimate that the node has been offline for at least `now - last_contact_success - 1h` (see the sketch after the SQL statements below).
```
num_seconds_offline = seconds(from: last_contact_success, to: now() - 1h)
```
```sql
INSERT INTO nodes_offline_time (node_id, tracked_time, seconds)
VALUES (<<id>>, now(), <<num_seconds_offline>>)
```
```sql
UPDATE nodes
SET
last_contact_failure = now(),
total_uptime_count = total_uptime_count + 1
WHERE
id = ?
```
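
Below is a minimal Go sketch of the failure branch of this chore. The type and helper names (`DetectionChore`, `NodeLastContact`, `CheckNodeOnline`) are hypothetical, and the database access uses plain `database/sql` for illustration; the actual satellite code will use its own overlay service and database layer.

```go
package downtime

import (
	"context"
	"database/sql"
	"time"
)

// NodeLastContact is a hypothetical row read from the nodes table for nodes
// that have not pinged the satellite within the last hour.
type NodeLastContact struct {
	ID                 string
	LastContactSuccess time.Time
}

// DetectionChore bundles the dependencies this sketch needs.
type DetectionChore struct {
	db      *sql.DB
	checker interface {
		// CheckNodeOnline performs the uptime check described above.
		CheckNodeOnline(ctx context.Context, nodeID string) (bool, error)
	}
}

// checkNode runs an uptime check and, on failure, records the estimated
// offline seconds and updates the nodes table as in the SQL statements above.
func (chore *DetectionChore) checkNode(ctx context.Context, node NodeLastContact) error {
	online, err := chore.checker.CheckNodeOnline(ctx, node.ID)
	if err != nil || online {
		return err
	}

	// Nodes must contact the satellite every hour, so the node has been
	// offline for at least (now - 1h) - last_contact_success.
	now := time.Now().UTC()
	offline := now.Add(-time.Hour).Sub(node.LastContactSuccess)
	if offline < 0 {
		offline = 0
	}

	if _, err := chore.db.ExecContext(ctx,
		`INSERT INTO nodes_offline_time (node_id, tracked_time, seconds) VALUES ($1, $2, $3)`,
		node.ID, now, int64(offline.Seconds())); err != nil {
		return err
	}

	_, err = chore.db.ExecContext(ctx,
		`UPDATE nodes SET last_contact_failure = $1, total_uptime_count = total_uptime_count + 1 WHERE id = $2`,
		now, node.ID)
	return err
}
```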
### Estimating offline time
Another independent chore has the following configurable parameters:
* On failure, it calculates the number of offline seconds between the last contact failure and now (see the sketch after the SQL statements below).
```
num_seconds_offline = seconds(from: last_contact_failure, to: now())
```
```sql
INSERT INTO nodes_offline_time (node_id, tracked_time, seconds)
VALUES (<<id>>, now(), <<num_seconds_offline>>)
```
```sql
UPDATE nodes
SET
last_contact_failure = now(),
total_uptime_count = total_uptime_count + 1
WHERE
id = ?
```
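
For this chore the only difference is the interval used: offline seconds accumulate from the previously recorded `last_contact_failure` up to now. A small, hypothetical Go helper illustrating just that computation:

```go
package downtime

import "time"

// estimateOfflineSeconds sketches the estimation chore's computation: on a
// failed uptime check it returns the seconds elapsed since the previously
// recorded failure. The result is inserted into nodes_offline_time and
// last_contact_failure is then reset to now, as in the SQL above.
func estimateOfflineSeconds(lastContactFailure, now time.Time) int64 {
	offline := now.Sub(lastContactFailure)
	if offline < 0 {
		offline = 0
	}
	return int64(offline.Seconds())
}
```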
## Rationale
The designed approach has the drawback that `last_contact_failure` of the `nodes` table may get updated by other satellite services before the _estimating offline time_ chore reads the last value and calculates the number of offline seconds.
The following diagram shows one of these scenarios:
A solution would be to restrict updates of `last_contact_failure` to this new service. The other satellite services would then have to report when they detect an uptime failure, but this increases complexity and probably impacts the performance of those services due to the introduced indirection.
The services that update `last_contact_failure` choose storage nodes randomly, hence we believe these corner cases are rare; losing some tracked offline seconds is acceptable in exchange for a simpler solution.
Next, we present some alternative architectural solutions.
### Independent Process
Currently all chores and services run within a single process. Alternatively there could be an independent process for _offline downtime tracking_ as described in the [design section](#design).
The advantages are:
* It doesn't add a new application chore to the satellite.
* It's easier to scale.
And the disadvantages are:
* It requires exposing the data selected from the nodes table via a wire protocol. This adds more work and more latency, and it does not offload the current database<sup>1</sup>.
* It requires updating the deployment process.
The disadvantages outweigh the advantages, considering that:
* We want to start tracking storage nodes' offline time.
* It doesn't offload the database despite being split into a separate service.
* This approach conflicts with the work on horizontally scaling the satellite and would require coordinating the tasks.
<sup>1</sup> We want to reduce calls to the current database.
### InfluxDB
The designed system uses a SQL database for storing the storage nodes downtime. Alternatively it could use [InfluxDB time-series database](https://www.influxdata.com/).
The advantages are:
* The Data Science team is already using it for data analysis.
And the disadvantages are:
* It requires InfluxDB in deployments for both testing and production. Currently we only use it for metrics.
Data Science could use this approach to calculate statistics more conveniently; however, it would complicate the deployment.
## Implementation
1. Create a new chore implementing the logic in the [design section](#design).
1. Create a migration to add the new database table (a sketch of a possible schema follows this list).
* The team working on the implementation must archive this document once finished.
* The new package containing the chores' implementation must have a `doc.go` file describing what each chore does and the corner case, described in the rationale section, where some offline time is not tracked.
* The design needs to account for potential satellite or DNS outages to ensure that we do not unfairly disqualify nodes if the satellite cannot be contacted.
* The implementation requires the [Kademlia removal network refreshing](https://github.com/storj/storj/blob/master/docs/design/kademlia-removal.md#network-refreshing) implemented and deployed before deploying the new chore. Use a feature flag for removing the constraint.
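
As a reference for the migration step above, here is a possible shape of the new table, derived from the `INSERT` statements in the design section. The column types and index are assumptions; the actual migration should follow the satellite's existing migration tooling.

```go
package downtime

import (
	"context"
	"database/sql"
)

// createOfflineTimeTable sketches the migration that adds the new table.
// Column types are assumptions inferred from the INSERT statements above.
func createOfflineTimeTable(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx, `
		CREATE TABLE IF NOT EXISTS nodes_offline_time (
			node_id      BYTEA                    NOT NULL,
			tracked_time TIMESTAMP WITH TIME ZONE NOT NULL,
			seconds      INTEGER                  NOT NULL
		)`)
	if err != nil {
		return err
	}

	// An index on (node_id, tracked_time) makes per-node downtime queries cheap.
	_, err = db.ExecContext(ctx, `
		CREATE INDEX IF NOT EXISTS nodes_offline_time_node_id_index
			ON nodes_offline_time ( node_id, tracked_time )`)
	return err
}
```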