# Storage Node Downtime Tracking
## Abstract
This document describes how a satellite tracks the offline time of storage nodes.
## Background
[Disqualification blueprint](disqualification.md) describes how storage nodes get disqualified based on the reputation scores described in [node selection blueprint](node-selection.md).
The current uptime-based disqualification disqualified several nodes without clear and fair evidence. Those disqualifications had to be reverted and the uptime-based disqualification disabled. Before we can handle disqualifications again, we need to track the offline status of nodes more reliably.
The [Kademlia removal blueprint](kademlia-removal.md) describes that each node will need to contact each satellite at least once an hour. The following design relies on this behavior.
This document does not describe how downtime affects reputation or how disqualification will work.
## Design
For tracking offline duration we need:
- A new SQL database table to store offline durations.
- A way to detect which nodes are offline.
- A way to estimate how long they have been offline.
An _uptime check_, as referenced in this section, is a connection initiated by the satellite to a storage node in the same way as described in the [network refreshing section of the Kademlia removal blueprint](kademlia-removal.md#network-refreshing).
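For illustration only, a minimal Go sketch of such a check, assuming the satellite merely needs to reach the node's last known address (`pingNode` is a hypothetical name; the real check goes through the satellite's transport layer and verifies the node identity):

```go
package downtime

import (
	"net"
	"time"
)

// pingNode is a hypothetical, simplified uptime check: it verifies only
// that a TCP connection can be established to the node's last known
// address within a timeout. The real check uses the satellite's
// transport layer and verifies the node identity.
func pingNode(address string) bool {
	conn, err := net.DialTimeout("tcp", address, 5*time.Second)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}
```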
__NOTE__: the SQL code in this section is illustrative; it explains the algorithm concisely and is not the final implementation.
### Database
The following new SQL database table will store the storage nodes' offline time.
```sql
CREATE TABLE nodes_offline_time (
    node_id BYTEA NOT NULL,
    tracked_at timestamp with time zone NOT NULL,
    seconds integer -- measured number of seconds being offline
)
```
The current Satellite database has the table `nodes`. For the offline time calculation we use the following columns:
- `last_contact_success`
- `last_contact_failure`
- `total_uptime_count`
- `uptime_success_count`
### Detecting offline nodes
Per the [Kademlia removal blueprint](https://github.com/storj/storj/blob/master/docs/design/kademlia-removal.md#network-refreshing), every storage node has to ping the satellite every hour. Storage nodes that have not pinged within the last hour need to be contacted directly.
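For context, the storage node side of this contract is a simple periodic contact loop. A minimal sketch, assuming a `pingSatellite` function (hypothetical name) that performs the actual request:

```go
package contact

import (
	"context"
	"time"
)

// contactLoop pings the satellite once per hour until the context is
// cancelled; failures are simply retried on the next tick.
func contactLoop(ctx context.Context, pingSatellite func(context.Context) error) {
	ticker := time.NewTicker(time.Hour)
	defer ticker.Stop()
	for {
		_ = pingSatellite(ctx) // the satellite follows up on missed pings itself
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return
		}
	}
}
```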
To find storage nodes that may have gone offline, a chore runs the following query:
```sql
SELECT id, last_contact_success
FROM nodes
WHERE
    last_contact_success < (now() - 1h) AND
    last_contact_success > last_contact_failure AND -- only select nodes that were last known to be online
    disqualified IS NULL
ORDER BY
    last_contact_success ASC
```
For each node, the satellite performs an _uptime check_; a combined sketch of both outcomes follows this list.
* On success, it updates the nodes table with the last contact information:
```sql
UPDATE nodes
SET
    last_contact_success = GREATEST(now(), last_contact_success),
    uptime_success_count = uptime_success_count + 1,
    total_uptime_count = total_uptime_count + 1
WHERE
    id = ?
```
* On failure, it calculates the number of offline seconds.
We know that storage nodes must contact the satellite every hour, hence we can estimate that the node has been offline for at least `now() - last_contact_success - 1h`.
```
num_seconds_offline = seconds(from: last_contact_success, to: now() - 1h)
```
```sql
INSERT INTO nodes_offline_time (node_id, tracked_at, seconds)
VALUES (<<id>>, now(), <<num_seconds_offline>>)
```
```sql
UPDATE nodes
SET
    last_contact_failure = now(),
    total_uptime_count = total_uptime_count + 1
WHERE
    id = ?
```
### Estimating offline time
Another independent chore has the following configurable parameters (see the sketch after this list):
- The interval between runs.
- The number of nodes to check per run.
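A minimal sketch of how these parameters might be expressed (the names are assumptions for illustration):

```go
package downtime

import "time"

// EstimationConfig is a hypothetical configuration for the
// estimating-offline-time chore.
type EstimationConfig struct {
	Interval  time.Duration // how long to sleep between runs
	NodeLimit int           // maximum number of offline nodes to check per run
}
```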
The chore loops over the nodes last known to be offline with the following query:
```sql
SELECT id, last_contact_failure
FROM nodes
WHERE
    last_contact_success < last_contact_failure AND -- only select nodes that were last known to be offline
    disqualified IS NULL
ORDER BY
    last_contact_failure ASC
LIMIT N
```
It checks the configured number of nodes, sleeps for the configured interval, and starts again.
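A minimal sketch of this outer loop, reusing the hypothetical `EstimationConfig` above (the logger and the `checkOfflineNodes` helper are also assumptions):

```go
package downtime

import (
	"context"
	"time"

	"go.uber.org/zap"
)

// EstimationChore is a hypothetical skeleton of the estimating-offline-time chore.
type EstimationChore struct {
	log    *zap.Logger
	config EstimationConfig
}

// Run checks a batch of offline nodes, sleeps for the configured
// interval, and starts again until the context is cancelled.
func (chore *EstimationChore) Run(ctx context.Context) error {
	ticker := time.NewTicker(chore.config.Interval)
	defer ticker.Stop()
	for {
		if err := chore.checkOfflineNodes(ctx, chore.config.NodeLimit); err != nil {
			chore.log.Error("estimating offline time failed", zap.Error(err))
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// checkOfflineNodes is assumed to run the SELECT above and perform an
// _uptime check_ on each returned node, as described next.
func (chore *EstimationChore) checkOfflineNodes(ctx context.Context, limit int) error {
	// ... query up to limit nodes, ping each, record offline seconds ...
	return nil
}
```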
For each node it performs an _uptime check_.
* On success, it updates the nodes table:
```sql
UPDATE nodes
SET
    last_contact_success = GREATEST(now(), last_contact_success),
    uptime_success_count = uptime_success_count + 1,
    total_uptime_count = total_uptime_count + 1
WHERE
    id = ?
```
* On failure, it calculates the number of offline seconds between the last contact failure and now (a worked example follows this section).
```
num_seconds_offline = seconds(from: last_contact_failure, to: now())
```
```sql
INSERT INTO nodes_offline_time (node_id, tracked_at, seconds)
VALUES (<<id>>, now(), <<num_seconds_offline>>)
```
```sql
UPDATE nodes
SET
    last_contact_failure = now(),
    total_uptime_count = total_uptime_count + 1
WHERE
    id = ?
```
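As a worked example of how the two chores compose, assume a node's `last_contact_success` is 10:00 and it then goes offline. If the detection chore fails to reach it at 11:30, it records `11:30 - 10:00 - 1h = 30` minutes and sets `last_contact_failure` to 11:30. If the estimation chore then fails to reach it at 12:00, it records another `12:00 - 11:30 = 30` minutes. The `nodes_offline_time` rows for the node sum to one hour, a conservative estimate that excludes the hour in which the node was still allowed to ping.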
## Rationale
The designed approach has the drawback that `last_contact_failure` of the `nodes` table may get updated by other satellite services before the _estimating offline time_ chore reads the last value and calculates the number of offline seconds.
The following diagram shows one of these scenarios:
![missing tracking offline seconds](images/storagenode-downtime-tracking-missing-offline-seconds.png)
One solution would be to restrict updates of `last_contact_failure` to this new service. The other satellite services would then have to report uptime failures to it, but this increases complexity and probably impacts the performance of those services due to the added indirection.
The services that update `last_contact_failure` choose storage nodes randomly, so we believe these corner cases are rare; losing some tracked offline seconds is an acceptable trade-off for a simpler solution.
Next, we present some alternative architectural solutions.
### Independent Process
Currently all chores and services run within a single process. Alternatively there could be an independent process for _offline downtime tracking_ as described in the [design section](#design).
The advantages are:
* It doesn't add a new application chore to the satellite.
* It's easier to scale.
And the disadvantages are:
* It requires exposing the data selected from the `nodes` table via a wire protocol. This adds more work and more latency, and it does not offload the current database<sup>1</sup>.
* It requires updating the deployment process.
The disadvantages outweigh the advantages, considering that:
* We want to start tracking storage nodes' offline time soon.
* It doesn't offload the database despite being split into a different service.
* This approach conflicts with the work on horizontally scaling the satellite and would require coordinating the tasks.
<sup>1</sup> We want to reduce calls to the current database.
### InfluxDB
The designed system uses a SQL database for storing the storage nodes downtime. Alternatively it could use [InfluxDB time-series database](https://www.influxdata.com/).
The advantages are:
* Data Science team is already using it for data analysis.
And the disadvantages are:
* It adds InfluxDB as a dependency for test and production deployments. Currently we only use it for metrics.
Data Science could use this approach to calculate statistics more conveniently; however, it would complicate the deployment.
## Implementation
1. Create a new chore implementing the logic in the [design section](#design).
   1. Create a migration to add the new database table.
   1. Implement the chore struct.
   1. Implement the [_detecting offline nodes_ part](#detecting-offline-nodes)<sup>1</sup>.
   1. Implement the [_estimating offline time_ part](#estimating-offline-time)<sup>1</sup>.

   <sup>1</sup> These subtasks can be done in parallel.
1. Wire the new chore to the `satellite.Core`.
1. Remove the implementation of the current uptime disqualification.
   - `satellite/satellitedb.Overlaycache.UpdateUptime`: remove updating the disqualified field due to low uptime reputation.
   - `satellite/satellitedb.Overlaycache.populateUpdateNodeStats`: remove updating the disqualified field due to low uptime reputation.
   - Remove the uptime reputation cut-off configuration field (`satellite/overlay.NodeSelectionConfig.UptimeReputationDQ`).
## Wrapup
* The team working on the implementation must archive this document once finished.
* The new package containing the chores' implementation must have a `doc.go` file describing what each chore does, as well as the corner case, described in the rationale section, of not tracking some offline time.
## Open issues
* The design needs to account for potential satellite or DNS outages to ensure that we do not unfairly disqualify nodes if the satellite cannot be contacted.
* The design indefinitely checks offline storage nodes until they are disqualified.
* The implementation requires coordination with the team working on the [Kademlia removal blueprint](kademlia-removal.md) for the "ping" functionality.
* The implementation requires the [Kademlia removal network refreshing](https://github.com/storj/storj/blob/master/docs/design/kademlia-removal.md#network-refreshing) to be implemented and deployed before deploying the new chore. Use a feature flag to remove this constraint (a sketch follows).
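For the last point, a minimal sketch of such a feature flag, assuming Storj's usual configuration struct tags (the field name is an assumption):

```go
package downtime

// Config is a hypothetical chore configuration carrying the feature flag.
type Config struct {
	Enabled bool `help:"whether the downtime tracking chores run" default:"false"`
}
```

`Run` would then return early while `Enabled` is false, so the chore can be wired in before the network refreshing is deployed.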