Commit Graph

18 Commits

Author SHA1 Message Date
Cameron Ayer
1a51049ac0 satellite/{overlay,satellitedb}: add flag to toggle suspending nodes for offline audits
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.

If the flag is set to true, nodes will be suspended if they meet the
requirements.

If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.

Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
2021-03-27 16:28:27 +00:00
Kaloyan Raev
c24ada7114 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum

Change-Id: Icf7c029e9d800e5f6a9fdd208c36f28e05468690
2021-01-20 17:35:57 +02:00
Cameron Ayer
d14607a5f7 satellite/{contact,nodestats,overlay,satellitedb}: remove references to total_uptime_count and uptime_success_count columns
Change-Id: I1f92022909bc564e9b1e31bf937fdfe7c16554de
2021-01-19 15:43:02 -05:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Michal Niewrzal
7dde184cb5 Merge 'master' branch
Change-Id: I6070089128a150a4dd501bbc62a1f8b394aa643e
2020-11-10 11:58:59 +00:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Cameron Ayer
bb7be23115 satellite/{audit,overlay,satellitedb}: enable reporting offline audits
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.

Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
2020-10-27 10:44:46 +00:00
Moby von Briesen
7c3afe164b satellite/overlay: uncomment dq for offline and disable with feature flag
Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27
2020-10-16 12:55:16 +00:00
Cameron Ayer
b39a99bae6 satellite/{overlay,satellitedb}: always show node's real online score
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.

However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.

Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.

Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
2020-10-02 12:28:11 -04:00
Moby von Briesen
2d01dd9732 satellite/satellitedb: Add online_score column to nodes table
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.

Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
2020-08-31 15:07:07 +00:00
Moby von Briesen
60a95d0dc9 satellite/{satellitedb,overlay}: Enable offline suspension and review period
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.

In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.

Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.

Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
2020-08-28 16:35:48 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
Cameron Ayer
3b4b5f45c7 satellite: replace references to Suspended with UnknownAuditSuspended
Change-Id: I3d2d00c95954c0546ad077702617895f262926ef
2020-06-23 14:19:22 +00:00
Moby von Briesen
8f60cfc4fb satellite/overlay: Add flag for enabling/disabling disqualification from suspension mode
Add a flag that allows us to easily switch disqualification from
suspension mode on or off. A node will only be disqualified from
suspension mode if it has been suspended for longer than the grace
period AND the SuspensionDQEnabled flag is true.

Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792
2020-05-04 17:25:09 +00:00
Moby von Briesen
720e26d235 satellite/satellitedb/overlaycache: update unknown alpha/beta values properly
Update unknown_audit_reputation_alpha and unknown_audit_reputation_beta.
Add test to verify that BatchUpdateStats properly modifies unknown audit
alpha/beta

Change-Id: I0d5f9cac96a99f64905cf575b772402db0756a9d
2020-04-23 10:40:53 -04:00
Moby von Briesen
72b93f3120 satellite/satellitedb: disqualify suspended nodes when the grace period passes
If a node is suspended and receives an unknown or failing audit,
disqualify them if the grace period (default 1w in production) has
passed.

Migrate the nodes table so any node that is currently suspended gets
unsuspended when the satellite starts up.

Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82
2020-04-22 15:45:00 -04:00
Moby von Briesen
8b72181a1f satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.

Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
2020-03-16 20:29:26 +00:00