Commit Graph

128 Commits

Author SHA1 Message Date
Kaloyan Raev
c24ada7114 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum

Change-Id: Icf7c029e9d800e5f6a9fdd208c36f28e05468690
2021-01-20 17:35:57 +02:00
Cameron Ayer
d14607a5f7 satellite/{contact,nodestats,overlay,satellitedb}: remove references to total_uptime_count and uptime_success_count columns
Change-Id: I1f92022909bc564e9b1e31bf937fdfe7c16554de
2021-01-19 15:43:02 -05:00
Cameron Ayer
75d828200c private,satellite: add chore to dq stray nodes
Full scope:
private/testplanet,satellite/{overlay,satellitedb}

Description:
In most cases, downtime tracking with audits will eventually lead
to DQ for nodes who are unresponsive. However, if a stray node has no
pieces, it will not be audited and will thus never be disqualified.
This chore will check for nodes who have not successfully been contacted
in some set time and DQ them.

There are some new flags for toggling DQ of stray nodes and the timeframes
for running the chore and how long nodes can go without contact.

Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615
2021-01-19 14:21:56 -05:00
Kaloyan Raev
6dff40f5c5 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/metainfo/metainfo.go

Change-Id: Ib5c49f3c911c58319855a171f9ce73657da976d9
2021-01-14 14:33:59 +02:00
Egon Elbre
85fb964afe satellite/{metainfo,overlay}: improvements to GetObjectIPs
* Deduplicate NodeID list prior to fetching IPs.
* Use NodeSelectionCache for fetching reliable IPs.
* Return number of segements, reliable pieces and all pieces.

Change-Id: I13e679caab275488b4037624b840a4068dad9589
2021-01-14 09:12:45 +00:00
Cameron Ayer
0403e99a5b satellite/{overlay,satellitedb}: remove unused methods for old downtime tracking
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system which has been removed.
These methods should also be removed.

Change-Id: Idb829d742e1f987e095604423fff656fe581183e
2021-01-11 15:21:28 +00:00
Michał Niewrzał
ec88d21a3c Merge 'main' branch.
Change-Id: I6e8162d1a6caf75e89c9f9c9f9522730aebf83ae
2021-01-11 10:26:58 +01:00
Moby von Briesen
6e2ef3b9ee Revert "satellite/satellitedb: Do not consider nodes with offline_suspended as reputable."
This reverts commit e24262c2c9.

Change-Id: I287deb2e52d03bbd698ed055f0f216b0b5bf2798
2021-01-04 14:28:37 +00:00
Michał Niewrzał
ad3e3a38c5 Merge 'main' branch
Change-Id: Ia0db1b1f9ef3e0671d3f2208881b0abc3064e200
2021-01-04 12:13:45 +01:00
Moby von Briesen
edbee53888 satellite,storagenode: Pass audit history over GetStats endpoint
Full prefix: satellite/{overlay,nodestats},storagenode/{reputation,nodestats}

Allow the storagenode to receive its audit history data from the
satellite via the satellite's GetStats endpoint.

The storagenode does not save this data for use in the API yet.

Change-Id: I9488f4d7a4ccb4ccf8336b8e4aeb3e5beee54979
2020-12-30 19:13:26 +00:00
Moby von Briesen
825dc71227 satellite/{overlay, satellitedb}: Refactor audit history
* Separate audit history interface into its own file in the overlay
package
* Add overlay.AuditHistory struct so that internalpb.AuditHistory is
only used from within the database layer
* Add overlay.GetAuditHistory function for features that will require
access to detailed audit history information
* Do not return full audit history from UpdateAuditHistory - callers to
that function only need to know the online score and whether a full
tracking period has been completed
* Move audit history tests out of satellite/satellitedb, since they are
independent of database implementation

Change-Id: I35b0c4ac23bbaabd80624f8a9631c3cb1a1f33bd
2020-12-29 18:50:22 +00:00
Moby von Briesen
e24262c2c9 satellite/satellitedb: Do not consider nodes with offline_suspended as reputable.
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.

Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
2020-12-29 17:59:09 +00:00
Ethan Adams
6070018021
satellite/overlay: use AS OF SYSTEM TIME with Cockroach
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
2020-12-22 21:07:07 +02:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Michal Niewrzal
8d3ea9c251 satellite/repair/repairer: implement SegmentRepairer with metabase
Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c
2020-12-17 10:47:21 +00:00
Stefan Benten
494bd5db81
all: golangci-lint v1.33.0 fixes (#3985) 2020-12-05 17:01:42 +01:00
Moby von Briesen
3fc76f4ffe satellite/downtime: Remove deprecated downtime tracking service.
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.

This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.

Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
2020-12-02 15:16:13 -05:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Egon Elbre
11338e9beb satellite/internalpb: move audithistory.pb
Change-Id: I8eee84d49ed90459168ddaf04ae57f790c2a22c4
2020-10-30 15:30:11 +02:00
Egon Elbre
7ce372c686 satellite/internalpb: add inspectors
Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a
2020-10-30 13:28:17 +02:00
Cameron Ayer
bb7be23115 satellite/{audit,overlay,satellitedb}: enable reporting offline audits
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.

Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
2020-10-27 10:44:46 +00:00
Moby von Briesen
7c3afe164b satellite/overlay: uncomment dq for offline and disable with feature flag
Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27
2020-10-16 12:55:16 +00:00
Egon Elbre
2268cc1df3 all: fix linter complaints
Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95
2020-10-13 15:59:01 +03:00
Cameron Ayer
b39a99bae6 satellite/{overlay,satellitedb}: always show node's real online score
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.

However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.

Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.

Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
2020-10-02 12:28:11 -04:00
Jennifer Johnson
4e2413a99d satellite/satellitedb: uses vetted_at field to select for reputable nodes
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.

Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
2020-09-04 16:45:32 +00:00
Moby von Briesen
2d01dd9732 satellite/satellitedb: Add online_score column to nodes table
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.

Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
2020-08-31 15:07:07 +00:00
Moby von Briesen
60a95d0dc9 satellite/{satellitedb,overlay}: Enable offline suspension and review period
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.

In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.

Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.

Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
2020-08-28 16:35:48 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00
Moby von Briesen
5f0477ebe9 satellite/{overlay,satellitedb}: Create database functionality for updating audit history
Add a function to the overlay cache called UpdateAuditHistory, which
allows us to add online or offline audits to a particular node's audit
history, and get that node's "online score" for the configured tracking
period.

The next step will be to use UpdateAuditHistory from inside
BatchUpdateStats/UpdateStats, so that audit history is actually updated
when nodes get audited, and we can suspend nodes based on their online
score.

Change-Id: I2289105e6961e68e829a987ff756b0e576fab120
2020-08-20 17:34:27 +00:00
Qweder93
01bb2bd17d satellite/audit: verifier checks if node made sucess GE before auditing
Change-Id: Ia6cde4e9fcf11020a5301d38065f7159f276eb80
2020-08-17 23:37:57 +03:00
Egon Elbre
94a09ce20b all: add missing dots
Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a
2020-08-11 17:50:01 +03:00
Qweder93
4ee1b2d45a storagenode/console: added list of all audits per satellite to sno dashboard/satellites
Change-Id: I52e58748d6467f372d9a308347fc77e400d137e2
2020-08-10 12:55:07 +00:00
Moby von Briesen
e02adfe5e9 satellite/overlay/config.go: Add AuditHistoryConfig to overlay
Adds AuditHistory{WindowSize, TrackingPeriod, GracePeriod,
OfflineThreshold}. These values will be used to track offline audits over
time, and to suspend/disqualify nodes for being offline for too long.

Change-Id: I05f7dbc3c034bdc53c4fbd7719c71a44f37ec6a5
2020-08-04 18:18:56 +00:00
Moby von Briesen
5dfe27f175 satellite/{repair,overlay}: Use overlay NodeSelectionCache for repair uploads
This change removes the overlay function FindStorageNodesForRepair,
which skips using the node selection cache and hits the database
directly. Otherwise, it is functionally identical to
FindStorageNodesForUpload, which checks the node selection cache first.

When selecting nodes for PUT_REPAIRs, we now call
FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce
database load.

Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e
2020-08-04 12:50:12 -04:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
stefanbenten
257855b5de all: replace == comparison with errors.Is
Change-Id: I05d9a369c7c6f144b94a4c524e8aea18eb9cb714
2020-07-14 15:50:25 +00:00
Egon Elbre
5bdcd86fa7 ci: test benchmarks
This runs each benchmark for one iteration to ensure that they are
valid. Unfortunately, it does not give any useful metrics as output.

Change-Id: I68940398c8dd849aed656bd12656f48d5df10128
2020-07-10 13:26:49 +00:00
Cameron Ayer
3b4b5f45c7 satellite: replace references to Suspended with UnknownAuditSuspended
Change-Id: I3d2d00c95954c0546ad077702617895f262926ef
2020-06-23 14:19:22 +00:00
Egon Elbre
f68e7b3fde satellite/overlay: replace pb.InfoResponse
pb.InfoResponse wasn't used for protocol buffer communication, but
instead as a satellite type.

Change-Id: I755619f2deec5b76c4fe488591b7d8c1b9fcdafb
2020-06-16 15:16:55 +03:00
Cameron Ayer
bad299b541 satellite/satellitedb: serialize UpdateStats and BatchUpdateStats transactions
Since we increased the number of audit workers from 1 to 2, we need to make sure
concurrent updates do not trample each other. We can do this by serializing the
transactions.

Change-Id: If1b2f71cabe3c779c12ffa33c0c3271778ac3ae0
2020-06-10 17:11:28 +00:00
Egon Elbre
c5d4a13158 satellite/nodeselection: use NodeURL
Change-Id: I2ebd4dbf993ff5c7864f3a3a665b5c8fc48aa7d1
2020-05-27 05:46:11 +00:00
littleskunk
2fbb34c3ea
nodeselection: Increase minimum free space to 500MB (#3898) 2020-05-25 12:13:28 +02:00
Egon Elbre
bef84a5f9d storagenode: remove dependency to overlay.NodeDossier
This is the last dependency from storage node to satellite.

Change-Id: I12f7abb91e84f823ba5af126c6e2979519838612
2020-05-21 08:37:13 +03:00
Jennifer Johnson
03e5f922c3 satellite/overlay: updates node with a vetted_at timestamp if they meet the vetting criteria
What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count
are greater than the configured thresholds), we set vetted_at to the current timestamp.

Why: We may want to use this timestamp in future development to select new vs vetted nodes.
It also allows flexibility in node vetting experiments and allows for better metrics around
vetting times.

Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately
Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and
commits another variable to the db (vetted_at), but this should be negligible.

Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d
2020-05-20 16:30:26 -04:00
Egon Elbre
5d016425f1 satellite/{contact,downtime,overlay}: use NodeURL
Change-Id: I555a479a89e0ddbf0499898bdbc8574282cd6846
2020-05-20 11:09:05 +00:00
Egon Elbre
941d10cbc3 private/testplanet: remove Peer.Local()
Currently storagenode depends on overlay.NodeDossier, this is the first
step in removing it.

Change-Id: I034a3f1601835f8349bd41752455022e19bcc707
2020-05-20 11:05:34 +00:00
littleskunk
8ec64f3daf
satellite/overlay: enable node selection cache on all satellites (#3895) 2020-05-19 19:25:53 +02:00
Egon Elbre
16abf02b35 satellite/{nodeselection,overlay}: use the new package
Change-Id: I034fdbe578dec2e5c906aca82231cd3e56f26aeb
2020-05-18 21:38:43 +00:00
Jessica Grebenschikov
1ecbfc453f satellite/overlay: add random test for node selection from cache
Change-Id: I57b57db919b5afb3fc81176e078ac1e3a873565a
2020-05-18 16:59:52 +00:00
Egon Elbre
ec589a8289 all: fix comments about grpc
Change-Id: Id830fbe2d44f083c88765561b6c07c5689afe5bd
2020-05-11 13:05:34 +03:00