Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.
As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.
This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.
In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.
Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
Improve UpdateCheckIn on a contended row:
name old time/op new time/op delta
UpdateCheckInContended-100x-32 2.29s ±55% 0.17s ±61% -92.45% (p=0.008 n=5+5)
Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c
Currently we were duplicating code for AS OF SYSTEM TIME in several
places. This replaces the code with using a method on
dbutil.Implementation.
As a consequence it's more useful to use a shorter name for
implementation - 'impl' should be sufficiently clear in the context.
Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish
between the two modes is useful and slightly shorter without causing
confusion.
Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d
Currently nodeselection package only contained state for uploads, move
these to a subpackage, such that we can make another "downloadselection"
for downloads. Then move selection logic from overlay to nodeselection.
Change-Id: I0fc42bcae3a29db2728dae9f3863b1e95bf5165b
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
When nodes check in for the very first time, if the satellite can't ping
them back, they are inserted into the nodes table with
last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes
chore runs before the node can fix their problem, they are DQd.
Solution: when DQing stray nodes, dont DQ where last_contact_success =
'0001-01-01 00:00:00+00'::timestamptz
Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
instead of only generating invoices for nodes that had some
activity, we generate it for every node so that we can find
and pay terminal nodes that did not meet thresholds before
we recognized them as terminal.
Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.
If the flag is set to true, nodes will be suspended if they meet the
requirements.
If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.
Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
Previously we would select a limited number of nodes for DQ in a
CTE and run the update on that set in a single transaction. This
could lead to locking on the table, so instead we select and update
in separate transactions.
Change-Id: I1e802c0845e829eeadcee4fa382f58462515fdb1
We would like to log Node IDs and last contact successes of nodes DQd
in this manner. We would also like to avoid returning an unbounded list
of items from the db. Therefore we change the query to select a limited
number of nodes that meet the DQ conditions and iterate until 0 rows are
returned. Each column of the query is already indexed.
Change-Id: Iaec2d9b56e7202b7c2028ba21750d40c8dd506ee
Testing interfaces is slightly clearer when it's in the package needing
the database rather than each individual implementation.
Change-Id: I10334c214a205f7e510b939b4359a2214c4e060a
Full scope:
private/testplanet,satellite/{overlay,satellitedb}
Description:
In most cases, downtime tracking with audits will eventually lead
to DQ for nodes who are unresponsive. However, if a stray node has no
pieces, it will not be audited and will thus never be disqualified.
This chore will check for nodes who have not successfully been contacted
in some set time and DQ them.
There are some new flags for toggling DQ of stray nodes and the timeframes
for running the chore and how long nodes can go without contact.
Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615
* Deduplicate NodeID list prior to fetching IPs.
* Use NodeSelectionCache for fetching reliable IPs.
* Return number of segements, reliable pieces and all pieces.
Change-Id: I13e679caab275488b4037624b840a4068dad9589
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system which has been removed.
These methods should also be removed.
Change-Id: Idb829d742e1f987e095604423fff656fe581183e
Full prefix: satellite/{overlay,nodestats},storagenode/{reputation,nodestats}
Allow the storagenode to receive its audit history data from the
satellite via the satellite's GetStats endpoint.
The storagenode does not save this data for use in the API yet.
Change-Id: I9488f4d7a4ccb4ccf8336b8e4aeb3e5beee54979
* Separate audit history interface into its own file in the overlay
package
* Add overlay.AuditHistory struct so that internalpb.AuditHistory is
only used from within the database layer
* Add overlay.GetAuditHistory function for features that will require
access to detailed audit history information
* Do not return full audit history from UpdateAuditHistory - callers to
that function only need to know the online score and whether a full
tracking period has been completed
* Move audit history tests out of satellite/satellitedb, since they are
independent of database implementation
Change-Id: I35b0c4ac23bbaabd80624f8a9631c3cb1a1f33bd
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.
Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.
This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.
Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.
Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.
However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.
Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.
Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.
Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.
Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.
In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.
Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.
Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
Add a function to the overlay cache called UpdateAuditHistory, which
allows us to add online or offline audits to a particular node's audit
history, and get that node's "online score" for the configured tracking
period.
The next step will be to use UpdateAuditHistory from inside
BatchUpdateStats/UpdateStats, so that audit history is actually updated
when nodes get audited, and we can suspend nodes based on their online
score.
Change-Id: I2289105e6961e68e829a987ff756b0e576fab120
Adds AuditHistory{WindowSize, TrackingPeriod, GracePeriod,
OfflineThreshold}. These values will be used to track offline audits over
time, and to suspend/disqualify nodes for being offline for too long.
Change-Id: I05f7dbc3c034bdc53c4fbd7719c71a44f37ec6a5
This change removes the overlay function FindStorageNodesForRepair,
which skips using the node selection cache and hits the database
directly. Otherwise, it is functionally identical to
FindStorageNodesForUpload, which checks the node selection cache first.
When selecting nodes for PUT_REPAIRs, we now call
FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce
database load.
Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e
This runs each benchmark for one iteration to ensure that they are
valid. Unfortunately, it does not give any useful metrics as output.
Change-Id: I68940398c8dd849aed656bd12656f48d5df10128
Since we increased the number of audit workers from 1 to 2, we need to make sure
concurrent updates do not trample each other. We can do this by serializing the
transactions.
Change-Id: If1b2f71cabe3c779c12ffa33c0c3271778ac3ae0
What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count
are greater than the configured thresholds), we set vetted_at to the current timestamp.
Why: We may want to use this timestamp in future development to select new vs vetted nodes.
It also allows flexibility in node vetting experiments and allows for better metrics around
vetting times.
Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately
Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and
commits another variable to the db (vetted_at), but this should be negligible.
Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d
GetNodes returned references to nodes in the immutable state, however
some parts of code expect them to be modified.
Change-Id: I5be1866f95e0dbe062a6b6be60e29f2365c35faa
Also distinguish the purpose for selecting nodes to avoid potential
confusion, what should allow caching and what shouldn't.
Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61
Add a flag that allows us to easily switch disqualification from
suspension mode on or off. A node will only be disqualified from
suspension mode if it has been suspended for longer than the grace
period AND the SuspensionDQEnabled flag is true.
Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792
The UpdateAddress method use to be used when storage node's checked in with the Satellite, but once the contact service was created this method was no longer used. This PR finally removes it.
Change-Id: Ib3f83c8003269671d97d54f21ee69665fa663f24
time.Now() in a short amount of time can return the exact same value.
Ensure that the test uses times that are distinct.
Change-Id: Ia653ce0af4bfcf7b5da133a9cf98b823033d9592
Sometimes nodes who have gracefully exited will still be holding pieces
according to the satellite. This has some unintended side effects
currently, such as nodes getting disqualified after having successfully
exited.
* When the audit reporter attempts to update node stats, do not update
stats (alpha, beta, suspension, disqualification) if the node has
finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated)
* Treat gracefully exited nodes as "not reputable" so that the repairer
and checker do not count them as healthy (overlay/statdb_test.go
TestKnownUnreliableOrOffline, repair/repair_test.go
TestRepairGracefullyExited)
Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272
Update unknown_audit_reputation_alpha and unknown_audit_reputation_beta.
Add test to verify that BatchUpdateStats properly modifies unknown audit
alpha/beta
Change-Id: I0d5f9cac96a99f64905cf575b772402db0756a9d
If a node is suspended and receives an unknown or failing audit,
disqualify them if the grace period (default 1w in production) has
passed.
Migrate the nodes table so any node that is currently suspended gets
unsuspended when the satellite starts up.
Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82
Alpha=1 and beta=0 are the expected first values for any alpha/beta
reputation system we are using in the codebase. So we are removing the
configurability of these values.
Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3
Adds a test to make sure that the correct amount of total nodes,
reputable nodes, and new nodes are returned by SelectStorageNodes
in different cases.
Change-Id: I0939159600afde8a46c35735f1edf0576fcdb4cd