176 Commits

Author SHA1 Message Date
Cameron Ayer
53322bb0a7 satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats
Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.

Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
2021-06-01 21:02:44 +00:00
Egon Elbre
cdcc67207c satellite/satellitedb: fix nil panic in UpdateCheckIn
Change-Id: If6ae2c3d9b7c269b0a9d652e68854091f668b5ec
2021-05-25 00:30:36 +03:00
Egon Elbre
8f15f975a2 satellite/overlay: improve contended update checkin
Improve UpdateCheckIn on a contended row:

  name                             old time/op  new time/op delta
  UpdateCheckInContended-100x-32   2.29s ±55%   0.17s ±61%  -92.45%  (p=0.008 n=5+5)

Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c
2021-05-16 20:41:12 +03:00
Egon Elbre
0858c3797a satellite/{metabase,satellitedb}: deduplicate AS OF SYSTEM TIME code
Until now we were duplicating code for AS OF SYSTEM TIME in several
places. This replaces the code with using a method on
dbutil.Implementation.

As a consequence it's more useful to use a shorter name for
implementation - 'impl' should be sufficiently clear in the context.

Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish
between the two modes is useful and slightly shorter without causing
confusion.

Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d
2021-05-11 12:40:36 +03:00
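A minimal sketch of the idea in commit 0858c3797a, assuming a simplified stand-in for dbutil.Implementation. The method names AsOfSystemInterval and AsOfSystemTime come from the commit message; the signatures, the constants, and the exact clause formatting are assumptions made for illustration, not the code in storj.io/private.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Implementation identifies the database backend. It is a hypothetical
// stand-in for dbutil.Implementation.
type Implementation int

const (
	Postgres Implementation = iota
	Cockroach
)

// AsOfSystemTime returns a clause pinned to an absolute timestamp, or an
// empty string when the backend is not Cockroach or t is the zero time.
// Assumption: the timestamp is passed as quoted nanoseconds since the epoch.
func (impl Implementation) AsOfSystemTime(t time.Time) string {
	if impl != Cockroach || t.IsZero() {
		return ""
	}
	return " AS OF SYSTEM TIME '" + strconv.FormatInt(t.UnixNano(), 10) + "' "
}

// AsOfSystemInterval returns a clause for a relative (negative) interval,
// or an empty string when the backend is not Cockroach or the interval is
// not negative.
func (impl Implementation) AsOfSystemInterval(interval time.Duration) string {
	if impl != Cockroach || interval >= 0 {
		return ""
	}
	return fmt.Sprintf(" AS OF SYSTEM TIME '%s' ", interval)
}

func main() {
	// Callers build the query once; the clause disappears on Postgres.
	for _, impl := range []Implementation{Postgres, Cockroach} {
		fmt.Println("SELECT id FROM nodes" + impl.AsOfSystemInterval(-10*time.Second))
	}
}
```

Keeping the backend check inside the method means call sites no longer need their own "is this Cockroach?" branches, which is the duplication the commit describes removing.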
Cameron Ayer
bb343d9028 satellite/satellitedb: don't remove offline nodes from containment
When audits are being recorded, we automatically add some SQL to remove
the node from the pending audits table in case it exists. They are
removed from pending audits even if the node was offline for the audit.
This is not the correct behavior.

Add statement to record audit results in reverify tests to ensure no
more false positives.

Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17
2021-05-03 16:05:55 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
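The stutter described in commit 961e841bd7 can be reproduced with a small sketch. The class type below is a hypothetical stand-in that only prefixes a name onto a wrapped error; it is not the errs.Class implementation, and wrapAll is a helper invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// class is a minimal, hypothetical error class: it prefixes wrapped errors
// with its name.
type class string

// Wrap prefixes err with the class name, or returns nil for a nil error.
func (c class) Wrap(err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%s: %w", string(c), err)
}

// errNoRows plays the role of the innermost database error.
var errNoRows = errors.New("no rows")

// wrapAll wraps err in each class, with the first class ending up outermost.
func wrapAll(err error, classes ...class) error {
	for i := len(classes) - 1; i >= 0; i-- {
		err = classes[i].Wrap(err)
	}
	return err
}

func main() {
	stutter := wrapAll(errNoRows, "node stats service error", "satellitedbs error", "node stats database error")
	concise := wrapAll(errNoRows, "nodestats service", "satellitedbs", "nodestatsdb")

	// ERROR node stats service error: satellitedbs error: node stats database error: no rows
	fmt.Println("ERROR", stutter)
	// ERROR nodestats service: satellitedbs: nodestatsdb: no rows
	fmt.Println("ERROR", concise)
}
```

Because each layer adds its own name, any word repeated in every class name is repeated once per layer in the final log line.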
Egon Elbre
a2e20c93ae private/dbutil: use dbutil and tagsql from storj.io/private
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.

Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
2021-04-23 14:36:52 +03:00
Cameron Ayer
a0c5da6643 satellite/satellitedb: in stray nodes DQ, don't DQ nodes where last_contact_success = '0001-01-01 00:00:00+00'
When nodes check in for the very first time, if the satellite can't ping
them back, they are inserted into the nodes table with
last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes
chore runs before the node can fix their problem, they are DQd.

Solution: when DQing stray nodes, don't DQ where last_contact_success =
'0001-01-01 00:00:00+00'::timestamptz

Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8
2021-04-22 10:13:13 -04:00
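The rule from commit a0c5da6643 can be sketched in Go, where the zero time.Time value is the same instant as '0001-01-01 00:00:00+00'. The node type, the shouldDQStray function, and the example dates are hypothetical; the real check lives in SQL.

```go
package main

import (
	"fmt"
	"time"
)

// node holds only the fields this sketch needs (hypothetical type).
type node struct {
	ID                 string
	LastContactSuccess time.Time
}

// Example values used by main; both are made up for illustration.
var (
	exampleNow    = time.Date(2021, 4, 22, 0, 0, 0, 0, time.UTC)
	exampleCutoff = exampleNow.Add(-30 * 24 * time.Hour)
)

// shouldDQStray reports whether a node counts as stray: it must have been
// contacted successfully at least once, and not since cutoff. A zero
// LastContactSuccess means the satellite has never reached the node, so the
// node is skipped instead of disqualified.
func shouldDQStray(n node, cutoff time.Time) bool {
	if n.LastContactSuccess.IsZero() {
		return false
	}
	return n.LastContactSuccess.Before(cutoff)
}

func main() {
	nodes := []node{
		{ID: "never-contacted"},
		{ID: "long-gone", LastContactSuccess: exampleCutoff.Add(-time.Hour)},
		{ID: "recent", LastContactSuccess: exampleNow},
	}
	for _, n := range nodes {
		fmt.Printf("%s: dq=%v\n", n.ID, shouldDQStray(n, exampleCutoff))
	}
}
```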
Jeff Wendling
a65aecfd98 compensation: always generate invoices for every node
instead of only generating invoices for nodes that had some
activity, we generate it for every node so that we can find
and pay terminal nodes that did not meet thresholds before
we recognized them as terminal.

Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44
2021-03-29 14:15:45 +00:00
Cameron Ayer
05f8d2d0b1 satellite/satellitedb: filter offline suspended nodes from selection
Change-Id: I5a6f413453332238d579a7bf50eb30e9156f96c2
2021-03-27 23:36:46 +00:00
Cameron Ayer
1a51049ac0 satellite/{overlay,satellitedb}: add flag to toggle suspending nodes for offline audits
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.

If the flag is set to true, nodes will be suspended if they meet the
requirements.

If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.

Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
2021-03-27 16:28:27 +00:00
Cameron Ayer
eb44dc21b4 satellite/satellitedb: select stray nodes for DQ in separate tx from update
Previously we would select a limited number of nodes for DQ in a
CTE and run the update on that set in a single transaction. This
could lead to locking on the table, so instead we select and update
in separate transactions.

Change-Id: I1e802c0845e829eeadcee4fa382f58462515fdb1
2021-03-27 00:00:23 +00:00
Cameron Ayer
2607b16070 satellite/{overlay/straynodes,satellitedb}: rework DQNodesLastSeenBefore to return DQd node IDs and last contact successes
We would like to log Node IDs and last contact successes of nodes DQd
in this manner. We would also like to avoid returning an unbounded list
of items from the db. Therefore we change the query to select a limited
number of nodes that meet the DQ conditions and iterate until 0 rows are
returned. Each column of the query is already indexed.

Change-Id: Iaec2d9b56e7202b7c2028ba21750d40c8dd506ee
2021-03-22 13:01:30 -04:00
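The bounded iteration described in commit 2607b16070 can be sketched with an in-memory slice standing in for the nodes table. The types and function names are hypothetical, and dqBatch replaces what would be a limited query against the database.

```go
package main

import "fmt"

// strayNode is a hypothetical row that already meets the DQ conditions.
type strayNode struct {
	ID           string
	Disqualified bool
}

// dqBatch disqualifies at most limit nodes that are not yet disqualified
// and returns their IDs. limit is expected to be positive.
func dqBatch(nodes []strayNode, limit int) []string {
	var ids []string
	for i := range nodes {
		if len(ids) == limit {
			break
		}
		if !nodes[i].Disqualified {
			nodes[i].Disqualified = true
			ids = append(ids, nodes[i].ID)
		}
	}
	return ids
}

// dqAll calls dqBatch until a call returns zero rows, so no single call
// ever returns an unbounded list. It reports the total number of nodes
// disqualified and the number of non-empty batches.
func dqAll(nodes []strayNode, limit int) (total, batches int) {
	for {
		ids := dqBatch(nodes, limit)
		if len(ids) == 0 {
			return total, batches
		}
		fmt.Println("disqualified:", ids) // stand-in for logging the IDs
		total += len(ids)
		batches++
	}
}

func main() {
	nodes := []strayNode{{ID: "a"}, {ID: "b"}, {ID: "c"}, {ID: "d"}, {ID: "e"}}
	total, batches := dqAll(nodes, 2)
	fmt.Println(total, batches)
}
```

The loop terminates because every non-empty batch shrinks the set of rows that still match the DQ conditions.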
Cameron Ayer
aeac6264cd satellite/satellitedb: add metric stray_nodes_dq_count
Add metric so we can see how many nodes are DQd due to
this.

Change-Id: Ie4bdd1375fb9bd948af14fed9a2962b783b6a526
2021-03-01 21:06:36 +00:00
Cameron Ayer
549033f2e6 satellite/satellitedb: don't include DQd and exited nodes in DQStrayNodes
Don't update DQ time of already DQd nodes. Don't DQ nodes who exited.

Change-Id: I4528a9ba9f8e278987165ad337a9b34dadb9788b
2021-02-19 15:12:30 -05:00
JT Olio
b2ed7edd30 cmd/satellite: restore-trash parallel workers
Change-Id: Ic7466b21c20bda334e7ba4268a494e96b6528ac1
2021-02-18 19:11:19 +02:00
JT Olio
3ae3389ddc cmd/satellite: restore-trash command
Change-Id: I80fc932c12147692d49cde277784871ac611fcad
2021-02-18 09:19:22 -07:00
Yaroslav Vorobiov
966535e9de {storagenode,satellite}/nodeoperator: add wallet features
Change-Id: Iac7eb40a52b8fddcc573aebaad2e3a30a10cded9
2021-02-08 22:09:45 +02:00
Cameron Ayer
a17934cb51 satellite/satellitedb: remove reference to uptime counts
Change-Id: I26ac540b720a8ba5d6ca44526900228352dcaf4e
2021-02-02 14:51:27 -05:00
Egon Elbre
54e01d37f9 satellite/overlay: add DownloadSelectionCache
Change-Id: Ic0779280172325f8d03f55a2e9673722f72bdd44
2021-01-29 16:47:06 +02:00
Cameron Ayer
d14607a5f7 satellite/{contact,nodestats,overlay,satellitedb}: remove references to total_uptime_count and uptime_success_count columns
Change-Id: I1f92022909bc564e9b1e31bf937fdfe7c16554de
2021-01-19 15:43:02 -05:00
Cameron Ayer
75d828200c private,satellite: add chore to dq stray nodes
Full scope:
private/testplanet,satellite/{overlay,satellitedb}

Description:
In most cases, downtime tracking with audits will eventually lead
to DQ for nodes who are unresponsive. However, if a stray node has no
pieces, it will not be audited and will thus never be disqualified.
This chore will check for nodes who have not successfully been contacted
in some set time and DQ them.

There are some new flags for toggling DQ of stray nodes and the timeframes
for running the chore and how long nodes can go without contact.

Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615
2021-01-19 14:21:56 -05:00
Cameron Ayer
0403e99a5b satellite/{overlay,satellitedb}: remove unused methods for old downtime tracking
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system which has been removed.
These methods should also be removed.

Change-Id: Idb829d742e1f987e095604423fff656fe581183e
2021-01-11 15:21:28 +00:00
Moby von Briesen
6e2ef3b9ee Revert "satellite/satellitedb: Do not consider nodes with offline_suspended as reputable."
This reverts commit e24262c2c9.

Change-Id: I287deb2e52d03bbd698ed055f0f216b0b5bf2798
2021-01-04 14:28:37 +00:00
Moby von Briesen
825dc71227 satellite/{overlay, satellitedb}: Refactor audit history
* Separate audit history interface into its own file in the overlay
package
* Add overlay.AuditHistory struct so that internalpb.AuditHistory is
only used from within the database layer
* Add overlay.GetAuditHistory function for features that will require
access to detailed audit history information
* Do not return full audit history from UpdateAuditHistory - callers to
that function only need to know the online score and whether a full
tracking period has been completed
* Move audit history tests out of satellite/satellitedb, since they are
independent of database implementation

Change-Id: I35b0c4ac23bbaabd80624f8a9631c3cb1a1f33bd
2020-12-29 18:50:22 +00:00
Moby von Briesen
e24262c2c9 satellite/satellitedb: Do not consider nodes with offline_suspended as reputable.
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.

Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
2020-12-29 17:59:09 +00:00
Ethan Adams
6070018021 satellite/overlay: use AS OF SYSTEM TIME with Cockroach
Query the nodes table using AS OF SYSTEM TIME '-10s' (by default) when
on CRDB to alleviate contention on the nodes table and minimize CRDB
retries. Queries for standard uploads are already cached, and node
lookups for graceful exit uploads have retry logic, so the nodes
returned do not need to be current.
2020-12-22 21:07:07 +02:00
Ethan
5dc013d3bd satellite/overlay: Add retry to all selects in overlaycache
Change-Id: I0356d71a35701f8e0ca04a34b2bb2aea666c1394
2020-11-29 16:46:57 -05:00
JT Olio
0ba516d405 satellite: support pointing db components at different databases
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.

Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
2020-11-28 18:39:16 +00:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some tests which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
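A short sketch of why IsUp became redundant in commit dc67ce74c9. AuditOutcome and AuditOffline are named in the commit messages; the other constant names, the trimmed-down UpdateRequest, and the online method are assumptions for illustration.

```go
package main

import "fmt"

// AuditOutcome describes the result of one audit. Only AuditOffline is
// named in the commit message; the other names are assumed.
type AuditOutcome int

const (
	AuditSuccess AuditOutcome = iota
	AuditFailure
	AuditUnknown
	AuditOffline
)

// UpdateRequest is a hypothetical, trimmed-down request without IsUp.
type UpdateRequest struct {
	NodeID       string
	AuditOutcome AuditOutcome
}

// online derives the old IsUp value from the outcome: every outcome other
// than AuditOffline implies the node answered.
func (r UpdateRequest) online() bool {
	return r.AuditOutcome != AuditOffline
}

func main() {
	for _, o := range []AuditOutcome{AuditSuccess, AuditFailure, AuditUnknown, AuditOffline} {
		fmt.Println(o, UpdateRequest{NodeID: "node", AuditOutcome: o}.online())
	}
}
```

With a single field there is no way to express the contradictory state the commit mentions, such as a successful audit on a node marked offline.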
Egon Elbre
11338e9beb satellite/internalpb: move audithistory.pb
Change-Id: I8eee84d49ed90459168ddaf04ae57f790c2a22c4
2020-10-30 15:30:11 +02:00
Cameron Ayer
bb7be23115 satellite/{audit,overlay,satellitedb}: enable reporting offline audits
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.

Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
2020-10-27 10:44:46 +00:00
Moby von Briesen
7c3afe164b satellite/overlay: uncomment dq for offline and disable with feature flag
Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27
2020-10-16 12:55:16 +00:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
Cameron Ayer
b39a99bae6 satellite/{overlay,satellitedb}: always show node's real online score
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.

However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.

Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.

Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
2020-10-02 12:28:11 -04:00
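The change in commit b39a99bae6 separates two questions: what the node's score is, and whether there is enough data to act on it. The sketch below assumes a simple average of per-window online ratios; the window type, the scoring formula, and the function names are hypothetical, and only the 0.6 threshold used in main is taken from elsewhere in this log.

```go
package main

import "fmt"

// window counts audits in one tracking window (hypothetical type).
type window struct {
	Online int // audits where the node was online
	Total  int // all audits in the window
}

// onlineScore returns the real score over all windows that contain data,
// plus a flag saying whether enough windows exist for the node to be
// eligible for suspension or DQ. With no data at all the score is 1 and
// the node is not eligible.
func onlineScore(windows []window, requiredWindows int) (score float64, eligible bool) {
	var sum float64
	var counted int
	for _, w := range windows {
		if w.Total == 0 {
			continue
		}
		sum += float64(w.Online) / float64(w.Total)
		counted++
	}
	if counted == 0 {
		return 1, false
	}
	return sum / float64(counted), counted >= requiredWindows
}

// shouldSuspend acts on the score only when the node is eligible.
func shouldSuspend(windows []window, requiredWindows int, threshold float64) bool {
	score, eligible := onlineScore(windows, requiredWindows)
	return eligible && score < threshold
}

func main() {
	windows := []window{{Online: 1, Total: 2}, {Online: 0, Total: 2}}
	score, eligible := onlineScore(windows, 3)
	fmt.Println(score, eligible, shouldSuspend(windows, 3, 0.6))
}
```

The operator sees the low score immediately, while the eligibility flag still prevents suspension before a full tracking period has been observed.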
Jennifer Johnson
4e2413a99d satellite/satellitedb: uses vetted_at field to select for reputable nodes
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.

Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
2020-09-04 16:45:32 +00:00
Moby von Briesen
2d01dd9732 satellite/satellitedb: Add online_score column to nodes table
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.

Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
2020-08-31 15:07:07 +00:00
Moby von Briesen
60a95d0dc9 satellite/{satellitedb,overlay}: Enable offline suspension and review period
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.

In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.

Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.

Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
2020-08-28 16:35:48 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
stefanbenten
257855b5de all: replace == comparison with errors.Is
Change-Id: I05d9a369c7c6f144b94a4c524e8aea18eb9cb714
2020-07-14 15:50:25 +00:00
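The motivation for commit 257855b5de can be shown with the standard library alone: once an error is wrapped, == no longer matches the sentinel, while errors.Is still does. The names errNotFound, lookup, and isNotFound are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a sentinel error (hypothetical).
var errNotFound = errors.New("not found")

// lookup returns the sentinel wrapped with extra context.
func lookup() error {
	return fmt.Errorf("lookup node: %w", errNotFound)
}

// isNotFound checks the whole wrap chain instead of only the outer error.
func isNotFound(err error) bool {
	return errors.Is(err, errNotFound)
}

func main() {
	err := lookup()
	fmt.Println(err == errNotFound) // false: the outer error is a wrapper
	fmt.Println(isNotFound(err))    // true: errors.Is unwraps the chain
}
```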
paul cannon
bbdb351e5e all: use jackc/pgx in place of lib/pq
What:

Use the github.com/jackc/pgx postgresql driver in place of
github.com/lib/pq.

Why:

github.com/lib/pq has some problems with error handling and context
cancellations (i.e. it might even issue queries or DML statements more
than once! see https://github.com/lib/pq/issues/939). The
github.com/jackc/pgx library appears not to have these problems, and
also appears to be better engineered and implemented (in particular, it
doesn't use "exceptions by panic"). It should also give us some
performance improvements in some cases, and even more so if we can use
it directly instead of going through the database/sql layer.

Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044
2020-07-13 15:54:41 +00:00
Cameron Ayer
3b4b5f45c7 satellite: replace references to Suspended with UnknownAuditSuspended
Change-Id: I3d2d00c95954c0546ad077702617895f262926ef
2020-06-23 14:19:22 +00:00
Egon Elbre
f68e7b3fde satellite/overlay: replace pb.InfoResponse
pb.InfoResponse wasn't used for protocol buffer communication, but
instead as a satellite type.

Change-Id: I755619f2deec5b76c4fe488591b7d8c1b9fcdafb
2020-06-16 15:16:55 +03:00
paul cannon
7b8e91ff28 satellite/satellitedb: no orders for exited nodes
We should not be sending any type of orders to nodes that have completed
graceful exit with the current satellite. In particular, we should not
be trying to audit them, because that would be silly.

Change-Id: Ie2153e5739914ab696feefcdef28545ed70f84e4
2020-06-13 13:49:33 +00:00
Cameron Ayer
bad299b541 satellite/satellitedb: serialize UpdateStats and BatchUpdateStats transactions
Since we increased the number of audit workers from 1 to 2, we need to make sure
concurrent updates do not trample each other. We can do this by serializing the
transactions.

Change-Id: If1b2f71cabe3c779c12ffa33c0c3271778ac3ae0
2020-06-10 17:11:28 +00:00
Egon Elbre
36c461bd59 private/tagsql: track proper closing of rows and statements
This ensures that rows are closed to avoid leaks.
Also verifies that Err() is called, to ensure that no
error is left behind.

Change-Id: Idd1bec9bf479f40021da67b2c80ce83033149469
2020-06-05 18:25:43 +00:00
Egon Elbre
34db4a80fd ci: fix staticcheck failures
Change-Id: I176fb24214755a1940a0a1a4e9cc8e39f184870b
2020-06-05 13:15:34 +00:00
Yingrong Zhao
163c027a6d satellite/satellitedb: remove monkit trace from convertDBNode
Jaeger shows that this function is called repeatedly within a single
request, and most calls take less than 1ms. The trace therefore adds
little value and mostly creates noise.

Change-Id: I20234f36bbcf0fc22f91e5e1a5634c0cad577ed0
2020-06-01 17:58:43 +00:00
Jennifer Johnson
03e5f922c3 satellite/overlay: updates node with a vetted_at timestamp if they meet the vetting criteria
What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count
are greater than the configured thresholds), we set vetted_at to the current timestamp.

Why: We may want to use this timestamp in future development to select new vs vetted nodes.
It also allows flexibility in node vetting experiments and allows for better metrics around
vetting times.

Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately
Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and
commits another variable to the db (vetted_at), but this should be negligible.

Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d
2020-05-20 16:30:26 -04:00
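The vetting rule from commit 03e5f922c3 can be sketched as follows. The reputation type, the setVettedAt function, and the example values are hypothetical. The comparison follows the commit text literally ("greater than"), and the sketch assumes the timestamp is written once and then left alone.

```go
package main

import (
	"fmt"
	"time"
)

// reputation holds the counters named in the commit message plus the
// vetted_at timestamp (nil while the node is still unvetted).
type reputation struct {
	TotalAuditCount  int64
	TotalUptimeCount int64
	VettedAt         *time.Time
}

// exampleNow is a made-up timestamp used by main.
var exampleNow = time.Date(2020, 5, 20, 0, 0, 0, 0, time.UTC)

// setVettedAt stamps the node with now the first time both counters exceed
// their thresholds. An existing timestamp is never overwritten (assumption).
func setVettedAt(r *reputation, auditThreshold, uptimeThreshold int64, now time.Time) {
	if r.VettedAt != nil {
		return
	}
	if r.TotalAuditCount > auditThreshold && r.TotalUptimeCount > uptimeThreshold {
		r.VettedAt = &now
	}
}

func main() {
	r := &reputation{TotalAuditCount: 101, TotalUptimeCount: 101}
	setVettedAt(r, 100, 100, exampleNow)
	fmt.Println(r.VettedAt != nil)
}
```

Storing an explicit timestamp is what later lets node selection test vetted_at directly instead of re-deriving reputation from the counters.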