Commit Graph

153 Commits

Author SHA1 Message Date
Cameron Ayer
a0c5da6643 satellite/satellitedb: in stray nodes DQ, don't DQ nodes where last_contact_success = '0001-01-01 00:00:00+00'
When nodes check in for the very first time, if the satellite can't ping
them back, they are inserted into the nodes table with
last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes
chore runs before the node can fix their problem, they are DQd.

Solution: when DQing stray nodes, dont DQ where last_contact_success =
'0001-01-01 00:00:00+00'::timestamptz

Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8
2021-04-22 10:13:13 -04:00
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Jeff Wendling
a65aecfd98 compensation: always generate invoices for every node
instead of only generating invoices for nodes that had some
activity, we generate it for every node so that we can find
and pay terminal nodes that did not meet thresholds before
we recognized them as terminal.

Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44
2021-03-29 14:15:45 +00:00
Egon Elbre
d57873fd45 satellite/overlay: remove Inspector
Currently overlay.Inspector had two rpc methods and both of them were
unimplemented.

Change-Id: I1a2ecc7b7113898fa234a1c1fe451c8cc9e2ee81
2021-03-29 12:26:10 +03:00
Egon Elbre
86e698f572 pb: use *UnimplementedServer to avoid breaking API changes
Change-Id: I99a34eeb37ac4453411f273511710562a519f57a
2021-03-29 12:26:10 +03:00
Cameron Ayer
05f8d2d0b1 satellite/satellitedb: filter offline suspended nodes from selection
Change-Id: I5a6f413453332238d579a7bf50eb30e9156f96c2
2021-03-27 23:36:46 +00:00
Cameron Ayer
1a51049ac0 satellite/{overlay,satellitedb}: add flag to toggle suspending nodes for offline audits
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.

If the flag is set to true, nodes will be suspended if they meet the
requirements.

If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.

Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
2021-03-27 16:28:27 +00:00
Cameron Ayer
eb44dc21b4 satellite/satellitedb: select stray nodes for DQ in separate tx from update
Previously we would select a limited number of nodes for DQ in a
CTE and run the update on that set in a single transaction. This
could lead to locking on the table, so instead we select and update
in separate transactions.

Change-Id: I1e802c0845e829eeadcee4fa382f58462515fdb1
2021-03-27 00:00:23 +00:00
Michał Niewrzał
237782813b Merge remote-tracking branch 'origin/multipart-upload'
Change-Id: If6c5a450b238adab55d1e0dea67d01e5f5768a9f
2021-03-23 09:44:49 +01:00
Cameron Ayer
864ad70fe2 satellite/overlay/straynodes: set --stray-nodes.enable-dq release default to true
Since we will enable this on all satellites, just set default to true

Change-Id: Ibc86a0afd0b0f57e86bd067abb9cdf06c295a467
2021-03-22 17:25:09 +00:00
Cameron Ayer
2607b16070 satellite/{overlay/straynodes,satellitedb}: rework DQNodesLastSeenBefore to return DQd node IDs and last contact successes
We would like to log Node IDs and last contact successes of nodes DQd
in this manner. We would also like to avoid returning an unbounded list
of items from the db. Therefore we change the query to select a limited
number of nodes that meet the DQ conditions and iterate until 0 rows are
returned. Each column of the query is already indexed.

Change-Id: Iaec2d9b56e7202b7c2028ba21750d40c8dd506ee
2021-03-22 13:01:30 -04:00
Michał Niewrzał
d995fb497f Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I367da03351ab80f7343332420490dde9282aa47a
2021-02-23 12:31:31 +01:00
Cameron Ayer
549033f2e6 satellite/satellitedb: don't include DQd and exited nodes in DQStrayNodes
Don't update DQ time of already DQd nodes. Don't DQ nodes who exited.

Change-Id: I4528a9ba9f8e278987165ad337a9b34dadb9788b
2021-02-19 15:12:30 -05:00
Egon Elbre
1137620baf satellite/satellitedb: move tests to their domains
Testing interfaces is slightly clearer when it's in the package needing
the database rather than each individual implementation.

Change-Id: I10334c214a205f7e510b939b4359a2214c4e060a
2021-02-19 17:29:15 +02:00
JT Olio
b2ed7edd30 cmd/satellite: restore-trash parallel workers
Change-Id: Ic7466b21c20bda334e7ba4268a494e96b6528ac1
2021-02-18 19:11:19 +02:00
JT Olio
3ae3389ddc cmd/satellite: restore-trash command
Change-Id: I80fc932c12147692d49cde277784871ac611fcad
2021-02-18 09:19:22 -07:00
Michał Niewrzał
12402eb729 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I38adf8218c1415c7ea1910f8bd6bed13544b0f03
2021-02-17 08:50:38 +01:00
Egon Elbre
f7ad86521e satellite/overlay: fix data race in TestAuditHistoryBasic
Change-Id: I196f10973fe10b10b226ac3a63e62bf4fe9c256b
2021-02-16 12:32:19 +02:00
Michał Niewrzał
908a96ae30 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I075aaff42ca3f5dc538356cedfccd5939c75e791
2021-02-11 11:48:23 +01:00
Yaroslav Vorobiov
966535e9de {storagenode,satellite}/nodeoperator: add wallet features
Change-Id: Iac7eb40a52b8fddcc573aebaad2e3a30a10cded9
2021-02-08 22:09:45 +02:00
Kaloyan Raev
d0612199f0 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/metainfo/config.go
	satellite/metainfo/metainfo_test.go

Change-Id: I95cf3c1d020a7918795b5eec63f36112fdb86749
2021-02-01 14:32:12 +02:00
Egon Elbre
b7a0739219 satellite/overlay: use DownloadSelectionCache for getting node IPs
Change-Id: Ib8f4eedb2bf465767050693a1e961b37a294ca06
2021-01-29 16:47:10 +02:00
Egon Elbre
54e01d37f9 satellite/overlay: add DownloadSelectionCache
Change-Id: Ic0779280172325f8d03f55a2e9673722f72bdd44
2021-01-29 16:47:06 +02:00
Egon Elbre
19e3dc4ec0 satellite/overlay: rename NodeSelectionCache to UploadSelectionCache
It wasn't obvious that NodeSelectionCache was only for uploads.

Change-Id: Ifeeaa6fdb50a4b7916245b48d8634d70ac54459c
2021-01-28 14:56:53 +02:00
littleskunk
0b2568d712
satellite/overlay/straynodes: increase development duration without contact
Stopping storj-sim for over 5 minutes caused nodes to be disqualified. Set development max duration without contact to 300 days.
2021-01-26 12:24:39 +02:00
Kaloyan Raev
c24ada7114 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum

Change-Id: Icf7c029e9d800e5f6a9fdd208c36f28e05468690
2021-01-20 17:35:57 +02:00
Cameron Ayer
d14607a5f7 satellite/{contact,nodestats,overlay,satellitedb}: remove references to total_uptime_count and uptime_success_count columns
Change-Id: I1f92022909bc564e9b1e31bf937fdfe7c16554de
2021-01-19 15:43:02 -05:00
Cameron Ayer
75d828200c private,satellite: add chore to dq stray nodes
Full scope:
private/testplanet,satellite/{overlay,satellitedb}

Description:
In most cases, downtime tracking with audits will eventually lead
to DQ for nodes who are unresponsive. However, if a stray node has no
pieces, it will not be audited and will thus never be disqualified.
This chore will check for nodes who have not successfully been contacted
in some set time and DQ them.

There are some new flags for toggling DQ of stray nodes and the timeframes
for running the chore and how long nodes can go without contact.

Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615
2021-01-19 14:21:56 -05:00
Kaloyan Raev
6dff40f5c5 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/metainfo/metainfo.go

Change-Id: Ib5c49f3c911c58319855a171f9ce73657da976d9
2021-01-14 14:33:59 +02:00
Egon Elbre
85fb964afe satellite/{metainfo,overlay}: improvements to GetObjectIPs
* Deduplicate NodeID list prior to fetching IPs.
* Use NodeSelectionCache for fetching reliable IPs.
* Return number of segements, reliable pieces and all pieces.

Change-Id: I13e679caab275488b4037624b840a4068dad9589
2021-01-14 09:12:45 +00:00
Cameron Ayer
0403e99a5b satellite/{overlay,satellitedb}: remove unused methods for old downtime tracking
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system which has been removed.
These methods should also be removed.

Change-Id: Idb829d742e1f987e095604423fff656fe581183e
2021-01-11 15:21:28 +00:00
Michał Niewrzał
ec88d21a3c Merge 'main' branch.
Change-Id: I6e8162d1a6caf75e89c9f9c9f9522730aebf83ae
2021-01-11 10:26:58 +01:00
Moby von Briesen
6e2ef3b9ee Revert "satellite/satellitedb: Do not consider nodes with offline_suspended as reputable."
This reverts commit e24262c2c9.

Change-Id: I287deb2e52d03bbd698ed055f0f216b0b5bf2798
2021-01-04 14:28:37 +00:00
Michał Niewrzał
ad3e3a38c5 Merge 'main' branch
Change-Id: Ia0db1b1f9ef3e0671d3f2208881b0abc3064e200
2021-01-04 12:13:45 +01:00
Moby von Briesen
edbee53888 satellite,storagenode: Pass audit history over GetStats endpoint
Full prefix: satellite/{overlay,nodestats},storagenode/{reputation,nodestats}

Allow the storagenode to receive its audit history data from the
satellite via the satellite's GetStats endpoint.

The storagenode does not save this data for use in the API yet.

Change-Id: I9488f4d7a4ccb4ccf8336b8e4aeb3e5beee54979
2020-12-30 19:13:26 +00:00
Moby von Briesen
825dc71227 satellite/{overlay, satellitedb}: Refactor audit history
* Separate audit history interface into its own file in the overlay
package
* Add overlay.AuditHistory struct so that internalpb.AuditHistory is
only used from within the database layer
* Add overlay.GetAuditHistory function for features that will require
access to detailed audit history information
* Do not return full audit history from UpdateAuditHistory - callers to
that function only need to know the online score and whether a full
tracking period has been completed
* Move audit history tests out of satellite/satellitedb, since they are
independent of database implementation

Change-Id: I35b0c4ac23bbaabd80624f8a9631c3cb1a1f33bd
2020-12-29 18:50:22 +00:00
Moby von Briesen
e24262c2c9 satellite/satellitedb: Do not consider nodes with offline_suspended as reputable.
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.

Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
2020-12-29 17:59:09 +00:00
Ethan Adams
6070018021
satellite/overlay: use AS OF SYSTEM TIME with Cockroach
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
2020-12-22 21:07:07 +02:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Michal Niewrzal
8d3ea9c251 satellite/repair/repairer: implement SegmentRepairer with metabase
Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c
2020-12-17 10:47:21 +00:00
Stefan Benten
494bd5db81
all: golangci-lint v1.33.0 fixes (#3985) 2020-12-05 17:01:42 +01:00
Moby von Briesen
3fc76f4ffe satellite/downtime: Remove deprecated downtime tracking service.
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.

This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.

Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
2020-12-02 15:16:13 -05:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Egon Elbre
11338e9beb satellite/internalpb: move audithistory.pb
Change-Id: I8eee84d49ed90459168ddaf04ae57f790c2a22c4
2020-10-30 15:30:11 +02:00
Egon Elbre
7ce372c686 satellite/internalpb: add inspectors
Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a
2020-10-30 13:28:17 +02:00
Cameron Ayer
bb7be23115 satellite/{audit,overlay,satellitedb}: enable reporting offline audits
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.

Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
2020-10-27 10:44:46 +00:00
Moby von Briesen
7c3afe164b satellite/overlay: uncomment dq for offline and disable with feature flag
Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27
2020-10-16 12:55:16 +00:00
Egon Elbre
2268cc1df3 all: fix linter complaints
Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95
2020-10-13 15:59:01 +03:00
Cameron Ayer
b39a99bae6 satellite/{overlay,satellitedb}: always show node's real online score
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.

However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.

Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.

Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
2020-10-02 12:28:11 -04:00
Jennifer Johnson
4e2413a99d satellite/satellitedb: uses vetted_at field to select for reputable nodes
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.

Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
2020-09-04 16:45:32 +00:00