Commit Graph

104 Commits

Author SHA1 Message Date
Jennifer Johnson
4e2413a99d satellite/satellitedb: uses vetted_at field to select for reputable nodes
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.

Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
2020-09-04 16:45:32 +00:00
Moby von Briesen
2d01dd9732 satellite/satellitedb: Add online_score column to nodes table
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.

Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
2020-08-31 15:07:07 +00:00
Moby von Briesen
60a95d0dc9 satellite/{satellitedb,overlay}: Enable offline suspension and review period
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.

In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.

Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.

Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
2020-08-28 16:35:48 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00
Moby von Briesen
5f0477ebe9 satellite/{overlay,satellitedb}: Create database functionality for updating audit history
Add a function to the overlay cache called UpdateAuditHistory, which
allows us to add online or offline audits to a particular node's audit
history, and get that node's "online score" for the configured tracking
period.

The next step will be to use UpdateAuditHistory from inside
BatchUpdateStats/UpdateStats, so that audit history is actually updated
when nodes get audited, and we can suspend nodes based on their online
score.

Change-Id: I2289105e6961e68e829a987ff756b0e576fab120
2020-08-20 17:34:27 +00:00
Qweder93
01bb2bd17d satellite/audit: verifier checks if node made sucess GE before auditing
Change-Id: Ia6cde4e9fcf11020a5301d38065f7159f276eb80
2020-08-17 23:37:57 +03:00
Egon Elbre
94a09ce20b all: add missing dots
Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a
2020-08-11 17:50:01 +03:00
Qweder93
4ee1b2d45a storagenode/console: added list of all audits per satellite to sno dashboard/satellites
Change-Id: I52e58748d6467f372d9a308347fc77e400d137e2
2020-08-10 12:55:07 +00:00
Moby von Briesen
e02adfe5e9 satellite/overlay/config.go: Add AuditHistoryConfig to overlay
Adds AuditHistory{WindowSize, TrackingPeriod, GracePeriod,
OfflineThreshold}. These values will be used to track offline audits over
time, and to suspend/disqualify nodes for being offline for too long.

Change-Id: I05f7dbc3c034bdc53c4fbd7719c71a44f37ec6a5
2020-08-04 18:18:56 +00:00
Moby von Briesen
5dfe27f175 satellite/{repair,overlay}: Use overlay NodeSelectionCache for repair uploads
This change removes the overlay function FindStorageNodesForRepair,
which skips using the node selection cache and hits the database
directly. Otherwise, it is functionally identical to
FindStorageNodesForUpload, which checks the node selection cache first.

When selecting nodes for PUT_REPAIRs, we now call
FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce
database load.

Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e
2020-08-04 12:50:12 -04:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
stefanbenten
257855b5de all: replace == comparison with errors.Is
Change-Id: I05d9a369c7c6f144b94a4c524e8aea18eb9cb714
2020-07-14 15:50:25 +00:00
Egon Elbre
5bdcd86fa7 ci: test benchmarks
This runs each benchmark for one iteration to ensure that they are
valid. Unfortunately, it does not give any useful metrics as output.

Change-Id: I68940398c8dd849aed656bd12656f48d5df10128
2020-07-10 13:26:49 +00:00
Cameron Ayer
3b4b5f45c7 satellite: replace references to Suspended with UnknownAuditSuspended
Change-Id: I3d2d00c95954c0546ad077702617895f262926ef
2020-06-23 14:19:22 +00:00
Egon Elbre
f68e7b3fde satellite/overlay: replace pb.InfoResponse
pb.InfoResponse wasn't used for protocol buffer communication, but
instead as a satellite type.

Change-Id: I755619f2deec5b76c4fe488591b7d8c1b9fcdafb
2020-06-16 15:16:55 +03:00
Cameron Ayer
bad299b541 satellite/satellitedb: serialize UpdateStats and BatchUpdateStats transactions
Since we increased the number of audit workers from 1 to 2, we need to make sure
concurrent updates do not trample each other. We can do this by serializing the
transactions.

Change-Id: If1b2f71cabe3c779c12ffa33c0c3271778ac3ae0
2020-06-10 17:11:28 +00:00
Egon Elbre
c5d4a13158 satellite/nodeselection: use NodeURL
Change-Id: I2ebd4dbf993ff5c7864f3a3a665b5c8fc48aa7d1
2020-05-27 05:46:11 +00:00
littleskunk
2fbb34c3ea
nodeselection: Increase minimum free space to 500MB (#3898) 2020-05-25 12:13:28 +02:00
Egon Elbre
bef84a5f9d storagenode: remove dependency to overlay.NodeDossier
This is the last dependency from storage node to satellite.

Change-Id: I12f7abb91e84f823ba5af126c6e2979519838612
2020-05-21 08:37:13 +03:00
Jennifer Johnson
03e5f922c3 satellite/overlay: updates node with a vetted_at timestamp if they meet the vetting criteria
What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count
are greater than the configured thresholds), we set vetted_at to the current timestamp.

Why: We may want to use this timestamp in future development to select new vs vetted nodes.
It also allows flexibility in node vetting experiments and allows for better metrics around
vetting times.

Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately
Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and
commits another variable to the db (vetted_at), but this should be negligible.

Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d
2020-05-20 16:30:26 -04:00
Egon Elbre
5d016425f1 satellite/{contact,downtime,overlay}: use NodeURL
Change-Id: I555a479a89e0ddbf0499898bdbc8574282cd6846
2020-05-20 11:09:05 +00:00
Egon Elbre
941d10cbc3 private/testplanet: remove Peer.Local()
Currently storagenode depends on overlay.NodeDossier, this is the first
step in removing it.

Change-Id: I034a3f1601835f8349bd41752455022e19bcc707
2020-05-20 11:05:34 +00:00
littleskunk
8ec64f3daf
satellite/overlay: enable node selection cache on all satellites (#3895) 2020-05-19 19:25:53 +02:00
Egon Elbre
16abf02b35 satellite/{nodeselection,overlay}: use the new package
Change-Id: I034fdbe578dec2e5c906aca82231cd3e56f26aeb
2020-05-18 21:38:43 +00:00
Jessica Grebenschikov
1ecbfc453f satellite/overlay: add random test for node selection from cache
Change-Id: I57b57db919b5afb3fc81176e078ac1e3a873565a
2020-05-18 16:59:52 +00:00
Egon Elbre
ec589a8289 all: fix comments about grpc
Change-Id: Id830fbe2d44f083c88765561b6c07c5689afe5bd
2020-05-11 13:05:34 +03:00
Egon Elbre
678b859172 satellite/overlay: remove MinimumRequiredNodes
In non-test code we were only using RequestedCount, not need to have
MinimumRequiredNodes.

Change-Id: I40736f4b028b41e94abfdeb221bce5aa86a5cb82
2020-05-07 15:41:23 +00:00
Egon Elbre
ce6a500b0c satellite/overlay: support DistinctIP=false in selection cache
Most of our tests and storj-sim are using DistinctIP.

Also fix bug with newNodeCount calculation.

Change-Id: I1a6d0efe7074908896a3322d19f487b929f0f0fc
2020-05-07 11:04:32 +00:00
Egon Elbre
2955c50bc1 satellite/overlay: fix data-race in node selection cache
GetNodes returned references to nodes in the immutable state, however
some parts of code expect them to be modified.

Change-Id: I5be1866f95e0dbe062a6b6be60e29f2365c35faa
2020-05-06 20:03:06 +03:00
Egon Elbre
4e94da3fda satellite/overlay: add feature flag for node selection cache
Also distinguish the purpose for selecting nodes to avoid potential
confusion, what should allow caching and what shouldn't.

Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61
2020-05-06 16:13:47 +03:00
Moby von Briesen
8f60cfc4fb satellite/overlay: Add flag for enabling/disabling disqualification from suspension mode
Add a flag that allows us to easily switch disqualification from
suspension mode on or off. A node will only be disqualified from
suspension mode if it has been suspended for longer than the grace
period AND the SuspensionDQEnabled flag is true.

Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792
2020-05-04 17:25:09 +00:00
Jessica Grebenschikov
6a6427526b satellite/overlay: remove old updateaddress method
The UpdateAddress method use to be used when storage node's checked in with the Satellite, but once the contact service was created this method was no longer used. This PR finally removes it.

Change-Id: Ib3f83c8003269671d97d54f21ee69665fa663f24
2020-04-30 06:41:48 +00:00
Egon Elbre
e26a5598ec satellite/overlay: fix UpdateCheckIn test
time.Now() in a short amount of time can return the exact same value.
Ensure that the test uses times that are distinct.

Change-Id: Ia653ce0af4bfcf7b5da133a9cf98b823033d9592
2020-04-29 17:59:00 +00:00
Moby von Briesen
de366537a8 satellite/satellitedb/overlaycache: fix behavior around gracefully exited nodes
Sometimes nodes who have gracefully exited will still be holding pieces
according to the satellite. This has some unintended side effects
currently, such as nodes getting disqualified after having successfully
exited.
* When the audit reporter attempts to update node stats, do not update
stats (alpha, beta, suspension, disqualification) if the node has
finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated)
* Treat gracefully exited nodes as "not reputable" so that the repairer
and checker do not count them as healthy (overlay/statdb_test.go
TestKnownUnreliableOrOffline, repair/repair_test.go
TestRepairGracefullyExited)

Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272
2020-04-28 23:58:43 +00:00
Jess G
825226c98e
satellite/overlay: use node selection cache for uploads (#3859)
* satellite/overlay: use node selection cache for uploads

Change-Id: Ibd16cccee979d0544f2f4a01749af9f36f02a6ad

* fix config lock

Change-Id: Idd307e4dee8ab92749f1ec3f996419ea0af829fd

* start fixing tests

Change-Id: I207d373a3b2a2d9312c9e72fe9bd0b01e06ad6cf

* fix test, add some more

Change-Id: I82b99c2004fca2510965f9b389f87dd4474bc722

* change config name

Change-Id: I0c0f7fc726b2565dc3828cb723f5459a940f2a0b

* add benchmarks

Change-Id: I05fa25bff8d5b65f94d918556855b95163d002e9

* revert bench to put in different PR

Change-Id: I0f6942296895594768f19614bd7b2e3b9b106ade

* add staleness to benchmark

Change-Id: Ia80a310623d5a342afa6d835402170b531b0f870

* add cache config to testplanet

Change-Id: I39abdab8cc442694da543115a9e470b2a8a25dff

* have repair select old way

Change-Id: I25a938457d7d1bcf89fd15130cb6b0ac19585252

* lower testplante config time

Change-Id: Ib56a2ed086c06bc6061388d15a10a2526a663af7

* fix test

Change-Id: I3868e9cacde2dfbf9c407afab04dc5fc2f286f69
2020-04-24 09:11:04 -07:00
Jess G
7a4dcd61f7
satellite/overlay: add changes to selected node benchmarks (#3862)
* add changes to selected node benchmarks

Change-Id: I0259af155f9151cc2c7830d10f8907634c5e494f

* fix lint

Change-Id: I6c7b82bbfa579b468712f90fc03b12a931874a54

* restart jenkins

Change-Id: I1d7300343e94e695cd1c93a3b59895f52bbcb11e
2020-04-23 15:30:50 -07:00
Moby von Briesen
720e26d235 satellite/satellitedb/overlaycache: update unknown alpha/beta values properly
Update unknown_audit_reputation_alpha and unknown_audit_reputation_beta.
Add test to verify that BatchUpdateStats properly modifies unknown audit
alpha/beta

Change-Id: I0d5f9cac96a99f64905cf575b772402db0756a9d
2020-04-23 10:40:53 -04:00
Moby von Briesen
72b93f3120 satellite/satellitedb: disqualify suspended nodes when the grace period passes
If a node is suspended and receives an unknown or failing audit,
disqualify them if the grace period (default 1w in production) has
passed.

Migrate the nodes table so any node that is currently suspended gets
unsuspended when the satellite starts up.

Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82
2020-04-22 15:45:00 -04:00
Jess G
5ea1602ca5
satellite/overlay: add selected node cache (#3846)
* init implementation cache

Change-Id: Ia54a1943e0707a77189bc5f4a9aaa8339c98d99a

* one query to init cache

Change-Id: I7c04b3ae104b553ae23fca372351a4328f632c66

* add monit tracking of cache

Change-Id: I7d209e12c8f32d43708b23bf2126c5d5098e0a07

* add first test

Change-Id: I0646a9349d457a9eb3920f7cd2d62fb72ffc3ab5

* add staleness to cache

Change-Id: If002329bfdd53a4b200ad14dbd2ffc8b280aedb8

* add init test

Change-Id: I3a3d0aa74cfac1d125fa93cb749316ed2a74d5b1

* fix comment

Change-Id: I73353d00ccf0952b38c0f8ef7d1755c15cbfe9d9

* mv to nodeselection pkg

Change-Id: I62487f768296c7a7b597fa398a4c42daf6e9c5b7

* add state to cache

Change-Id: I081e77ec0e16706faee1a267de9a7fa643d6ac11

* add refresh concurrent test

Change-Id: Idcba72508291099f280edc65355273c0acc3d3ce

* add a few more tests

Change-Id: I9422e9eaa22bf01c11f14bdb892ebcf7b3e5e5fb

* fix tests, add min version to select allnodes

Change-Id: I926f41d568951ad4ff70c6d4ceb87abb1e3e5009

* update comments

Change-Id: I6ffe33e245ca65fb523c880cd72e63ce35776eb9

* fixes and rm Init

Change-Id: Ifbe09b668978b5d9af09ca38cb080d02a2154cf4

* fix format

Change-Id: I03cc217e28dc1839190c5c6dbdbb602c132a5a38
2020-04-14 13:50:02 -07:00
Moby von Briesen
d7794a4851 satellite/overlay: hardcode default values for audit alpha/beta
Alpha=1 and beta=0 are the expected first values for any alpha/beta
reputation system we are using in the codebase. So we are removing the
configurability of these values.

Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3
2020-04-14 19:12:40 +00:00
Natalie Villasana
cf80b3caf3
satellite/overlay: combine SelectStorageNodes and SelectNewStorageNodes (#3831) 2020-04-09 11:19:44 -04:00
Natalie Villasana
0a1bbc9824 satellite/overlay: add TestEnsureMinimumRequested
Adds a test to make sure that the correct amount of total nodes,
reputable nodes, and new nodes are returned by SelectStorageNodes
in different cases.

Change-Id: I0939159600afde8a46c35735f1edf0576fcdb4cd
2020-04-08 16:06:25 +00:00
Egon Elbre
439aba922a satellite/overlay: reduce overhead of GetNodes
Instead of filtering on the client side it's better to filter on the
database side.

Change-Id: I845fbbe5ed28c2ffdb0b8a3f789b59c094fd1069
2020-03-30 18:36:23 +03:00
Egon Elbre
cb781d66c7 satellite/overlay: optimize FindStorageNodes
Reduce the number of fields returned from the query.

Benchmark results in `satellite/overlay`:

benchstat before.txt after2.txt
name                               old time/op  new time/op  delta
SelectStorageNodes-32              7.85ms ± 1%  6.27ms ± 1%  -20.18%  (p=0.002 n=10+4)
SelectNewStorageNodes-32           8.21ms ± 1%  6.61ms ± 0%  -19.53%  (p=0.002 n=10+4)
SelectStorageNodesExclusion-32     17.2ms ± 1%  15.9ms ± 1%   -7.55%  (p=0.002 n=10+4)
SelectNewStorageNodesExclusion-32  17.8ms ± 2%  16.1ms ± 0%   -9.38%  (p=0.002 n=10+4)
FindStorageNodes-32                48.4ms ± 1%  45.1ms ± 0%   -6.69%  (p=0.002 n=10+4)
FindStorageNodesExclusion-32       79.2ms ± 1%  76.1ms ± 1%   -3.89%  (p=0.002 n=10+4)

Benchmark results from `satellite/overlay` after making them parallel:

benchstat before-parallel.txt after2-parallel.txt
name                               old time/op  new time/op  delta
SelectStorageNodes-32               548µs ± 1%   353µs ± 1%  -35.60%  (p=0.029 n=4+4)
SelectNewStorageNodes-32            562µs ± 0%   368µs ± 0%  -34.51%  (p=0.029 n=4+4)
SelectStorageNodesExclusion-32     1.02ms ± 1%  0.84ms ± 0%  -18.08%  (p=0.029 n=4+4)
SelectNewStorageNodesExclusion-32  1.03ms ± 1%  0.86ms ± 2%  -16.22%  (p=0.029 n=4+4)
FindStorageNodes-32                3.11ms ± 0%  2.79ms ± 1%  -10.27%  (p=0.029 n=4+4)
FindStorageNodesExclusion-32       4.75ms ± 0%  4.43ms ± 1%   -6.56%  (p=0.029 n=4+4)

Change-Id: I1d85e2764eb270f4c2b1998303ccfc1179d65b26
2020-03-30 18:36:23 +03:00
Egon Elbre
a6540dc3ef satellite/overlay: remove unused KeyLock
Change-Id: Ie99f97772824ceafb3d97453545fc6e96be2fb6f
2020-03-30 16:47:48 +03:00
Egon Elbre
c970969503 satellite/overlay: add benchmark for node selection
Change-Id: I15b767a78b662f8276e656b3fb73a15ec59e76c8
2020-03-27 23:09:29 +02:00
Jennifer Johnson
699b635e5d satellite/overlay: rename newNodePercentage to newNodeFraction
Change-Id: Ie66de91f88183b44de0773589e83e4ade9aa997a
2020-03-19 20:09:32 +00:00
Moby von Briesen
2f991b6c56 satellite/{overlay, satellitedb}: account for suspended field in overlay cache
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair

This commit also removes unused overlay functionality.

Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).

Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states

Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
2020-03-17 17:14:56 +00:00
Ethan
bdbf764b86 satellite/orders;overlay: Consolidate order limit storage node lookups into 1 query.
https: //storjlabs.atlassian.net/browse/SM-449
Change-Id: Idc62cc2978fba67cf48f7c98b27b0f996f9c58ac
2020-03-16 23:15:47 +00:00
Moby von Briesen
8b72181a1f satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.

Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
2020-03-16 20:29:26 +00:00