Commit Graph

74 Commits

Author SHA1 Message Date
Cameron
74ddfab810 satellite/overlay: insert DQ event into node events in overlay.DisqualifyNode
Also, return node email from overlaycache db DisqualifyNode to be used
in node events insertion

Change-Id: I41534cf01351c1690c3966a8055c5fe6fcf0d6a6
2022-11-04 15:18:31 +00:00
Cameron
68fe26ebe5 satelite/overlay: insert reputation events into node events
Insert reputation event into node events if reputation change occurs.

Change-Id: If1c5526092cb6834fe2faa6aa6e0306d4d88a4b7
2022-11-02 18:32:20 +00:00
Cameron
865974950d satellite/overlay: insert node online events into node events table
Instead of sending emails at the time the node is seen to be back
online, we have decided to send the event to the node events table,
which will initiate the email sending process at some point.

Change-Id: Id756209498112579de8e78ee20ad2df54571a617
2022-11-02 16:26:19 +00:00
Cameron
f06da25c3d satellite/overlay: add nodeevents.DB to satellite overlay service
Add nodeevents.DB to satellite overlay service so we can insert node
events into the nodeevents DB.

Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156
2022-11-02 15:56:37 +00:00
Cameron
a2ca443e29 satellite/overlay: send Node Online emails
Send an email when a node goes from offline to online.

Change-Id: I82d3f9001b9b0669e096d784edf097ffdcb1697d
2022-10-20 18:56:35 +00:00
Cameron
a52f766273 satellite/overlay: add email-sending functionality to overlay service
We want to send emails to SNOs. Node status changes go through the
overlay service, so it's a good place to add the mail service.
Add the mailservice.Service, satellite address, and satellite name to
overlay service. Also add feature flag --overlay.send-node-emails

Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24
2022-10-13 18:01:05 +00:00
Egon Elbre
48b0a65fbd satellite/overlay: use ReadCache in Download/UploadSelectionCache
sync2.ReadCache implements preemptive refreshing preventing stalling
while it's being updated.

Change-Id: Iee9ef36049b986f0e426c14a139b2bc9ac17fb53
2022-07-12 13:52:48 +03:00
Yaroslav Vorobiov
3f47d19aa6 satellite/overlay: add disqualification reason
Add disqualification reason to NodeDossier.
Extend DB.DisqualifyNode with disqualification reason.
Extend reputation Service.TestDisqualifyNode with disqualification reason.

Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d
2022-04-20 13:29:31 +00:00
Yaroslav Vorobiov
4223fa01f8 satellite/reputation: add disqualification reason for status update
Set disqualification reason when reputations stats are updated on DB.Update.
Added tests for DisqualifyNode and for disqualification cases which happens during Update.

Change-Id: I00130ab5d9722422805159ad2f183c205de60f7e
2022-04-20 13:29:10 +00:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
Fadila Khadar
e776c65172 satellite/checker: pieces in excluded countries are not healthy
Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection.

Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.

Another change will handle the logic at the repairer level.

Fixes https://github.com/storj/team-metainfo/issues/95

Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
2022-03-02 09:59:09 +00:00
Egon Elbre
be02aa9b17 satellite/overlay: add test for UpdateCheckIn panic
The database table got invalid input and the resulting error
was not checked. This adds updates that contain invalid fields
to trigger different errors.

Change-Id: Iacea32cbef5599aab562c88e4113073596cc9996
2022-01-24 15:01:59 +02:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status
inconsistency

The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
Egon Elbre
a42b9d1a48 all: fix uses of email.com
email.com is not a domain that should be used for examples
nor tests.

Change-Id: I654d4287d02633d5ed9740e81a79150470eeaf25
2022-01-05 16:29:19 +02:00
Cameron Ayer
bb21551a9c satellite/satellitedb: remove references to contained column in nodes table
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.

Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
2021-10-14 19:17:46 +00:00
Yingrong Zhao
646ce5b8cc satellite/overlay: remove reputation logic from overlay
Change-Id: I3492860e4537c7a8e4e824ec4c9c8d179134a0c0
2021-07-28 15:15:28 -04:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Yingrong Zhao
e91574cee1 satellite/{reputation, gracefulexit}: use reputation store in
gracefulexit

With the effort to move audit related data into reputation store, this
PR updates gracefulexit endpoint to use reputation service to get a
node's audit score

Change-Id: Iad93ea689ad67ff9c57c7be16687e21e715fab7a
2021-07-28 13:21:41 -04:00
Yingrong Zhao
6c7bf357cd satellite/{reputation,audit,overlay}: replace overlay with reputation
package in audit

This PR implements reputation store and replace overlay in audit service
to use such store for storing node's audit stats.

In order to keep the changeset smaller, most of the changes in this PR is for copying audit logic in overlay to
reputation package. In a following PR, the duplicating code will be
removed from overlay.

Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
2021-07-28 13:10:48 -04:00
Cameron Ayer
8c124c6fa4 satellite/{reputation,overlay,satellitedb}: create reputation service, DB, add overlay method UpdateReputation
Define service and DB interface for storing node reputation data
and updating the overlay cache.
Add overlay service and DB method UpdateReputation.
See https://github.com/storj/storj/pull/4144

Change-Id: Iedd8bd3274457d26c595919303d55327c1464b8c
2021-06-24 16:19:15 +00:00
Cameron Ayer
05f8d2d0b1 satellite/satellitedb: filter offline suspended nodes from selection
Change-Id: I5a6f413453332238d579a7bf50eb30e9156f96c2
2021-03-27 23:36:46 +00:00
Yaroslav Vorobiov
966535e9de {storagenode,satellite}/nodeoperator: add wallet features
Change-Id: Iac7eb40a52b8fddcc573aebaad2e3a30a10cded9
2021-02-08 22:09:45 +02:00
Egon Elbre
19e3dc4ec0 satellite/overlay: rename NodeSelectionCache to UploadSelectionCache
It wasn't obvious that NodeSelectionCache was only for uploads.

Change-Id: Ifeeaa6fdb50a4b7916245b48d8634d70ac54459c
2021-01-28 14:56:53 +02:00
Cameron Ayer
d14607a5f7 satellite/{contact,nodestats,overlay,satellitedb}: remove references to total_uptime_count and uptime_success_count columns
Change-Id: I1f92022909bc564e9b1e31bf937fdfe7c16554de
2021-01-19 15:43:02 -05:00
Cameron Ayer
0403e99a5b satellite/{overlay,satellitedb}: remove unused methods for old downtime tracking
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system which has been removed.
These methods should also be removed.

Change-Id: Idb829d742e1f987e095604423fff656fe581183e
2021-01-11 15:21:28 +00:00
Moby von Briesen
6e2ef3b9ee Revert "satellite/satellitedb: Do not consider nodes with offline_suspended as reputable."
This reverts commit e24262c2c9.

Change-Id: I287deb2e52d03bbd698ed055f0f216b0b5bf2798
2021-01-04 14:28:37 +00:00
Moby von Briesen
e24262c2c9 satellite/satellitedb: Do not consider nodes with offline_suspended as reputable.
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.

Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
2020-12-29 17:59:09 +00:00
Ethan Adams
6070018021
satellite/overlay: use AS OF SYSTEM TIME with Cockroach
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
2020-12-22 21:07:07 +02:00
Moby von Briesen
3fc76f4ffe satellite/downtime: Remove deprecated downtime tracking service.
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.

This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.

Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
2020-12-02 15:16:13 -05:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Jennifer Johnson
4e2413a99d satellite/satellitedb: uses vetted_at field to select for reputable nodes
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.

Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
2020-09-04 16:45:32 +00:00
Moby von Briesen
2d01dd9732 satellite/satellitedb: Add online_score column to nodes table
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.

Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
2020-08-31 15:07:07 +00:00
Moby von Briesen
60a95d0dc9 satellite/{satellitedb,overlay}: Enable offline suspension and review period
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.

In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.

Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.

Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
2020-08-28 16:35:48 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
stefanbenten
257855b5de all: replace == comparison with errors.Is
Change-Id: I05d9a369c7c6f144b94a4c524e8aea18eb9cb714
2020-07-14 15:50:25 +00:00
Cameron Ayer
3b4b5f45c7 satellite: replace references to Suspended with UnknownAuditSuspended
Change-Id: I3d2d00c95954c0546ad077702617895f262926ef
2020-06-23 14:19:22 +00:00
Cameron Ayer
bad299b541 satellite/satellitedb: serialize UpdateStats and BatchUpdateStats transactions
Since we increased the number of audit workers from 1 to 2, we need to make sure
concurrent updates do not trample each other. We can do this by serializing the
transactions.

Change-Id: If1b2f71cabe3c779c12ffa33c0c3271778ac3ae0
2020-06-10 17:11:28 +00:00
Egon Elbre
5d016425f1 satellite/{contact,downtime,overlay}: use NodeURL
Change-Id: I555a479a89e0ddbf0499898bdbc8574282cd6846
2020-05-20 11:09:05 +00:00
Egon Elbre
941d10cbc3 private/testplanet: remove Peer.Local()
Currently storagenode depends on overlay.NodeDossier, this is the first
step in removing it.

Change-Id: I034a3f1601835f8349bd41752455022e19bcc707
2020-05-20 11:05:34 +00:00
Jessica Grebenschikov
1ecbfc453f satellite/overlay: add random test for node selection from cache
Change-Id: I57b57db919b5afb3fc81176e078ac1e3a873565a
2020-05-18 16:59:52 +00:00
Jessica Grebenschikov
6a6427526b satellite/overlay: remove old updateaddress method
The UpdateAddress method use to be used when storage node's checked in with the Satellite, but once the contact service was created this method was no longer used. This PR finally removes it.

Change-Id: Ib3f83c8003269671d97d54f21ee69665fa663f24
2020-04-30 06:41:48 +00:00
Egon Elbre
e26a5598ec satellite/overlay: fix UpdateCheckIn test
time.Now() in a short amount of time can return the exact same value.
Ensure that the test uses times that are distinct.

Change-Id: Ia653ce0af4bfcf7b5da133a9cf98b823033d9592
2020-04-29 17:59:00 +00:00
Moby von Briesen
d7794a4851 satellite/overlay: hardcode default values for audit alpha/beta
Alpha=1 and beta=0 are the expected first values for any alpha/beta
reputation system we are using in the codebase. So we are removing the
configurability of these values.

Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3
2020-04-14 19:12:40 +00:00
Natalie Villasana
cf80b3caf3
satellite/overlay: combine SelectStorageNodes and SelectNewStorageNodes (#3831) 2020-04-09 11:19:44 -04:00
Egon Elbre
439aba922a satellite/overlay: reduce overhead of GetNodes
Instead of filtering on the client side it's better to filter on the
database side.

Change-Id: I845fbbe5ed28c2ffdb0b8a3f789b59c094fd1069
2020-03-30 18:36:23 +03:00
Egon Elbre
cb781d66c7 satellite/overlay: optimize FindStorageNodes
Reduce the number of fields returned from the query.

Benchmark results in `satellite/overlay`:

benchstat before.txt after2.txt
name                               old time/op  new time/op  delta
SelectStorageNodes-32              7.85ms ± 1%  6.27ms ± 1%  -20.18%  (p=0.002 n=10+4)
SelectNewStorageNodes-32           8.21ms ± 1%  6.61ms ± 0%  -19.53%  (p=0.002 n=10+4)
SelectStorageNodesExclusion-32     17.2ms ± 1%  15.9ms ± 1%   -7.55%  (p=0.002 n=10+4)
SelectNewStorageNodesExclusion-32  17.8ms ± 2%  16.1ms ± 0%   -9.38%  (p=0.002 n=10+4)
FindStorageNodes-32                48.4ms ± 1%  45.1ms ± 0%   -6.69%  (p=0.002 n=10+4)
FindStorageNodesExclusion-32       79.2ms ± 1%  76.1ms ± 1%   -3.89%  (p=0.002 n=10+4)

Benchmark results from `satellite/overlay` after making them parallel:

benchstat before-parallel.txt after2-parallel.txt
name                               old time/op  new time/op  delta
SelectStorageNodes-32               548µs ± 1%   353µs ± 1%  -35.60%  (p=0.029 n=4+4)
SelectNewStorageNodes-32            562µs ± 0%   368µs ± 0%  -34.51%  (p=0.029 n=4+4)
SelectStorageNodesExclusion-32     1.02ms ± 1%  0.84ms ± 0%  -18.08%  (p=0.029 n=4+4)
SelectNewStorageNodesExclusion-32  1.03ms ± 1%  0.86ms ± 2%  -16.22%  (p=0.029 n=4+4)
FindStorageNodes-32                3.11ms ± 0%  2.79ms ± 1%  -10.27%  (p=0.029 n=4+4)
FindStorageNodesExclusion-32       4.75ms ± 0%  4.43ms ± 1%   -6.56%  (p=0.029 n=4+4)

Change-Id: I1d85e2764eb270f4c2b1998303ccfc1179d65b26
2020-03-30 18:36:23 +03:00
Jennifer Johnson
699b635e5d satellite/overlay: rename newNodePercentage to newNodeFraction
Change-Id: Ie66de91f88183b44de0773589e83e4ade9aa997a
2020-03-19 20:09:32 +00:00
Moby von Briesen
2f991b6c56 satellite/{overlay, satellitedb}: account for suspended field in overlay cache
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair

This commit also removes unused overlay functionality.

Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).

Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states

Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
2020-03-17 17:14:56 +00:00
Ethan
bdbf764b86 satellite/orders;overlay: Consolidate order limit storage node lookups into 1 query.
https: //storjlabs.atlassian.net/browse/SM-449
Change-Id: Idc62cc2978fba67cf48f7c98b27b0f996f9c58ac
2020-03-16 23:15:47 +00:00