Commit Graph

214 Commits

Author SHA1 Message Date
paul cannon
915f3952af satellite/repair: repair pieces on the same last_net
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.

This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).

Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
2023-04-06 17:34:25 +00:00
JT Olio
2a63225b98 satellite/{contact,satellitedb}: preserve node message debounce support
Change-Id: I453ad35fda4e61d068db0c476dd86b50d7f2d705
2023-03-20 16:13:06 +00:00
paul cannon
97e20bc579 scripts/tests: fix rollingupgrade test even more
This might be pretty awful, but at least it is a complete and non-flaky
solution.

**Only when using the rollingupgrade test** (which implies a throwaway
satellite and also a PostgreSQL backend), create a trigger on the nodes
table which forces last_net to be equal to last_ip_port always.

Change-Id: I8448cf131e46576d96a414d06780270c7b2b1892
2023-03-13 15:49:07 +00:00
paul cannon
fd6ce6b9a5 scripts/tests: fix test-sim-rolling-upgrade.sh
This test involves a satellite with dev defaults (DistinctIP=no) being
upgraded past commit 2522ff09b6, which
means we need to run the dev-defaults-satellite-upgrade migration SQL
to avoid getting DistinctIP=yes behavior (which breaks the tests).

Change-Id: I29fb596d1ffa568dad635d98cfe9abacd3aaa48f
2023-03-09 23:35:36 +00:00
paul cannon
0e2fef977f satellite/overlay: add SetAllContainedNodes method to overlay.DB
We will be needing an infrequent chore to check which nodes are in the
reverify queue and synchronize that set with the 'contained' field in
the nodes db, since it is easily possible for them to get out of sync.
(We can't require that the reverification queue table be in the same
database as the nodes table, so maintaining consistency with SQL
transactions is out. Plus, even if they were in the same database, using
such SQL transactions to maintain consistency would be slow and
unwieldy.)

This commit adds a method to the overlay allowing the caller to set the
contained status of all nodes in the nodes table at once. This is valid
because our definition of "contained" now depends solely on whether a
node appears at least once in the reverification queue. Only rows whose
contained field does not match the expectation will be updated; the
contained timestamp will not be updated for a node which is supposed to
be contained and was already contained.

Change-Id: I8cabe56ad897b6027e11aa5b17175295391aa3ac
2023-02-06 10:18:54 +00:00
JT Olio
686faeedbd satellite/overlay: return noise info with selected nodes
we have two more fields in the database (noise_proto and
noise_public_key) that now need to go into pb.NodeAddress when
returning AddressedOrderLimits.

the only real complication is making sure type conversions between
database types and NodeURLs and so on don't lose this new
pb.NodeAddress field (NoiseInfo). otherwise this is a relatively
straightforward commit

Change-Id: I45b59d7b2d3ae21c2e6eb95497f07cd388d454b3
2023-02-02 15:46:27 +00:00
JT Olio
2753d5a32f satellite/overlay: keep track of noise info per node
Change-Id: Icef04c3e87dbf4bb57d3837274c323bf6dd2c81f
2023-02-01 23:03:35 -05:00
Cameron
0596651580 satellite/satellitedb: fix updating nodes.last_software_update_email
The CASE expression used to determine which value to set
last_software_update_email to did not have an ELSE clause. Therefore,
when the node is both below the minimum version and did not receive a
version update email (no condition is true), the value would be set to
NULL.

Additionally, replace `time.Now()` with `timestamp` in the check to
determine if the email cooldown has passed.

Change-Id: I2e2e93f1a865e123ed8b665be9621cebfb72236f
2023-02-01 17:25:58 -05:00
paul cannon
b6bcb32ecf satellite/reputation: more accurate "reputation changes" list
`overlay.(*Service).UpdateReputation()` takes a "reputationChanges"
parameter, a slice of node events indicating whether we think the node's
disqualification or suspension status is changing. This is necessary so
that the overlay service can notify the nodeevents DB about these
changes.

In several cases, however, this list of events is not constructed
correctly, because of missing information about the previous state.
In most cases, this is because the node was offline, and the order limit
creation functions (which usually obtain and return the prior reputation
status) ignored that node.

This change makes it so that all callers to
`overlay.(*Service).UpdateReputation()` can be expected to provide a
correct list of change events (as correct as feasible, given that we
can't lock the node's information in the database during the entire
operation).

It ended up that there was only one caller we needed to worry about, and
that was reputation.(*Service).ApplyAudit(). So the bulk of this change
is teaching that function how to recognize when the prior reputation
status was not filled in, and fill it in.

Refs: https://github.com/storj/storj/issues/5464
Change-Id: I52ce385fc9c0ce3b283b998d517998e7f4ec8792
2023-01-31 18:39:40 +00:00
JT Olio
e40191afd6 storj: upgrade to use latest storj/common NodeAddress
Change-Id: I5987391bcfe5f6dfd7b525698c337a4cbda9b76e
2023-01-25 01:37:26 +00:00
Fadila Khadar
5c3a148d6e satellite/overlaycache: fix typo in UpdateCheckIn request
- fix a malformed SQL query
- add test to be sure we don't have this problem again.

Change-Id: I3fde8c59ba01335411e51d964bec95bc26cfc961
2022-12-14 22:21:45 +01:00
paul cannon
ed0fa59f23 satellite/overlay: add SetNodeContained() method
SetNodeContained() will change the contained flag in the nodes table,
which will affect whether nodes are selected for new uploads. This flag
_should_ correlate with whether or not a given node has any entries in
the reverification queue. However, the reverification queue is intended
to be 'safely partitionable' from the nodes table, so we can't enforce
that characteristic transactionally. But this is ok; there are no dire
consequences if they are out of sync.

We will be adding a chore that updates the contained flag based on the
contents of the reverification queue periodically, if something fails
to set it directly when appropriate.

Refs: https://github.com/storj/storj/issues/5231
Change-Id: I26460d8718dee63fd55d00a44568b2065fc8fe30
2022-12-01 12:43:40 +00:00
Cameron
8681a36164 satellite/overlay: add ability to get offline nodes in need of email
Add LastOfflineEmail to overlay.NodeDossier. This is the last time a
node got an offline email. Add two new overlay db methods,
GetOfflineNodesForEmail and UpdateLastOfflineEmail. Edit db method
UpdateCheckIn to nullify last_offline_email if node is up.

Change-Id: I1ee60e7d98dd1b68348a57f9a4fb77c6c9895d6d
2022-11-17 19:03:04 +00:00
Cameron
2a25974261 satellite/overlay: insert node software update events into node events
When a node checks in and its version is below the minimum, insert
BelowMinVersion event into node events

Change-Id: I0e437ac34496778369515cbc40c15676da8b27ae
2022-11-11 20:43:56 +00:00
Cameron
cb0c359b81 satellite/overlay: insert DQ node events for stray nodes
Change-Id: I99da11e506ab7f6bcebdb08a5815078a3297c932
2022-11-04 15:48:17 +00:00
Cameron
74ddfab810 satellite/overlay: insert DQ event into node events in overlay.DisqualifyNode
Also, return node email from overlaycache db DisqualifyNode to be used
in node events insertion

Change-Id: I41534cf01351c1690c3966a8055c5fe6fcf0d6a6
2022-11-04 15:18:31 +00:00
Cameron
8c8688ca6b satellite/overlay: return email as part of NodeReputation
Make email accessible for email sending after reputation updates

Change-Id: I760feee0f6ca58b76a2955a04c0c366c618656bb
2022-10-26 17:33:22 +00:00
Michal Niewrzal
a22e6bdf67 satellite/gc/bloomfilter: use int64 to count pieces
Pieces count in DB are stored as int64 and we would like to align bloom
filter processing with this type.

Change-Id: Iaec767e609a40d802077ae057520541805a7c44f
2022-09-22 09:39:53 +00:00
JT Olio
12ef68bdf4 satellitedb: restore-trash shouldn't bother with nodes we've never talked to
Change-Id: I3c631e92a9a2670c52c9fa23b2b1baf67d3c9a9c
2022-08-25 16:14:29 +00:00
Egon Elbre
65b5a0fe82 satellite/{overlay,satellitedb}: add AS OF to download selection
This should reduce the load caused by download selection queries.

Change-Id: Ic1a89d9c3eb5418f0792eb20ec2aece18dc63f2c
2022-06-28 05:18:01 +00:00
paul cannon
aa728bd6ea satellite/satellitedb: add reputations.disqualification_reason
We added nodes.disqualification_reason recently, but we didn't add a
corresponding column in the reputations table (despite having a
corresponding `disqualified` column there).

Without this change, the (very useful and informative) assignments to
updateFields.DisqualificationReason in reputations.go have no effect.

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I77404902ca64b56aed72f1de76b303fe82b76aab
2022-05-17 10:09:36 -05:00
Yaroslav Vorobiov
3f47d19aa6 satellite/overlay: add disqualification reason
Add disqualification reason to NodeDossier.
Extend DB.DisqualifyNode with disqualification reason.
Extend reputation Service.TestDisqualifyNode with disqualification reason.

Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d
2022-04-20 13:29:31 +00:00
Yaroslav Vorobiov
4223fa01f8 satellite/reputation: add disqualification reason for status update
Set disqualification reason when reputations stats are updated on DB.Update.
Added tests for DisqualifyNode and for disqualification cases which happens during Update.

Change-Id: I00130ab5d9722422805159ad2f183c205de60f7e
2022-04-20 13:29:10 +00:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
Fadila Khadar
e776c65172 satellite/checker: pieces in excluded countries are not healthy
Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection.

Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.

Another change will handle the logic at the repairer level.

Fixes https://github.com/storj/team-metainfo/issues/95

Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
2022-03-02 09:59:09 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status
inconsistency

The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
Márton Elek
76c2228fbd satellite/metainfo: propagate geofencing between buckets and stream id
Github: https://github.com/storj/storj/issues/4245

Change-Id: I83d34367aab1f3c0d46a044f54980b2d50174b19
2021-11-24 08:05:05 +00:00
Mya
bf51c286d9
satellite/geoip: update node check-in to associate a country code
Resolves https://github.com/storj/storj/issues/4247

Change-Id: Idfd71bf1795d48ca3c686066bbdb95b9c6594f00
2021-11-10 16:44:41 +01:00
Cameron Ayer
1de8a695e8 satellite/{overlay,satellitedb}: fix stray nodes DQ bug
We had a bug in the stray nodes chore where nodes who had not been seen
in several months were not being DQd. We figured out that this was
happening because we were using two queries: The first to grab
nodes where last_contact_success < some cutoff, the second to DQ them
unless last_contact_success == '0001-01-01 00:00:00+00'. The problem
is that if all of the nodes returned from the first query had
last_contact_success of '0001-01-01 00:00:00+00', we would pass them to
the second query which would not DQ them. This would result in the stray
nodes DQ loop ending since we found a number of nodes to DQ less than the
limit.

The fix: add the "WHERE last_contact_success != '0001-01-01
00:00:00+00'::timestamptz" to the selection query.

Change-Id: I4e60de90b68d8745d641b4467c2b23e0e56f7dff
2021-11-02 17:05:00 +00:00
Cameron Ayer
bb21551a9c satellite/satellitedb: remove references to contained column in nodes table
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.

Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
2021-10-14 19:17:46 +00:00
Yingrong Zhao
6c34ff64ad satellite/satellitedb: remove referrence to audit information in
nodes and audit_history tables

This PR removes all code reference to audit_histories table and
```
audit_reputation_alpha, audit_reputation_beta,
unknown_audit_reputation_alpha, unknown_audit_reputation_beta,
```
columns from nodes table.

It also drops audit_histories table from the db since the code
that's referencing it currently are not being used.

Change-Id: Ifcda8db36afb3a333d487ff831f2fdefc8b02a4c
2021-08-13 21:11:28 +00:00
Yingrong Zhao
e4cc965c39 satellite/satellitedb: replace explicit transaction with dbx query for
UpdateReputation

Change-Id: I7c139ededea83d4b58107536c3a031c4f92d6eb4
2021-08-05 17:09:49 +00:00
Egon Elbre
65804801ec all: fix mon.Task leak
Change-Id: Ifd58c7ac5631b9c3c750b3f4cc50525167e90709
2021-08-05 14:07:45 +03:00
Yingrong Zhao
646ce5b8cc satellite/overlay: remove reputation logic from overlay
Change-Id: I3492860e4537c7a8e4e824ec4c9c8d179134a0c0
2021-07-28 15:15:28 -04:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Yingrong Zhao
e91574cee1 satellite/{reputation, gracefulexit}: use reputation store in
gracefulexit

With the effort to move audit related data into reputation store, this
PR updates gracefulexit endpoint to use reputation service to get a
node's audit score

Change-Id: Iad93ea689ad67ff9c57c7be16687e21e715fab7a
2021-07-28 13:21:41 -04:00
Yingrong Zhao
6c7bf357cd satellite/{reputation,audit,overlay}: replace overlay with reputation
package in audit

This PR implements reputation store and replace overlay in audit service
to use such store for storing node's audit stats.

In order to keep the changeset smaller, most of the changes in this PR is for copying audit logic in overlay to
reputation package. In a following PR, the duplicating code will be
removed from overlay.

Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
2021-07-28 13:10:48 -04:00
Cameron Ayer
8c124c6fa4 satellite/{reputation,overlay,satellitedb}: create reputation service, DB, add overlay method UpdateReputation
Define service and DB interface for storing node reputation data
and updating the overlay cache.
Add overlay service and DB method UpdateReputation.
See https://github.com/storj/storj/pull/4144

Change-Id: Iedd8bd3274457d26c595919303d55327c1464b8c
2021-06-24 16:19:15 +00:00
Cameron Ayer
53322bb0a7 satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats
Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.

Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
2021-06-01 21:02:44 +00:00
Egon Elbre
cdcc67207c satellite/satellitedb: fix nil panic in UpdateCheckIn
Change-Id: If6ae2c3d9b7c269b0a9d652e68854091f668b5ec
2021-05-25 00:30:36 +03:00
Egon Elbre
8f15f975a2 satellite/overlay: improve contended update checkin
Improve UpdateCheckIn on a contended row:

  name                             old time/op  new time/op delta
  UpdateCheckInContended-100x-32   2.29s ±55%   0.17s ±61%  -92.45%  (p=0.008 n=5+5)

Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c
2021-05-16 20:41:12 +03:00
Egon Elbre
0858c3797a satellite/{metabase,satellitedb}: deduplicate AS OF SYSTEM TIME code
Currently we were duplicating code for AS OF SYSTEM TIME in several
places. This replaces the code with using a method on
dbutil.Implementation.

As a consequence it's more useful to use a shorter name for
implementation - 'impl' should be sufficiently clear in the context.

Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish
between the two modes is useful and slightly shorter without causing
confusion.

Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d
2021-05-11 12:40:36 +03:00
Cameron Ayer
bb343d9028 satellite/satellitedb: don't remove offline nodes from containment
When audits are being recorded, we automatically add some SQL to remove
the node from the pending audits table in case it exists. They are
removed from pending audits even if the node was offline for the audit.
This is not the correct behavior.

Add statement to record audit results in reverify tests to ensure no
more false positives.

Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17
2021-05-03 16:05:55 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
Egon Elbre
a2e20c93ae private/dbutil: use dbutil and tagsql from storj.io/private
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.

Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
2021-04-23 14:36:52 +03:00
Cameron Ayer
a0c5da6643 satellite/satellitedb: in stray nodes DQ, don't DQ nodes where last_contact_success = '0001-01-01 00:00:00+00'
When nodes check in for the very first time, if the satellite can't ping
them back, they are inserted into the nodes table with
last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes
chore runs before the node can fix their problem, they are DQd.

Solution: when DQing stray nodes, dont DQ where last_contact_success =
'0001-01-01 00:00:00+00'::timestamptz

Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8
2021-04-22 10:13:13 -04:00
Jeff Wendling
a65aecfd98 compensation: always generate invoices for every node
instead of only generating invoices for nodes that had some
activity, we generate it for every node so that we can find
and pay terminal nodes that did not meet thresholds before
we recognized them as terminal.

Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44
2021-03-29 14:15:45 +00:00
Cameron Ayer
05f8d2d0b1 satellite/satellitedb: filter offline suspended nodes from selection
Change-Id: I5a6f413453332238d579a7bf50eb30e9156f96c2
2021-03-27 23:36:46 +00:00
Cameron Ayer
1a51049ac0 satellite/{overlay,satellitedb}: add flag to toggle suspending nodes for offline audits
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.

If the flag is set to true, nodes will be suspended if they meet the
requirements.

If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.

Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
2021-03-27 16:28:27 +00:00
Cameron Ayer
eb44dc21b4 satellite/satellitedb: select stray nodes for DQ in separate tx from update
Previously we would select a limited number of nodes for DQ in a
CTE and run the update on that set in a single transaction. This
could lead to locking on the table, so instead we select and update
in separate transactions.

Change-Id: I1e802c0845e829eeadcee4fa382f58462515fdb1
2021-03-27 00:00:23 +00:00