Commit Graph

35 Commits

Author SHA1 Message Date
paul cannon
355ea2133b satellite/audit: remove pieces when audits fail
When pieces fail an audit (hard fail, meaning the node acknowledged it
did not have the piece or the piece was corrupted), we will now remove
those pieces from the segment.

Previously, we did not do this, and some node operators were seeing the
same missing piece audited over and over again and losing reputation
every time.

This change will include both verification and reverification audits. It
will also apply to pieces found to be bad during repair, if
repair-to-reputation reporting is enabled.

Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7
2023-06-22 14:19:00 +00:00
Michal Niewrzal
1aa24b9f0d satellite/audit: remove segments loop parts
We are switching completely to ranged loop.

https://github.com/storj/storj/issues/5368

Change-Id: I9cec0ac454f40f19d52c078a8b1870c4d192bd7a
2023-04-24 15:52:11 +00:00
Cameron
38f94c1c29 satellite/reputation: if node is DQd or exited skip applying audit
If a node is already disqualified or exited, don't bother going through
the audit reputation code. If there's an additional reputation update to
be made based on the automatic offline audit these nodes would get, the
overlay DB rejects it anyway.

Change-Id: I48ba151aa640b32f87e1285521be465d1f692a78
2023-02-21 18:31:21 +00:00
Moby von Briesen
395426977f satellite/reputation: Alter logic for hasReputationChanged
This change modifies "reputation change trigger" logic for dq and
suspension in order to:
* count only nil -> not-nil and not-nil -> nil as "reputation changes"
* do not count the timestamp updating (not-nil -> not-nil) as a
  reputation change

Change-Id: Ie57c97e80e5f3b368e2f2dc004e680a2f9416d1c
2023-02-03 21:17:57 -05:00
paul cannon
b6bcb32ecf satellite/reputation: more accurate "reputation changes" list
`overlay.(*Service).UpdateReputation()` takes a "reputationChanges"
parameter, a slice of node events indicating whether we think the node's
disqualification or suspension status is changing. This is necessary so
that the overlay service can notify the nodeevents DB about these
changes.

In several cases, however, this list of events is not constructed
correctly, because of missing information about the previous state.
In most cases, this is because the node was offline, and the order limit
creation functions (which usually obtain and return the prior reputation
status) ignored that node.

This change makes it so that all callers to
`overlay.(*Service).UpdateReputation()` can be expected to provide a
correct list of change events (as correct as feasible, given that we
can't lock the node's information in the database during the entire
operation).

It ended up that there was only one caller we needed to worry about, and
that was reputation.(*Service).ApplyAudit(). So the bulk of this change
is teaching that function how to recognize when the prior reputation
status was not filled in, and fill it in.

Refs: https://github.com/storj/storj/issues/5464
Change-Id: I52ce385fc9c0ce3b283b998d517998e7f4ec8792
2023-01-31 18:39:40 +00:00
paul cannon
7b851b42f7 satellite/audit: split out auditor process
This change creates a new independent process, the 'auditor', comparable
to the repairer, gc, and api processes. This will allow auditors to be
scaled independently of the core.

Refs: https://github.com/storj/storj/issues/5251
Change-Id: I8a29eeb0a6e35753dfa0eab5c1246048065d1e91
2022-12-16 12:44:32 -06:00
paul cannon
1854351da6 satellite/audit: teach Reporter about piecewise audits
The Reporter is responsible for processing results from auditing
operations, logging the results, disqualifying nodes that reached
the maximum reverification count, and passing the results on to
the reputation system.

In this commit, we extend the Reporter so that it knows how to process
the results of piecewise reverification audits.

We also change most reporter-related tests so that reverifications
happen as piecewise reverification audits, exercising the new code.

Note that piecewise reverification audits are not yet being done outside
of tests. In a later commit, we will switch from doing segmentwise
reverifications to piecewise reverifications, as part of the
audit-scaling effort.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9
2022-12-12 11:28:02 +00:00
Cameron
68fe26ebe5 satelite/overlay: insert reputation events into node events
Insert reputation event into node events if reputation change occurs.

Change-Id: If1c5526092cb6834fe2faa6aa6e0306d4d88a4b7
2022-11-02 18:32:20 +00:00
Cameron
4be042154c sdatellite/reputation: add overlay service to reputation service
Change-Id: I7b1af819f9e1989578335247730378a8658bfe7f
2022-10-26 17:03:29 +00:00
Jeremy Wharton
eae71ae06b satellite/reputation: remove tautology in node status comparison
The procedure responsible for node reputation status comparison could
return an invalid result due to comparing a status timestamp against
itself rather than comparing it against another. This results in
unnecessary database updates that could be avoided otherwise.
This change modifies the procedure to resolve this issue.

Change-Id: Id147e1942e994e8bca4ced2a9358f2474927d6ec
2022-10-24 17:14:13 +00:00
paul cannon
0dcc0a9ee0 satellite/reputation: reconfigure lambda and alpha
This is in response to community feedback that our existing reputation
calculation is too likely to disqualify storage nodes unfairly with
extreme swings up and down.

For details and analysis, please see the data_loss_vs_dq_chance_sim.py
tool, the "tuning reputation further.ipynb" Jupyter notebook in the
storj/datascience repository, and the discussion at

    https://forum.storj.io/t/tuning-audit-scoring/14084

In brief: changing the lambda and initial-alpha parameters in this way
causes the swings in reputation to be smaller and less likely to put a
node past the disqualification threshold unfairly.

Note: this change will cause a one-time reset of all (non-disqualified)
node reputations, because the new initial alpha value of 1000 is
dramatically different, and the disqualification threshold is going to
be much higher.

Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3
2022-08-17 18:52:53 +00:00
paul cannon
37a4edbaff all: reformat comments as required by gofmt 1.19
I don't know why the go people thought this was a good idea, because
this automatic reformatting is bound to do the wrong thing sometimes,
which is very annoying. But I don't see a way to turn it off, so best to
get this change out of the way.

Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3
2022-08-10 18:24:55 +00:00
paul cannon
799b159bba satellite/reputation: offset write times by random, not by satelliteID
In an effort to distribute load on the reputation database, the
reputation write cache scheduled nodes to be written at a time offset by
the local nodeID. The idea was that no two repair workers would have the
same nodeID, so they would not tend to write to the same row at the same
time.

Instead, since all satellite processes share the same satellite ID
(duh), this caused _all_ workers to try and write to the same row at the
same time _always_. This was not ideal.

This change uses a random number instead of the satellite ID. The random
number is sourced from the number of nanoseconds since the Unix epoch.
As long as workers are not started at the exact same nanosecond, they
ought to get well-distributed offsets.

Change-Id: I149bdaa6ca1ee6043cfedcf1489dd9d3e3c7a163
2022-08-03 21:14:06 +00:00
paul cannon
2f20bbf4d8 satellite/reputation: add a reputation write cache
This should lower the amount of database load coming from
reputation updates.

Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949
2022-07-14 21:40:16 +00:00
paul cannon
737d7c7dfc satellite/reputation: new ApplyUpdates() method
The ApplyUpdates() method on the reputation.DB interface acts like the
similar Update() method, but can allow for applying the changes from
multiple audit events, instead of only one.

This will be necessary for the reputation write cache, which will batch
up changes to each node's reputation in order to flush them
periodically.

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I44cc47767ea2d9423166bb8fed080c8a11182041
2022-06-07 15:22:25 +00:00
paul cannon
5bc4c254d4 satellite/reputation: move Config fields into a Config struct
This has been a cause of some confusion, even though the fields are
labeled as being copies of config values.

Having them be under a field explicitly named "Config" makes this
clearer, plus, allows the values to be passed in simply as a copy
of the Config struct from the satellite, rather than copying the fields
individually (which can be error-prone, particularly as the AuditCount
field in UpdateRequest is apparently not the same thing as the
AuditCount field in reputation.Config).

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I386953347b71068596618616934aa28e3245cdc1
2022-05-24 20:57:51 +00:00
paul cannon
2ee9463e34 satellite/reputation: don't need 3 identical AuditHistory types
The two protobuf types are identical except that one is in our common/pb
package, and the other is in internalpb. Since the type is public
already, and there is no difference in the internal one, it seems better
to use the public one for all satellite needs.

There is also another type which is essentially identical, but which is
not a protobuf type, also called "AuditHistory". It looks like we don't
ever actually need to have a separate type from the protobuf one.

This change makes us use "storj/common/pb".AuditHistory for all of our
AuditHistory needs.

Refs: https://github.com/storj/storj/issues/4601

Change-Id: If845fde21bb31c801db6d67ffc9a146d1617b991
2022-05-24 05:48:46 +00:00
paul cannon
951a842fd4 satellite/{reputation,satellitedb}: move addAudit
This functionality will be needed in both packages, so here we move it
into the more general reputation-code package and export it for use in
satellitedb.

This also removes the related UpdateAuditHistory() signature from the
reputation DB interface, since it doesn't have anything to do with the
db. It doesn't need to be a method, either.

Finally, this changes the test for addAudit to be a plain test function
instead of using testplanet.Run(). It didn't need a whole testplanet
setup or any databases.

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I90f6a909e5404f03ad776b95cfa2f248308c57c1
2022-05-20 15:37:31 +00:00
paul cannon
ac7752ac3c satellite/reputation: return full Info from reputation update
This will let us update our reputation cache when writing through to the
db.

Since the information is already being fetched from the db and returned
to the application, the extra cpu load here should be minimal.

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I2b8619f2c0d541893c7d3e7d33b1863b96775ebd
2022-05-18 17:22:06 +00:00
paul cannon
aa728bd6ea satellite/satellitedb: add reputations.disqualification_reason
We added nodes.disqualification_reason recently, but we didn't add a
corresponding column in the reputations table (despite having a
corresponding `disqualified` column there).

Without this change, the (very useful and informative) assignments to
updateFields.DisqualificationReason in reputations.go have no effect.

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I77404902ca64b56aed72f1de76b303fe82b76aab
2022-05-17 10:09:36 -05:00
paul cannon
fd01c6cc25 satellite/{repair,audit}: simplify reputation reporter
Also, make it an interface so that the upcoming write cache can be
dropped in to the same place.

Change-Id: I2c286743825e647c0cef5b6578245391851fa10c
2022-05-10 14:04:43 +00:00
Yaroslav Vorobiov
3f47d19aa6 satellite/overlay: add disqualification reason
Add disqualification reason to NodeDossier.
Extend DB.DisqualifyNode with disqualification reason.
Extend reputation Service.TestDisqualifyNode with disqualification reason.

Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d
2022-04-20 13:29:31 +00:00
Yaroslav Vorobiov
4223fa01f8 satellite/reputation: add disqualification reason for status update
Set disqualification reason when reputations stats are updated on DB.Update.
Added tests for DisqualifyNode and for disqualification cases which happens during Update.

Change-Id: I00130ab5d9722422805159ad2f183c205de60f7e
2022-04-20 13:29:10 +00:00
Yaroslav Vorobiov
ddbbb0038b satellite/satellitedb: remove suspension column from nodes and reputations
Remove redundant suspension timestamp column from nodes and reputation tables.
Suspended timestamp was moved to unknown_audit_suspended and suspended column is
no longer used so there is no point in keeping both.

Change-Id: Ieea3f12141b33ec9efe7594f4c9dbc7e10675b0e
2022-03-21 16:56:12 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status
inconsistency

The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
Cameron Ayer
56fe636123 satellite/{reputation/satellitedb}: remove references to contained column in reputations table
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.

Change-Id: I5671722a5fc6e1749d3a49e187a56556000ff941
2021-10-14 19:59:03 +00:00
Yingrong Zhao
c074a5666b satellite/satellitedb: improve Update query for reputation
Change-Id: Iee140f726cd05c34028c7b532e1f855e2473ddbc
2021-08-10 13:06:13 +00:00
Yingrong Zhao
ae02f6deda satellite/reputation: return default reputation stats when node is not
found

Change-Id: I587d0ab36ffa0efaf345a6a6e221ae5d2068e1c5
2021-08-04 19:34:54 +00:00
Yingrong Zhao
d3e51353ab satellite/satellitedb/reputation: replace audit_histories table with
reputation

Change-Id: I18417eaa0d146cec876020dc6a358d13992e1d5f
2021-07-28 19:21:16 -04:00
Yingrong Zhao
646ce5b8cc satellite/overlay: remove reputation logic from overlay
Change-Id: I3492860e4537c7a8e4e824ec4c9c8d179134a0c0
2021-07-28 15:15:28 -04:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Yingrong Zhao
e91574cee1 satellite/{reputation, gracefulexit}: use reputation store in
gracefulexit

With the effort to move audit related data into reputation store, this
PR updates gracefulexit endpoint to use reputation service to get a
node's audit score

Change-Id: Iad93ea689ad67ff9c57c7be16687e21e715fab7a
2021-07-28 13:21:41 -04:00
Yingrong Zhao
55a77d04bc satellite/satellitedb,private: add initial value on testplanet startup
Currently, reputation table is only populated when a node has been
audited. This is ok in production, however a lot of our tests doesn't
upload any data or trigger audits.
This PR adds an initialization step in testplanet to populate reputation
table with zero value for nodes reputation.

Change-Id: I11b381236669db346dc68a48a6d4a27334a0a8b8
2021-07-28 13:20:32 -04:00
Yingrong Zhao
6c7bf357cd satellite/{reputation,audit,overlay}: replace overlay with reputation
package in audit

This PR implements reputation store and replace overlay in audit service
to use such store for storing node's audit stats.

In order to keep the changeset smaller, most of the changes in this PR is for copying audit logic in overlay to
reputation package. In a following PR, the duplicating code will be
removed from overlay.

Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
2021-07-28 13:10:48 -04:00
Cameron Ayer
8c124c6fa4 satellite/{reputation,overlay,satellitedb}: create reputation service, DB, add overlay method UpdateReputation
Define service and DB interface for storing node reputation data
and updating the overlay cache.
Add overlay service and DB method UpdateReputation.
See https://github.com/storj/storj/pull/4144

Change-Id: Iedd8bd3274457d26c595919303d55327c1464b8c
2021-06-24 16:19:15 +00:00