Commit Graph

28 Commits

Author SHA1 Message Date
Andrew Harding
590d44301c satellite/audit: implement rangedloop observer
This change implements the ranged loop observer to replace the audit
chore that builds the audit queue.

The strategy employed by this change is to use a collector for each
segment range to  build separate per-node segment reservoirs that are
then merge them during the join step.

In previous observer migrations, there were only a handful of tests so
the strategy was to duplicate them. In this package, there are dozens
of tests that utilize the chore. To reduce code churn and maintenance
burden until the chore is removed, this change introduces a helper that
runs tests under both the chore and observer, providing a pair of
functions that can be used to pause or run the queueing function.

https://github.com/storj/storj/issues/5232

Change-Id: I8bb4b4e55cf98b1aac9f26307e3a9a355cb3f506
2023-01-03 08:52:01 -07:00
paul cannon
7b851b42f7 satellite/audit: split out auditor process
This change creates a new independent process, the 'auditor', comparable
to the repairer, gc, and api processes. This will allow auditors to be
scaled independently of the core.

Refs: https://github.com/storj/storj/issues/5251
Change-Id: I8a29eeb0a6e35753dfa0eab5c1246048065d1e91
2022-12-16 12:44:32 -06:00
paul cannon
a66503b444 satellite/audit: Begin using piecewise reverifications
This commit pulls the big switch! We have been setting up piecewise
reverifications (the workers for which can be scaled independently of
the core) for several commits now, and this commit actually begins
making use of them.

The core of this commit is fairly small, but it requires changing the
semantics in all the tests that relate to reverifications, so it ends up
being a large change. The changes to the tests are mostly mechanical and
repetitive, though, so reviewers needn't worry much.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: Ibb421cc021664fd6e0096ffdf5b402a69b2d6f18
2022-12-16 14:21:13 +00:00
paul cannon
c575a21509 satellite/audit: add audit.ReverifyWorker
Here we add a worker class comparable to audit.Worker, which will be
responsible for pulling items off of the reverification queue and
calling reverifier.ReverifyPiece on them.

Note that piecewise reverification audits (which this will control) are
not yet being done. That is, nothing is being added to the
reverification queue at this point.

Refs: https://github.com/storj/storj/issues/5251
Change-Id: I94e28830e27caa49f2c8bd4a2336533e187ab69c
2022-12-12 11:57:08 +00:00
paul cannon
1854351da6 satellite/audit: teach Reporter about piecewise audits
The Reporter is responsible for processing results from auditing
operations, logging the results, disqualifying nodes that reached
the maximum reverification count, and passing the results on to
the reputation system.

In this commit, we extend the Reporter so that it knows how to process
the results of piecewise reverification audits.

We also change most reporter-related tests so that reverifications
happen as piecewise reverification audits, exercising the new code.

Note that piecewise reverification audits are not yet being done outside
of tests. In a later commit, we will switch from doing segmentwise
reverifications to piecewise reverifications, as part of the
audit-scaling effort.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9
2022-12-12 11:28:02 +00:00
paul cannon
b612ded9af satellite/audit: help performance of pushing to audit queue
The audit chore will be pushing a large number of segments to be
audited, and the db might choke on that large insert when under load.

This change divides the insert up into batches, which can be sized
however is optimal for the backing database. It also arranges for
segments to be inserted in the order of the primary key, which helps
performance on some systems.

Refs: https://github.com/storj/storj/issues/5228

Change-Id: I941f580f690d681b80c86faf4abca2995e37135d
2022-11-29 15:37:49 +00:00
paul cannon
8b494f3740 satellite/audit: use db for auditor queue
As part of the effort of splitting out the auditor workers to their own
process, we are transitioning the communication between the auditor
chore and the verification workers to a queue implemented in the
database, rather than the sequence of in-memory queues we used to use.

This logical database is safely partitionable from the rest of
satelliteDB.

Refs: https://github.com/storj/storj/issues/5251

Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc
2022-11-22 14:04:00 +00:00
Egon Elbre
bc9ab8ee5e satellite/audit,storagenode/gracefulexit: fixes to limiter
Ensure we don't rely on limiter to wait multiple times.

Change-Id: I75d48420236216d4c2fc6fa99293f51f80cd9c33
2022-08-03 10:24:16 +03:00
paul cannon
fd01c6cc25 satellite/{repair,audit}: simplify reputation reporter
Also, make it an interface so that the upcoming write cache can be
dropped in to the same place.

Change-Id: I2c286743825e647c0cef5b6578245391851fa10c
2022-05-10 14:04:43 +00:00
Cameron Ayer
70296c5050 satellite/audit: change wording of audit worker error log
"audit failed" is already used when a node fails an audit. That makes
searching for this higher level audit worker error more difficult.
Additionally, the presence of errors from the audit worker doesn't
necessarily mean the audit failed. Reword the error message to
"error(s) during audit"

Change-Id: I0aab12c73c18d4bd962c5d8ac8a17cabcec022e6
2021-08-20 13:27:16 +00:00
Michał Niewrzał
70e6cdfd06 satellite/audit: move to segmentloop
Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51
2021-06-28 11:32:00 +00:00
JT Olio
da9ca0c650 testplanet/satellite: reduce the number of places default values need to be configured
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.

As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.

This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.

In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.

Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
2021-06-01 22:14:17 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Moby von Briesen
5d21e85529 satellite/audit/queue: Separate audit queue into two separate structs.
* The audit worker wants to get items from the queue and process them.
* The audit chore wants to create new queues and swap them in when the
old queue has been processed.

This change adds a "Queues" struct which handles the concurrency
issues around the worker fetching a queue and the chore swapping a new
queue in. It simplifies the logic of the "Queue" struct to its bare
bones, so that it behaves like a normal queue with no need to understand
the details of swapping and worker/chore interactions.

Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b
2020-08-31 20:51:25 +00:00
Jennifer Johnson
18078bf7ee satellite/audit: increases audit worker concurrency to 2
Change-Id: Ibe3e3801b79accffbcfe9e2e02c96fc963894a7f
2020-05-05 11:31:55 +00:00
Egon Elbre
8dea4f52db satellite: add control panel
Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b
2020-01-30 08:06:43 -05:00
Egon Elbre
f41d440944 all: reduce number of log messages
Remove starting up messages from peers. We expect all of them to start,
if they don't, then they should return an error why they don't start.
The only informative message is when a service is disabled.

When doing initial database setup then each migration step isn't
informative, hence print only a single line with the final version.

Also use shorter log scopes.

Change-Id: Ic8b61411df2eeae2a36d600a0c2fbc97a84a5b93
2020-01-06 19:03:46 +00:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
Jeff Wendling
17b057b33e satellite/audit: monitor worker function
Change-Id: I94d1161deffe4ea9782abee1afbb5735f18aab44
2019-11-25 17:58:13 +00:00
Maximillian von Briesen
8653dda2b1 satellite/audit: do not contain nodes for unknown errors (#3592)
* skip unknown errors (wip)

* add tests to make sure nodes that time out are added to containment

* add bad blobs store

* call "Skipped" "Unknown"

* add tests to ensure unknown errors do not trigger containment

* add monkit stats to lockfile

* typo

* add periods to end of bad blobs comments
2019-11-19 17:30:28 +01:00
Egon Elbre
ee6c1cac8a
private: rename internal to private (#3573) 2019-11-14 21:46:15 +02:00
littleskunk
def3dcbaa9
satellite/audit: increase timeout to 5 minutes (#3480)
* satellite/audit: increase timeout to 5 minutes

* fix lint error
2019-11-05 11:21:25 +01:00
littleskunk
eeb38245ff
satellite/audit: improve logging (#3285) 2019-10-16 13:48:05 +02:00
Maximillian von Briesen
784ca1582a
satellite/audit: fix audit panic (#3217) 2019-10-09 10:06:58 -04:00
Natalie Villasana
4aaf525bd3
satellite/audit: set devDefaults for ChoreInterval and QueueInterval to 1m (#3058) 2019-09-16 16:36:33 -04:00
Natalie Villasana
aa3567187e
satellite/audit: worker now verifies and reverifies (#2965) 2019-09-11 18:37:01 -04:00
Natalie Villasana
6d363fb756
satellite/audit: create the audit queue, chore, and worker (#2888) 2019-09-05 11:40:52 -04:00