Commit Graph

185 Commits

Author SHA1 Message Date
Andrew Harding
93fad70e4b satellite/audit: prevent accessing unset reservoir segments
This change fixes the access of unset segments and keys on the reservoir
when the reservoir size is less than the max OR the number of sampled
segments is smaller than the reservoir size. It does so by tucking away
the segments and keys behind methods that return properly sized slices
into the segments/keys arrays.

It also fixes a bug in the housekeeping for the internal index variable
that holds onto how many items in the array have been populated. As part
of this fix, it changes the type of index to int8, which reduces the
size of the reservoir struct by 8 bytes.

The tests have been updated to provide better coverage for this case.

Change-Id: I3ceb17b692fe456fc4c1ca5d67d35c96aeb0a169
2022-12-14 17:43:17 -07:00
paul cannon
47b9134f76 satellite/audit: add IdentifyContainedNodes
This method on the Verifier allows the caller to find, out of the nodes
holding pieces in a given segment, which ones are contained.

This method is not yet being used. It will be in a future commit.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: I242cd999913ca4dabbe8a62767ed4869b31fca04
2022-12-13 20:46:43 +00:00
paul cannon
c575a21509 satellite/audit: add audit.ReverifyWorker
Here we add a worker class comparable to audit.Worker, which will be
responsible for pulling items off of the reverification queue and
calling reverifier.ReverifyPiece on them.

Note that piecewise reverification audits (which this will control) are
not yet being done. That is, nothing is being added to the
reverification queue at this point.

Refs: https://github.com/storj/storj/issues/5251
Change-Id: I94e28830e27caa49f2c8bd4a2336533e187ab69c
2022-12-12 11:57:08 +00:00
paul cannon
1854351da6 satellite/audit: teach Reporter about piecewise audits
The Reporter is responsible for processing results from auditing
operations, logging the results, disqualifying nodes that reached
the maximum reverification count, and passing the results on to
the reputation system.

In this commit, we extend the Reporter so that it knows how to process
the results of piecewise reverification audits.

We also change most reporter-related tests so that reverifications
happen as piecewise reverification audits, exercising the new code.

Note that piecewise reverification audits are not yet being done outside
of tests. In a later commit, we will switch from doing segmentwise
reverifications to piecewise reverifications, as part of the
audit-scaling effort.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9
2022-12-12 11:28:02 +00:00
paul cannon
231c783698 satellite/audit: fix reservoir sampling bias
While researching logs from a large set of audits, I noticed that nearly
all of them had streamIDs starting with 0 or 1. This seemed very odd,
because streamIDs are supposed to be pretty much entirely random, and
every hex digit from 0-f should have been represented with roughly equal
frequency.

It turned out that our A-Chao implementation of reservoir sampling is
flawed. As far as we can tell, so is the Wikipedia implementation. No
one has yet reviewed the original 1982 paper by Dr. Chao in enough
detail to know where the error originated, but we do know that we have
been auditing segments near the beginning of the segment loop (low
streamIDs) far more often than segments near the end of the segment loop
(high streamIDs).

This change uses an algorithm Wikipedia calls "A-Res" instead, and adds
a test to check for that sort of bias creeping back in somehow. A-Res
will be slightly slower than A-Chao, because of a few extra steps that
need to be done, but it does appear to be selecting items uniformly.

Change-Id: I45eba4c522bafc729cebe2aab6f3fe65cd6336be
2022-12-09 18:16:58 -06:00
paul cannon
35f60fd5eb satellite/audit: change signature of ReverifyPiece
First, adding a logger argument allows the caller to have a logger
already set up with whatever extra fields they want here.

Secondly, we need to return the Outcome instead of a simple boolean so
that it can be passed on to the Reporter later (need to make the right
decision on increasing reputation vs decreasing it).

Thirdly, we collect the cached reputation information from the overlay
when creating the Orders, and return it from ReverifyPiece. This will
allow the Reporter to determine what reputation-status fields need to be
updated, similarly to how we include a map of ReputationStatus objects
in an audit.Report.

Refs: https://github.com/storj/storj/issues/5251
Change-Id: I5700b9ce543d18b857b81e684323b2d21c498cd8
2022-12-07 18:32:42 +00:00
paul cannon
378b8915c4 satellite/{satellitedb,audit}: add NewContainment
NewContainment will replace Containment later in this commit chain, but
for now it is not yet being used.

NewContainment will allow a node to be contained for multiple pending
reverify jobs at a time. It is implemented by way of the reverify queue.

Refs: https://github.com/storj/storj/issues/5231
Change-Id: I126eda0b3dfc4710a88fe4a5f41780618ec19101
2022-12-07 18:03:37 +00:00
paul cannon
2617925f8d satellite/audit: Split Reverifier from Verifier
Adding a new worker comparable to Verifier, called Reverifier; as the
name suggests, it will be used for reverifications, whereas Verifier
will be used for verifications.

This allows distinct logging from the two classes, plus we can add some
configuration that is specific to the Reverifier.

There is a slight modification to GetNextJob that goes along with this.

This should have no impact on operational concerns.

Refs: https://github.com/storj/storj/issues/5251
Change-Id: Ie60d2d833bc5db8660bb463dd93c764bb40fc49c
2022-12-05 09:55:49 -06:00
paul cannon
601874f79a satellite/audit: retire audit.Queues
audit.Queues was the previous method of passing stacks of segments for
audit to the verifier. As of commit 68f9ce4a, this is now happening
by way of the auditor queue (database-backed, so that communication can
happen between multiple peers). audit.Queues is no longer needed.

Refs: https://github.com/storj/storj/issues/5228
Change-Id: I46f2d48d655fb66366c92146cdb6b85aef200552
2022-12-01 13:12:43 +00:00
paul cannon
fba39b72b8 satellite/audit: add GetByNodeID to ReverifyQueue
GetByNodeID will allow querying the reverification queue to see if there
are any pending jobs for a given node ID. And thus, to see if that node
ID should be contained or not.

Some parameters on the other methods of the ReverifyQueue interface have
been changed to accept pointers; this was done ahead of the rest of the
changes for the reverification queue to better match the signatures of
the methods that these will replace once ReverifyQueue is actually being
used (meaning fewer changes to tests).

Refs: https://github.com/storj/storj/issues/5251
Change-Id: Ic38ce6d2c650702b69f1c7244a224f00a34893a1
2022-12-01 12:14:49 +00:00
paul cannon
b612ded9af satellite/audit: help performance of pushing to audit queue
The audit chore will be pushing a large number of segments to be
audited, and the db might choke on that large insert when under load.

This change divides the insert up into batches, which can be sized
however is optimal for the backing database. It also arranges for
segments to be inserted in the order of the primary key, which helps
performance on some systems.

Refs: https://github.com/storj/storj/issues/5228

Change-Id: I941f580f690d681b80c86faf4abca2995e37135d
2022-11-29 15:37:49 +00:00
paul cannon
8b494f3740 satellite/audit: use db for auditor queue
As part of the effort of splitting out the auditor workers to their own
process, we are transitioning the communication between the auditor
chore and the verification workers to a queue implemented in the
database, rather than the sequence of in-memory queues we used to use.

This logical database is safely partitionable from the rest of
satelliteDB.

Refs: https://github.com/storj/storj/issues/5251

Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc
2022-11-22 14:04:00 +00:00
paul cannon
1b5d953fd9 satellite/satellitedb: add queue for verifications
Since the auditor will be moving to a different process from the
metainfo loop, we need a way of communicating which segments have
been chosen for audit. This queue will be that communication, for now.

Contrast this with the queue for _re_verifications in commit 9c67f62f.

Refs: https://github.com/storj/storj/issues/5251
Change-Id: I9a269c7ef21e6c5e9c6e5e1f3db298fe159a8a79
2022-11-09 15:38:56 +00:00
paul cannon
c54c45c9c7 satellite/audit: new ReverifyPiece implementation
ReverifyPiece() is not currently hooked up to anything, but is planned
to take the place of audit.(*Verifier).Reverify().

ReverifyPiece() works by downloading one piece in its entirety, rather
than pulling an entire stripe across many nodes.

Change-Id: Ie2c680f4d3c3b65273a72466a3f9f55c115b0311
2022-10-27 16:06:21 +00:00
paul cannon
9c67f62fe3 satellite/satellitedb: add table for reverify queue
This table will be used as a queue for pieces that need to be reverified
(a regular audit timed out on the owning node, so now that node is
contained and we need to validate the piece before un-containing it).

Refs: https://github.com/storj/storj/issues/5228

Change-Id: I5dcd26b6adced8674cbd81884c1543a61ea9d4c8
2022-10-27 15:28:47 +00:00
JT Olio
58a9c55f36 mod: bump dependencies
- storj.io/common

Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8
2022-10-19 17:01:53 +00:00
Egon Elbre
8b70f969b6 all: fix nolint directives
Change-Id: I261c8b12e4961e6401cc4024fa5abc35b1a5efa6
2022-10-11 18:31:20 +00:00
Michal Niewrzal
e37435602f satellite/audit: optimize loop observer
Two things were done to optimize audit observer:
* monik call was removed as we have different way to track it
* no new allocation for audit.Segment struct inside observer

Benchmark against 'main':
name                                         old time/op    new time/op    delta
RemoteSegment/Cockroach/multiple_segments-8    5.85µs ± 1%    0.74µs ± 4%   -87.28%  (p=0.008 n=5+5)

name                                         old alloc/op   new alloc/op   delta
RemoteSegment/Cockroach/multiple_segments-8    2.72kB ± 0%    0.00kB           ~     (p=0.079 n=4+5)

name                                         old allocs/op  new allocs/op  delta
RemoteSegment/Cockroach/multiple_segments-8      50.0 ± 0%       0.0       -100.00%  (p=0.008 n=5+5)

Change-Id: Ib973e48782bad4346eee1cd5aee77f0a50f69258
2022-10-02 22:24:37 +00:00
paul cannon
802ff18bd8 satellite/audit: better handling of piece fetch errors
We have an alert on `not_enough_shares_for_audit` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the audit framework to act in that way.

Instead, in this change, we add three new metrics,
`audit_not_enough_nodes_online`, `audit_not_enough_shares_acquired`, and
`audit_suspected_network_problem`. When an audit fails, and emits
`not_enough_shares_for_audit`, we will now determine whether it looks
like we are having network problems (most errors are connection
failures, possibly also some successful connections which subsequently
time out) or whether something else has happened.

After this is deployed, we can remove the alert on
`not_enough_shares_for_audit` and add new alerts on
`audit_not_enough_nodes_online` and `audit_not_enough_shares_acquired`.
`audit_suspected_network_problem` does not need an alert.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: Ibb256bc19d2578904f71f5229111ac98e5212fcb
2022-09-28 17:02:06 +00:00
paul cannon
7d0885bbaa satellite/repair: move over audit.Pieces
This structure is entirely unused within the audit module, and is only
used by repair code. Accordingly, this change moves the structure from
audit code to repair code.

Also, we take the opportunity here to rename the structure to something
less generic.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e
2022-09-22 16:43:03 +00:00
paul cannon
0dcc0a9ee0 satellite/reputation: reconfigure lambda and alpha
This is in response to community feedback that our existing reputation
calculation is too likely to disqualify storage nodes unfairly with
extreme swings up and down.

For details and analysis, please see the data_loss_vs_dq_chance_sim.py
tool, the "tuning reputation further.ipynb" Jupyter notebook in the
storj/datascience repository, and the discussion at

    https://forum.storj.io/t/tuning-audit-scoring/14084

In brief: changing the lambda and initial-alpha parameters in this way
causes the swings in reputation to be smaller and less likely to put a
node past the disqualification threshold unfairly.

Note: this change will cause a one-time reset of all (non-disqualified)
node reputations, because the new initial alpha value of 1000 is
dramatically different, and the disqualification threshold is going to
be much higher.

Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3
2022-08-17 18:52:53 +00:00
paul cannon
37a4edbaff all: reformat comments as required by gofmt 1.19
I don't know why the go people thought this was a good idea, because
this automatic reformatting is bound to do the wrong thing sometimes,
which is very annoying. But I don't see a way to turn it off, so best to
get this change out of the way.

Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3
2022-08-10 18:24:55 +00:00
Michal Niewrzal
6cc2052f47 satellite: fix segment loop observers metrics
We made optimization for segment loop observers to avoid
heavy monkit initialization on each call. It was applied to very
often executed methods. Unfortunately we used wrong monkit
method to track function times. Instead mon.Task we used
mon.Func().

https://github.com/spacemonkeygo/monkit#how-it-works

Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8
2022-08-10 14:13:16 +00:00
Egon Elbre
bc9ab8ee5e satellite/audit,storagenode/gracefulexit: fixes to limiter
Ensure we don't rely on limiter to wait multiple times.

Change-Id: I75d48420236216d4c2fc6fa99293f51f80cd9c33
2022-08-03 10:24:16 +03:00
paul cannon
2f20bbf4d8 satellite/reputation: add a reputation write cache
This should lower the amount of database load coming from
reputation updates.

Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949
2022-07-14 21:40:16 +00:00
Egon Elbre
b8006e192b satellite/audit: use a larger delay in the test
The MinDownloadTimeout 950ms and delay of 1s were quiet close, possibly
causing flaky behavior in TestVerifierSlowDownload.

Change-Id: I4f6c1554a118b21427357642abe39986fd0af38d
2022-06-28 17:03:23 +03:00
paul cannon
737d7c7dfc satellite/reputation: new ApplyUpdates() method
The ApplyUpdates() method on the reputation.DB interface acts like the
similar Update() method, but can allow for applying the changes from
multiple audit events, instead of only one.

This will be necessary for the reputation write cache, which will batch
up changes to each node's reputation in order to flush them
periodically.

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I44cc47767ea2d9423166bb8fed080c8a11182041
2022-06-07 15:22:25 +00:00
paul cannon
fd01c6cc25 satellite/{repair,audit}: simplify reputation reporter
Also, make it an interface so that the upcoming write cache can be
dropped in to the same place.

Change-Id: I2c286743825e647c0cef5b6578245391851fa10c
2022-05-10 14:04:43 +00:00
Michał Niewrzał
307295977d satellite/{audit,metrics}: optimize loop methods
What was applied:
* avoid extra map lookups
* reorganize monikit for less cpu usage

Change-Id: I70575f404f717f7905b27d43888cbd7489f0176d
2022-05-05 15:10:56 +00:00
Yaroslav Vorobiov
3f47d19aa6 satellite/overlay: add disqualification reason
Add disqualification reason to NodeDossier.
Extend DB.DisqualifyNode with disqualification reason.
Extend reputation Service.TestDisqualifyNode with disqualification reason.

Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d
2022-04-20 13:29:31 +00:00
Erik van Velzen
86d742f7c6 satellite/audit: verify auditing of copies
Check that audit works in the face of copies.

Closes https://github.com/storj/storj/issues/4695

Change-Id: I1ee79a73c28e3f4842eebe8c4e4cd9ecf2e51e57
2022-04-12 15:24:54 +00:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
Mya
05a17ef42d deps: upgrade storj.io/common
In addition to upgrading the storj.io/common library, this change
moves off the TCPConnector in favor of the HybridConnector per
the deprecation warning.

Change-Id: I7e7e1e7568e8b95e4a99ad9caa158a799e68e1e3
2022-02-16 18:59:19 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status
inconsistency

The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
dlamarmorgan
b3cea3d1b6 satellite/audit: account for piece size during audit reservoir sampling
Treat the piece size as a weight, and perform weighted reservoir sampling as given in Algorithm A-Chao (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-Chao)

Change-Id: I299d0026d9e02d03b3d2130b0f32192928e6e326
2021-12-01 18:17:52 +00:00
Egon Elbre
8eebbf3d7d satellite/audit: fix TestReverify timeouts
Currently the slow db was sleeping for 1s and the timeout for audit was
1s. There's a slight chance that the timeout won't trigger on such a
small difference.

Increase the slow node sleep to 10x of the timeout.

Hopefully fixes #4268

Change-Id: Ifdab45141b3fc7c62bde11813dbc534b3255fe59
2021-11-09 13:16:29 +00:00
Cameron Ayer
56fe636123 satellite/{reputation/satellitedb}: remove references to contained column in reputations table
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.

Change-Id: I5671722a5fc6e1749d3a49e187a56556000ff941
2021-10-14 19:59:03 +00:00
Cameron Ayer
bb21551a9c satellite/satellitedb: remove references to contained column in nodes table
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.

Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
2021-10-14 19:17:46 +00:00
Michał Niewrzał
1ed5db1467 satellite/metainfo: simplifying limits code
Its a very simple change to reduct code duplication.

Change-Id: Ia135232e3aefd094f76c6988e82e297be028e174
2021-09-28 06:22:13 +00:00
Yaroslav Vorobiov
469ae72c19 satellite/repair: update audit records during repair
Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0
2021-09-24 00:48:13 +00:00
Yingrong Zhao
0b500a30e4 satellite/audit: move audit metrics out of reporter
Since we are sharing the reporting logic between repair and audit. We
need to remove metric reporting logic in reporter.

Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21
2021-09-16 17:58:56 +00:00
Egon Elbre
1aec831d98 satellite/audit,storage: increase sleep delay in TestMaxVerifyCount
Currently TextMaxVerifyCount flakes in some tests, try increasing the
sleep time to ensure that things are slow enough to trigger the error
condition.

Also pass ctx to all the funcs so we can handle sleep better.

Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00
2021-09-10 15:30:37 +00:00
Michał Niewrzał
c258f4bbac private/testplanet: move Metabase outside Metainfo for satellite
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.

Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
2021-09-09 07:15:51 +00:00
Yaroslav Vorobiov
ee4361fe0d satellite/audit: fix segment stripes length calculation
GetRandomStripe function to randomly select a segment stripe to
audit was using `segment.EncryptedSize/segment.Redundancy.StripeSize()`.
Since integer divsion truncates it leads to skipping last stripe if
its size is less than stripe size. Use `Redundancy.StripeCount` to
get correct stripe count.

Change-Id: Ida09e035be30a21219ab3e1aedd66af8be707d1b
2021-09-01 13:25:20 +03:00
Yingrong Zhao
b64d8084e1 satellite/audit: fix metric reporting when fail to complete an audit
Change-Id: I39df8d4291db35afbba824281cb23438a91c45db
2021-08-31 17:02:30 +00:00
Cameron Ayer
28cb690618 satellite/audit: log error and increment metric if shares cannot be verified
If we encounter an error during the infectious error correction, we just
add it to the errlist to be logged at the worker level.
We want to make sure we know about this if it happens. Give it its own
error log and increment a monkit metric.

Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27
2021-08-27 15:28:16 +00:00
Cameron Ayer
24e02b6352 satellite/{audit,orders}: if not enough nodes for audit order limits, increment metric and wrap error with ErrNotEnoughShares
Increment a metric so we can get alerts. Wrap the error so we can search
the logs for it.

Change-Id: I3827aa306c431009828014d9d9afff8dfc057ee6
2021-08-26 20:14:05 +00:00
Cameron Ayer
5a1a29a62e satellite/audit: fix containment bug where nodes not removed
When a node gets enough timeouts, it is supposed to be removed
from pending_audits and get an audit failure. We would give them
a failure, but we missed the removal. This change fixes it.

Change-Id: I2f7014e28d7d9b01a9d051f5bbb4f67c86c7b36b
2021-08-20 14:48:27 +00:00
Cameron Ayer
70296c5050 satellite/audit: change wording of audit worker error log
"audit failed" is already used when a node fails an audit. That makes
searching for this higher level audit worker error more difficult.
Additionally, the presence of errors from the audit worker doesn't
necessarily mean the audit failed. Reword the error message to
"error(s) during audit"

Change-Id: I0aab12c73c18d4bd962c5d8ac8a17cabcec022e6
2021-08-20 13:27:16 +00:00
Cameron Ayer
a8f125c671 satellite:{audit,repair}: log additional info when we can't download enough pieces
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.

In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.

Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
2021-08-09 22:57:49 +00:00