Commit Graph

172 Commits

Author SHA1 Message Date
paul cannon
c54c45c9c7 satellite/audit: new ReverifyPiece implementation
ReverifyPiece() is not currently hooked up to anything, but is planned
to take the place of audit.(*Verifier).Reverify().

ReverifyPiece() works by downloading one piece in its entirety, rather
than pulling an entire stripe across many nodes.

Change-Id: Ie2c680f4d3c3b65273a72466a3f9f55c115b0311
2022-10-27 16:06:21 +00:00
paul cannon
9c67f62fe3 satellite/satellitedb: add table for reverify queue
This table will be used as a queue for pieces that need to be reverified
(a regular audit timed out on the owning node, so now that node is
contained and we need to validate the piece before un-containing it).

Refs: https://github.com/storj/storj/issues/5228

Change-Id: I5dcd26b6adced8674cbd81884c1543a61ea9d4c8
2022-10-27 15:28:47 +00:00
JT Olio
58a9c55f36 mod: bump dependencies
- storj.io/common

Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8
2022-10-19 17:01:53 +00:00
Egon Elbre
8b70f969b6 all: fix nolint directives
Change-Id: I261c8b12e4961e6401cc4024fa5abc35b1a5efa6
2022-10-11 18:31:20 +00:00
Michal Niewrzal
e37435602f satellite/audit: optimize loop observer
Two things were done to optimize the audit observer:
* the monkit call was removed, as we have a different way to track it
* no new allocation for the audit.Segment struct inside the observer

Benchmark against 'main':
name                                         old time/op    new time/op    delta
RemoteSegment/Cockroach/multiple_segments-8    5.85µs ± 1%    0.74µs ± 4%   -87.28%  (p=0.008 n=5+5)

name                                         old alloc/op   new alloc/op   delta
RemoteSegment/Cockroach/multiple_segments-8    2.72kB ± 0%    0.00kB           ~     (p=0.079 n=4+5)

name                                         old allocs/op  new allocs/op  delta
RemoteSegment/Cockroach/multiple_segments-8      50.0 ± 0%       0.0       -100.00%  (p=0.008 n=5+5)

Change-Id: Ib973e48782bad4346eee1cd5aee77f0a50f69258
2022-10-02 22:24:37 +00:00
paul cannon
802ff18bd8 satellite/audit: better handling of piece fetch errors
We have an alert on `not_enough_shares_for_audit` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the audit framework to act in that way.

Instead, in this change, we add three new metrics,
`audit_not_enough_nodes_online`, `audit_not_enough_shares_acquired`, and
`audit_suspected_network_problem`. When an audit fails, and emits
`not_enough_shares_for_audit`, we will now determine whether it looks
like we are having network problems (most errors are connection
failures, possibly also some successful connections which subsequently
time out) or whether something else has happened.

After this is deployed, we can remove the alert on
`not_enough_shares_for_audit` and add new alerts on
`audit_not_enough_nodes_online` and `audit_not_enough_shares_acquired`.
`audit_suspected_network_problem` does not need an alert.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: Ibb256bc19d2578904f71f5229111ac98e5212fcb
2022-09-28 17:02:06 +00:00
paul cannon
7d0885bbaa satellite/repair: move over audit.Pieces
This structure is entirely unused within the audit module, and is only
used by repair code. Accordingly, this change moves the structure from
audit code to repair code.

Also, we take the opportunity here to rename the structure to something
less generic.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e
2022-09-22 16:43:03 +00:00
paul cannon
0dcc0a9ee0 satellite/reputation: reconfigure lambda and alpha
This is in response to community feedback that our existing reputation
calculation is too likely to disqualify storage nodes unfairly with
extreme swings up and down.

For details and analysis, please see the data_loss_vs_dq_chance_sim.py
tool, the "tuning reputation further.ipynb" Jupyter notebook in the
storj/datascience repository, and the discussion at

    https://forum.storj.io/t/tuning-audit-scoring/14084

In brief: changing the lambda and initial-alpha parameters in this way
causes the swings in reputation to be smaller and less likely to put a
node past the disqualification threshold unfairly.

Note: this change will cause a one-time reset of all (non-disqualified)
node reputations, because the new initial alpha value of 1000 is
dramatically different, and the disqualification threshold is going to
be much higher.
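
To illustrate why these parameters dampen swings, here is a minimal sketch of a beta-reputation style update. The update rule, the type and method names, and the lambda values 0.95 and 0.999 are all assumptions made for illustration; only the initial alpha of 1000 comes from the message above. The actual satellite code may differ.

```go
package main

import "fmt"

// Reputation is a hypothetical alpha/beta pair; it is not the satellite's type.
type Reputation struct {
	Alpha, Beta float64
}

// Update applies one audit outcome, assuming the usual beta-reputation rule:
//   alpha' = lambda*alpha + weight*(1+v)/2
//   beta'  = lambda*beta  + weight*(1-v)/2
// where v is +1 for a success and -1 for a failure.
func (r Reputation) Update(lambda, weight float64, success bool) Reputation {
	v := 1.0
	if !success {
		v = -1.0
	}
	return Reputation{
		Alpha: lambda*r.Alpha + weight*(1+v)/2,
		Beta:  lambda*r.Beta + weight*(1-v)/2,
	}
}

// Score returns alpha / (alpha + beta), or 0 when there is no history yet.
func (r Reputation) Score() float64 {
	if r.Alpha+r.Beta == 0 {
		return 0
	}
	return r.Alpha / (r.Alpha + r.Beta)
}

func main() {
	// Illustrative lambdas only. A lambda closer to 1 forgets history more
	// slowly, so a single failed audit moves the score less.
	short := Reputation{Alpha: 20}.Update(0.95, 1, false)
	long := Reputation{Alpha: 1000}.Update(0.999, 1, false)
	fmt.Printf("short memory: %.4f\n", short.Score())
	fmt.Printf("long memory:  %.4f\n", long.Score())
}
```

Under this assumed rule, a node with a perfect history settles at alpha = weight / (1 - lambda), so a lambda of 0.999 with weight 1 corresponds to an alpha of about 1000.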

Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3
2022-08-17 18:52:53 +00:00
paul cannon
37a4edbaff all: reformat comments as required by gofmt 1.19
I don't know why the go people thought this was a good idea, because
this automatic reformatting is bound to do the wrong thing sometimes,
which is very annoying. But I don't see a way to turn it off, so best to
get this change out of the way.

Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3
2022-08-10 18:24:55 +00:00
Michal Niewrzal
6cc2052f47 satellite: fix segment loop observers metrics
We made an optimization for segment loop observers to avoid
heavy monkit initialization on each call. It was applied to very
frequently executed methods. Unfortunately, we used the wrong monkit
method to track function times. Instead of mon.Task we used
mon.Func().

https://github.com/spacemonkeygo/monkit#how-it-works

Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8
2022-08-10 14:13:16 +00:00
Egon Elbre
bc9ab8ee5e satellite/audit,storagenode/gracefulexit: fixes to limiter
Ensure we don't rely on limiter to wait multiple times.

Change-Id: I75d48420236216d4c2fc6fa99293f51f80cd9c33
2022-08-03 10:24:16 +03:00
paul cannon
2f20bbf4d8 satellite/reputation: add a reputation write cache
This should lower the amount of database load coming from
reputation updates.

Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949
2022-07-14 21:40:16 +00:00
Egon Elbre
b8006e192b satellite/audit: use a larger delay in the test
The MinDownloadTimeout of 950ms and the delay of 1s were quite close, possibly
causing flaky behavior in TestVerifierSlowDownload.

Change-Id: I4f6c1554a118b21427357642abe39986fd0af38d
2022-06-28 17:03:23 +03:00
paul cannon
737d7c7dfc satellite/reputation: new ApplyUpdates() method
The ApplyUpdates() method on the reputation.DB interface acts like the
similar Update() method, but can allow for applying the changes from
multiple audit events, instead of only one.

This will be necessary for the reputation write cache, which will batch
up changes to each node's reputation in order to flush them
periodically.
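
The batching idea can be sketched as follows. This is only an illustration under assumed names: `Mutations`, `Store`, `WriteCache`, and the `ApplyUpdates` signature shown here are hypothetical and do not reproduce the real reputation.DB interface. The sketch is also not safe for concurrent use.

```go
package main

import "fmt"

// Mutations is a hypothetical tally of pending audit outcomes for one node.
type Mutations struct {
	Successes, Failures int
}

// Store is a hypothetical stand-in for a database with a batch-apply method.
type Store interface {
	ApplyUpdates(nodeID string, m Mutations) error
}

// WriteCache accumulates audit outcomes in memory until Flush is called.
type WriteCache struct {
	store   Store
	pending map[string]Mutations
}

// NewWriteCache returns an empty cache in front of the given store.
func NewWriteCache(s Store) *WriteCache {
	return &WriteCache{store: s, pending: make(map[string]Mutations)}
}

// Record notes one audit outcome without touching the store.
func (c *WriteCache) Record(nodeID string, success bool) {
	m := c.pending[nodeID]
	if success {
		m.Successes++
	} else {
		m.Failures++
	}
	c.pending[nodeID] = m
}

// Flush writes one batched update per node and clears what was written.
func (c *WriteCache) Flush() error {
	for id, m := range c.pending {
		if err := c.store.ApplyUpdates(id, m); err != nil {
			return err
		}
		delete(c.pending, id)
	}
	return nil
}

// countingStore counts how many writes actually reach the "database".
type countingStore struct {
	writes int
	last   map[string]Mutations
}

func (s *countingStore) ApplyUpdates(nodeID string, m Mutations) error {
	s.writes++
	s.last[nodeID] = m
	return nil
}

func main() {
	store := &countingStore{last: make(map[string]Mutations)}
	cache := NewWriteCache(store)
	cache.Record("node-1", true)
	cache.Record("node-1", true)
	cache.Record("node-1", false)
	if err := cache.Flush(); err != nil {
		fmt.Println("flush failed:", err)
		return
	}
	fmt.Println("writes:", store.writes, "batched:", store.last["node-1"])
}
```

Three recorded audit events for the same node reach the store as a single write, which is where the reduction in database load would come from.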

Refs: https://github.com/storj/storj/issues/4601

Change-Id: I44cc47767ea2d9423166bb8fed080c8a11182041
2022-06-07 15:22:25 +00:00
paul cannon
fd01c6cc25 satellite/{repair,audit}: simplify reputation reporter
Also, make it an interface so that the upcoming write cache can be
dropped in to the same place.

Change-Id: I2c286743825e647c0cef5b6578245391851fa10c
2022-05-10 14:04:43 +00:00
Michał Niewrzał
307295977d satellite/{audit,metrics}: optimize loop methods
What was applied:
* avoid extra map lookups
* reorganize monkit for less CPU usage

Change-Id: I70575f404f717f7905b27d43888cbd7489f0176d
2022-05-05 15:10:56 +00:00
Yaroslav Vorobiov
3f47d19aa6 satellite/overlay: add disqualification reason
Add disqualification reason to NodeDossier.
Extend DB.DisqualifyNode with disqualification reason.
Extend reputation Service.TestDisqualifyNode with disqualification reason.

Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d
2022-04-20 13:29:31 +00:00
Erik van Velzen
86d742f7c6 satellite/audit: verify auditing of copies
Check that audit works in the face of copies.

Closes https://github.com/storj/storj/issues/4695

Change-Id: I1ee79a73c28e3f4842eebe8c4e4cd9ecf2e51e57
2022-04-12 15:24:54 +00:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
Mya
05a17ef42d deps: upgrade storj.io/common
In addition to upgrading the storj.io/common library, this change
moves off the TCPConnector in favor of the HybridConnector per
the deprecation warning.

Change-Id: I7e7e1e7568e8b95e4a99ad9caa158a799e68e1e3
2022-02-16 18:59:19 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status inconsistency

The original design had a flaw that could cause a discrepancy in a node's
reputation status between the reputations table and the nodes table.
If a failure (network issue, db failure, satellite failure, etc.) happens
between the update to the reputations table and the update to the nodes table,
the data can get out of sync.
This PR tries to fix the above issue by passing the node's reputation through
from the beginning of an audit/repair (this data is from the nodes table) to the
next update in the reputation service. If the updated reputation status from the
service is different from the existing node status, the service will try to
update the nodes table. In the case of a failure, the service will be able to
try updating the nodes table again, since it can see the discrepancy in the data.
This will allow both tables to be in sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
dlamarmorgan
b3cea3d1b6 satellite/audit: account for piece size during audit reservoir sampling
Treat the piece size as a weight, and perform weighted reservoir sampling as given in Algorithm A-Chao (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-Chao)

Change-Id: I299d0026d9e02d03b3d2130b0f32192928e6e326
2021-12-01 18:17:52 +00:00
Egon Elbre
8eebbf3d7d satellite/audit: fix TestReverify timeouts
Currently the slow db was sleeping for 1s and the timeout for audit was
1s. There's a slight chance that the timeout won't trigger on such a
small difference.

Increase the slow node sleep to 10x of the timeout.

Hopefully fixes #4268

Change-Id: Ifdab45141b3fc7c62bde11813dbc534b3255fe59
2021-11-09 13:16:29 +00:00
Cameron Ayer
56fe636123 satellite/{reputation/satellitedb}: remove references to contained column in reputations table
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.

Change-Id: I5671722a5fc6e1749d3a49e187a56556000ff941
2021-10-14 19:59:03 +00:00
Cameron Ayer
bb21551a9c satellite/satellitedb: remove references to contained column in nodes table
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.

Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
2021-10-14 19:17:46 +00:00
Michał Niewrzał
1ed5db1467 satellite/metainfo: simplifying limits code
It's a very simple change to reduce code duplication.

Change-Id: Ia135232e3aefd094f76c6988e82e297be028e174
2021-09-28 06:22:13 +00:00
Yaroslav Vorobiov
469ae72c19 satellite/repair: update audit records during repair
Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0
2021-09-24 00:48:13 +00:00
Yingrong Zhao
0b500a30e4 satellite/audit: move audit metrics out of reporter
Since we are sharing the reporting logic between repair and audit, we
need to remove the metric reporting logic from the reporter.

Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21
2021-09-16 17:58:56 +00:00
Egon Elbre
1aec831d98 satellite/audit,storage: increase sleep delay in TestMaxVerifyCount
Currently TestMaxVerifyCount flakes in some test runs. Try increasing the
sleep time to ensure that things are slow enough to trigger the error
condition.

Also pass ctx to all the funcs so we can handle sleep better.

Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00
2021-09-10 15:30:37 +00:00
Michał Niewrzał
c258f4bbac private/testplanet: move Metabase outside Metainfo for satellite
At some point we moved the metabase package outside Metainfo,
but we didn't do that for the satellite structure. This change
refactors only tests.
Once uplink is adjusted, we can remove the old entries in the
Metainfo struct.

Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
2021-09-09 07:15:51 +00:00
Yaroslav Vorobiov
ee4361fe0d satellite/audit: fix segment stripes length calculation
The GetRandomStripe function, which randomly selects a segment stripe to
audit, was using `segment.EncryptedSize/segment.Redundancy.StripeSize()`.
Since integer division truncates, this skips the last stripe if
its size is less than the stripe size. Use `Redundancy.StripeCount` to
get the correct stripe count.
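
The truncation problem can be shown with plain integers. The helper names below are hypothetical, and rounding up is only an assumption about what a correct stripe count looks like, not a copy of `Redundancy.StripeCount`.

```go
package main

import "fmt"

// truncatedStripes mirrors the buggy calculation: plain integer division
// drops a trailing partial stripe.
func truncatedStripes(encryptedSize, stripeSize int32) int32 {
	return encryptedSize / stripeSize
}

// ceilStripes rounds up, so a trailing partial stripe is still counted.
func ceilStripes(encryptedSize, stripeSize int32) int32 {
	return (encryptedSize + stripeSize - 1) / stripeSize
}

func main() {
	// 1000 bytes with 256-byte stripes: three full stripes plus 232 bytes.
	fmt.Println(truncatedStripes(1000, 256), ceilStripes(1000, 256)) // prints "3 4"
}
```

With the truncated count, a random index drawn from the range [0, 3) could never select the final, partial stripe.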

Change-Id: Ida09e035be30a21219ab3e1aedd66af8be707d1b
2021-09-01 13:25:20 +03:00
Yingrong Zhao
b64d8084e1 satellite/audit: fix metric reporting when fail to complete an audit
Change-Id: I39df8d4291db35afbba824281cb23438a91c45db
2021-08-31 17:02:30 +00:00
Cameron Ayer
28cb690618 satellite/audit: log error and increment metric if shares cannot be verified
If we encounter an error during the infectious error correction, we just
add it to the errlist to be logged at the worker level.
We want to make sure we know about this if it happens. Give it its own
error log and increment a monkit metric.

Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27
2021-08-27 15:28:16 +00:00
Cameron Ayer
24e02b6352 satellite/{audit,orders}: if not enough nodes for audit order limits, increment metric and wrap error with ErrNotEnoughShares
Increment a metric so we can get alerts. Wrap the error so we can search
the logs for it.

Change-Id: I3827aa306c431009828014d9d9afff8dfc057ee6
2021-08-26 20:14:05 +00:00
Cameron Ayer
5a1a29a62e satellite/audit: fix containment bug where nodes not removed
When a node gets enough timeouts, it is supposed to be removed
from pending_audits and get an audit failure. We would give them
a failure, but we missed the removal. This change fixes it.

Change-Id: I2f7014e28d7d9b01a9d051f5bbb4f67c86c7b36b
2021-08-20 14:48:27 +00:00
Cameron Ayer
70296c5050 satellite/audit: change wording of audit worker error log
"audit failed" is already used when a node fails an audit. That makes
searching for this higher level audit worker error more difficult.
Additionally, the presence of errors from the audit worker doesn't
necessarily mean the audit failed. Reword the error message to
"error(s) during audit"

Change-Id: I0aab12c73c18d4bd962c5d8ac8a17cabcec022e6
2021-08-20 13:27:16 +00:00
Cameron Ayer
a8f125c671 satellite/{audit,repair}: log additional info when we can't download enough pieces
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.

In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.

Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
2021-08-09 22:57:49 +00:00
Yingrong Zhao
58238d850c satellite/{audit, accounting}: use reputation store in tests
Change-Id: I86a8ccf5dcee8d108196a9f67a476fe0ccbd8257
2021-07-28 13:21:55 -04:00
Yingrong Zhao
6c7bf357cd satellite/{reputation,audit,overlay}: replace overlay with reputation package in audit

This PR implements a reputation store and replaces overlay in the audit service
with that store for storing a node's audit stats.

In order to keep the changeset smaller, most of the changes in this PR are for
copying audit logic from overlay to the reputation package. In a following PR,
the duplicated code will be removed from overlay.

Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
2021-07-28 13:10:48 -04:00
Cameron Ayer
adc0fbddfa satellite/audit: don't fail nodes for audit if not enough pieces downloaded
In most situations where we would not get enough shares to complete
an audit, something has probably gone wrong like a forgotten delete,
and nodes should not be failed. We have an alert when this occurs.
Check the logs to see what happened. If we decide the nodes should
get audit failures, we can do it manually.

Change-Id: Ib6e408082048d31197c37ebfd7f9031135fc938f
2021-07-20 20:28:18 +00:00
Michał Niewrzał
70e6cdfd06 satellite/audit: move to segmentloop
Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51
2021-06-28 11:32:00 +00:00
Michał Niewrzał
8ce619706b satellite/audit: migrate to new segment_pending_audit table
Currently, pending audit finds a segment by its segment location
(path). Because we want to move audit to segmentloop, where we will
have only StreamID and Position, we need to add columns for those
fields. Altering the existing table can cause issues during
migration and deployment. The cleaner choice is to make a new table.
This change contains a migration with the new segment_pending_audit
table, which will replace the pending_audits table, and adjustments
to use the new table in the code.

Table pending_audits will be dropped with next release.

Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f
2021-06-28 13:19:49 +02:00
JT Olio
6949dc0bac satellite/metaloop: missing monitoring on observers
Change-Id: I630fbb0448c8d08b426486b3e49abfbca03332a6
2021-06-15 13:39:13 +00:00
Jeff Wendling
d674bc9c52 satellite/audit: include failing segment info in logs
Change-Id: I972fe19a2479f48bccc8a87a282467345a9dc1ec
2021-06-10 13:47:22 +03:00
Jeff Wendling
944bceabcd satellite/audit: fix reservoir sampling bias
Change-Id: Icc522fd86538b8182a1b7d42c1588c32a257acaf
2021-06-10 13:47:22 +03:00
JT Olio
da9ca0c650 testplanet/satellite: reduce the number of places default values need to be configured
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.

As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.

This change makes it so all values are configured in the same,
systematic way, so it's easy to see when test values are different
from dev values or release values, and it's harder to forget
to enable features in testplanet.

In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.

Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
2021-06-01 22:14:17 +00:00
Cameron Ayer
53322bb0a7 satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats
Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.

Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
2021-06-01 21:02:44 +00:00
Egon Elbre
10a0216af5 satellite/metainfo: use range for specifying download limit
Previously the object range was not used for calculating order limit.
This meant that even if you were downloading only a small range it would
account bandwidth based on the full segment.

This doesn't fully address the accounting since the lazy segment
downloads do not send their requested range nor requested limit.

Change-Id: Ic811e570c889be87bac4293547d6537a255078da
2021-06-01 09:36:55 +00:00
Egon Elbre
910eec8eee satellite/metainfo: remove MetabaseDB interface
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.

Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.

Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
2021-05-13 13:22:14 +00:00
Egon Elbre
69b149a66f mod: bump uplink
uplink stopped using zap, hence some of the private methods needed to be
changed.

Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5
2021-05-06 14:48:36 +00:00