Commit Graph

77 Commits

Author SHA1 Message Date
paul cannon
355ea2133b satellite/audit: remove pieces when audits fail
When pieces fail an audit (hard fail, meaning the node acknowledged it
did not have the piece or the piece was corrupted), we will now remove
those pieces from the segment.

Previously, we did not do this, and some node operators were seeing the
same missing piece audited over and over again and losing reputation
every time.

This change will include both verification and reverification audits. It
will also apply to pieces found to be bad during repair, if
repair-to-reputation reporting is enabled.

Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7
2023-06-22 14:19:00 +00:00
paul cannon
c2710cc78d satellite/audit: improve error handling
* Don't use rpcstatus.Unknown as an indicator of dial failure; instead,
  GetShare now indicates with a per-share field where a failure happened
  (DialFailure, RequestFailure, NoFailure). Use that information in
  Verify() to determine how to treat the source node.

* Add a test that replaces a storage node with a black hole, so that
  connections there will always time out. Make sure we handle that case
  correctly.

Refs: https://github.com/storj/storj/issues/5632
Change-Id: I513a53520fb48b7187d4c4d7e14e01e2cfc0a721
2023-05-11 22:55:26 +00:00
JT Olio
4362761fc7 satellite/audit: fix go1.19 dial timeouts and log more
Change-Id: Ide17c1b8e0ca8c86f305bea1b4ae553cc4cb60d0
2023-02-28 17:09:47 +00:00
paul cannon
fc905a15f7 satellite/audit: newContainment->containment
Now that all the reverification changes have been made and the old code
is out of the way, this commit renames the new things back to the old
names. Mostly, this involves renaming "newContainment" to "containment"
or "NewContainment" to "Containment", but there are a few other renames
that have been promised and are carried out here.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: I34e2b857ea338acbb8421cdac18b17f2974f233c
2022-12-16 17:59:52 +00:00
paul cannon
0342ca1aa6 satellite/audit: delete now-unused code
Now that we are doing scalable piecewise reverifications, the code for
handling the old way of doing things (containment, pending audits,
reporting, testing) can now be removed.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: Ief1a75f423eff682e8f3d57804e343b3409a6631
2022-12-16 14:53:39 +00:00
paul cannon
a66503b444 satellite/audit: Begin using piecewise reverifications
This commit pulls the big switch! We have been setting up piecewise
reverifications (the workers for which can be scaled independently of
the core) for several commits now, and this commit actually begins
making use of them.

The core of this commit is fairly small, but it requires changing the
semantics in all the tests that relate to reverifications, so it ends up
being a large change. The changes to the tests are mostly mechanical and
repetitive, though, so reviewers needn't worry much.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: Ibb421cc021664fd6e0096ffdf5b402a69b2d6f18
2022-12-16 14:21:13 +00:00
paul cannon
47b9134f76 satellite/audit: add IdentifyContainedNodes
This method on the Verifier allows the caller to find, out of the nodes
holding pieces in a given segment, which ones are contained.

This method is not yet being used. It will be in a future commit.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: I242cd999913ca4dabbe8a62767ed4869b31fca04
2022-12-13 20:46:43 +00:00
paul cannon
378b8915c4 satellite/{satellitedb,audit}: add NewContainment
NewContainment will replace Containment later in this commit chain, but
for now it is not yet being used.

NewContainment will allow a node to be contained for multiple pending
reverify jobs at a time. It is implemented by way of the reverify queue.

Refs: https://github.com/storj/storj/issues/5231
Change-Id: I126eda0b3dfc4710a88fe4a5f41780618ec19101
2022-12-07 18:03:37 +00:00
JT Olio
58a9c55f36 mod: bump dependencies
- storj.io/common

Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8
2022-10-19 17:01:53 +00:00
paul cannon
802ff18bd8 satellite/audit: better handling of piece fetch errors
We have an alert on `not_enough_shares_for_audit` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the audit framework to act in that way.

Instead, in this change, we add three new metrics,
`audit_not_enough_nodes_online`, `audit_not_enough_shares_acquired`, and
`audit_suspected_network_problem`. When an audit fails, and emits
`not_enough_shares_for_audit`, we will now determine whether it looks
like we are having network problems (most errors are connection
failures, possibly also some successful connections which subsequently
time out) or whether something else has happened.

After this is deployed, we can remove the alert on
`not_enough_shares_for_audit` and add new alerts on
`audit_not_enough_nodes_online` and `audit_not_enough_shares_acquired`.
`audit_suspected_network_problem` does not need an alert.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: Ibb256bc19d2578904f71f5229111ac98e5212fcb
2022-09-28 17:02:06 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status
inconsistency

The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
Yingrong Zhao
0b500a30e4 satellite/audit: move audit metrics out of reporter
Since we are sharing the reporting logic between repair and audit. We
need to remove metric reporting logic in reporter.

Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21
2021-09-16 17:58:56 +00:00
Yaroslav Vorobiov
ee4361fe0d satellite/audit: fix segment stripes length calculation
GetRandomStripe function to randomly select a segment stripe to
audit was using `segment.EncryptedSize/segment.Redundancy.StripeSize()`.
Since integer divsion truncates it leads to skipping last stripe if
its size is less than stripe size. Use `Redundancy.StripeCount` to
get correct stripe count.

Change-Id: Ida09e035be30a21219ab3e1aedd66af8be707d1b
2021-09-01 13:25:20 +03:00
Yingrong Zhao
b64d8084e1 satellite/audit: fix metric reporting when fail to complete an audit
Change-Id: I39df8d4291db35afbba824281cb23438a91c45db
2021-08-31 17:02:30 +00:00
Cameron Ayer
28cb690618 satellite/audit: log error and increment metric if shares cannot be verified
If we encounter an error during the infectious error correction, we just
add it to the errlist to be logged at the worker level.
We want to make sure we know about this if it happens. Give it its own
error log and increment a monkit metric.

Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27
2021-08-27 15:28:16 +00:00
Cameron Ayer
24e02b6352 satellite/{audit,orders}: if not enough nodes for audit order limits, increment metric and wrap error with ErrNotEnoughShares
Increment a metric so we can get alerts. Wrap the error so we can search
the logs for it.

Change-Id: I3827aa306c431009828014d9d9afff8dfc057ee6
2021-08-26 20:14:05 +00:00
Cameron Ayer
a8f125c671 satellite:{audit,repair}: log additional info when we can't download enough pieces
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.

In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.

Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
2021-08-09 22:57:49 +00:00
Cameron Ayer
adc0fbddfa satellite/audit: don't fail nodes for audit if not enough pieces downloaded
In most situations where we would not get enough shares to complete
an audit, something has probably gone wrong like a forgotten delete,
and nodes should not be failed. We have an alert when this occurs.
Check the logs to see what happened. If we decide the nodes should
get audit failures, we can do it manually.

Change-Id: Ib6e408082048d31197c37ebfd7f9031135fc938f
2021-07-20 20:28:18 +00:00
Michał Niewrzał
70e6cdfd06 satellite/audit: move to segmentloop
Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51
2021-06-28 11:32:00 +00:00
Michał Niewrzał
8ce619706b satellite/audit: migrate to new segment_pending_audit table
Currently, pending audit is finding segment by segment location
(path) because we want to move audit to segmentloop and we will
have only StreamID and Position we need to add columns for those
fields. Altering existing table can cause issues while
migration and deployment. Cleaner choise is to make new table.
This change contains migration with new segment_pending_audit
table that will replace pending_audits table and adjustments
to use new table in the code.

Table pending_audits will be dropped with next release.

Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f
2021-06-28 13:19:49 +02:00
Jeff Wendling
d674bc9c52 satellite/audit: include failing segment info in logs
Change-Id: I972fe19a2479f48bccc8a87a282467345a9dc1ec
2021-06-10 13:47:22 +03:00
Cameron Ayer
53322bb0a7 satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats
Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.

Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
2021-06-01 21:02:44 +00:00
Egon Elbre
910eec8eee satellite/metainfo: remove MetabaseDB interface
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.

Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.

Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
2021-05-13 13:22:14 +00:00
Egon Elbre
69b149a66f mod: bump uplink
uplink stopped using zap, hence some of the private methods needed to be
changed.

Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5
2021-05-06 14:48:36 +00:00
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Michał Niewrzał
237782813b Merge remote-tracking branch 'origin/multipart-upload'
Change-Id: If6c5a450b238adab55d1e0dea67d01e5f5768a9f
2021-03-23 09:44:49 +01:00
Cameron Ayer
a04495713d satellite/audit: add missing logs for audit failure conditions
Among other conditions, nodes fail audits by returning incorrect
data and by reaching the max reverify count, but we weren't logging
these events. This commit adds the missing logs.

Change-Id: I80749a7e95e8cb97bc8dd7dac1e523e223114b7f
2021-03-18 17:33:11 +00:00
Michał Niewrzał
67e26aafcd Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I9b183323cb470185be22f7c648bb76917d2e6fca
2021-03-10 08:53:38 +01:00
Cameron Ayer
a44974a2f9 satellite/audit: fix pointless containment deletions
Previously if node was not found in containment, it was
given the status, 'skipped'.

We later try to delete skipped nodes from containment.

To fix this, add a new status called 'remove' to differentiate
nodes which should be skipped and nodes which should be deleted.

Change-Id: Ic09e62dc9723c89d0c9f968ce68c039114a9d74e
2021-03-02 13:40:18 -05:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Egon Elbre
12055e7864 all: minor cleanups
Change-Id: I4248dbe36a62a223b06135254b32851485a2eec1
2020-12-16 10:47:46 +00:00
Stefan Benten
494bd5db81
all: golangci-lint v1.33.0 fixes (#3985) 2020-12-05 17:01:42 +01:00
Kaloyan Raev
53b7fd7b00 satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified
We now have the piece hashes verified for all segments on all production
satellites. We can remove the code that handles the case where piece
hashes are not verified. This would make easier the migration of
services from PointerDB to the new metabase.

For consistency, PieceHashesVerified is still set to true in PointerDB
for new segments.

Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183
2020-11-24 11:09:48 +02:00
Kaloyan Raev
1f386db566
cmd/satellite: remove metainfo commands (#3955) 2020-10-22 13:33:09 +03:00
Kaloyan Raev
1aeb14e65e satellite/audit: do not delete expired segments
A year ago we made the audit service deleting expired segments.
Meanwhile, we introduced an expired deletetion sub-service in the
metainfo service which sole purpose is deleting expired segments.

Therefore, now we are removing this responsibility from the audit
service. It will continue to avoid reporting failures on expired
segments, but it would not delete them anymore.

We do this to cleanup responsibilities in advance of the metainfo
refactoring.

Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4
2020-10-22 08:24:16 +00:00
paul cannon
360ab17869 satellite/audit: use LastIPAndPort preferentially
This preserves the last_ip_and_port field from node lookups through
CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later
calls to (*Verifier).GetShare() can try to use that IP and port. If a
connection to the given IP and port cannot be made, or the connection
cannot be verified and secured with the target node identity, an
attempt is made to connect to the original node address instead.

A similar change is not necessary to the other Create*OrderLimits
functions, because they already replace node addresses with the cached
IP and port as appropriate. We might want to consider making a similar
change to CreateGetRepairOrderLimits(), though.

The audit situation is unique because the ramifications are especially
powerful when we get the address wrong. Failing a single audit can have
a heavy cost to a storage node. We need to make extra effort in order
to avoid imposing that cost unfairly.

Situation 1: If an audit fails because the repair worker failed to make
a DNS query (which might well be the fault on the satellite side), and
we have last_ip_and_port information available for the target node, it
would be unfair not to try connecting to that last_ip_and_port address.

Situation 2: If a node has changed addresses recently and the operator
correctly changed its DNS entry, but we don't bother querying DNS, it
would be unfair to penalize the node for our failure to connect to it.

So the audit worker must try both last_ip_and_port _and_ the node
address as supplied by the SNO.

We elect here to try last_ip_and_port first, on the grounds that (a) it
is expected to work in the large majority of cases, and (b) there
should not be any security concerns with connecting to an out-or-date
address, and (c) avoiding DNS queries on the satellite side helps
alleviate satellite operational load.

Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200
2020-10-21 13:34:40 +00:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
Kaloyan Raev
e7f2ec7ddf satellite/audit: fix sanity check for verify-piece-hashes command
The VerifyPieceHashes method has a sanity check for the number pieces to
be removed from the pointer after the audit for verifying the piece
hashes.

This sanity check failed when we executed the command on the production
satellites because the Verify command removes Fails and PendingAudits
nodes from the audit report if piece_hashes_verified = false.

A new temporary UsedToVerifyPieceHashes flag is added to
audits.Verifier. It is set to true only by the verify-piece-hashes
command. If the flag is true then the Verify method will always include
Fails and PendingAudits nodes in the report.

Test case is added to cover this use case.

Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8
2020-10-07 17:17:48 +03:00
Kaloyan Raev
b409b53f7f cmd/satellite: command for verifying piece hashes
Jira: https://storjlabs.atlassian.net/browse/PG-69

There are a number of segments with piece_hashes_verified = false in
their metadata) on US-Central-1, Europe-West-1, and Asia-East-1
satellites. Most probably, this happened due to a bug we had in the
past. We want to verify them before executing the main migration to
metabase. This would simplify the main migration to metabase with one
less issue to think about.

Change-Id: I8831af1a254c560d45bb87d7104e49abd8242236
2020-09-29 10:58:24 +00:00
Michal Niewrzal
aa47e70f03 satellite/metainfo: use metabase.SegmentKey with metainfo.Service
Instead of using string or []byte we will be using dedicated type
SegmentKey.

Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220
2020-09-03 15:11:32 +00:00
Egon Elbre
3ca405aa97 satellite/orders: use metabase types as arguments
Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef
2020-08-28 15:52:37 +03:00
Moby von Briesen
4f28bf0720 satellite/audit: Do not return errors from Verify or Reverify on segment modified, expired, or deleted
If a segment is deleted, is modified, or expires during an audit, this
is not problematic, so we should not return errors. Functionally,
nothing changes, but our metrics around audit success rate will be
improved after this change.

Change-Id: Ic11df056b2c73894b67a55894bd4d58c00470606
2020-08-26 13:24:00 +00:00
Qweder93
01bb2bd17d satellite/audit: verifier checks if node made sucess GE before auditing
Change-Id: Ia6cde4e9fcf11020a5301d38065f7159f276eb80
2020-08-17 23:37:57 +03:00
Egon Elbre
94a09ce20b all: add missing dots
Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a
2020-08-11 17:50:01 +03:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
Egon Elbre
ed627144ed all: use DialNodeURL throughout the codebase
Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c
2020-05-20 10:36:30 +00:00
paul cannon
0c8c11b251 satellite/audit: add not_enough_shares_for_audit counter
We have been using the SQL expression `name='(*Verifier).Verify' AND
error_name='not enough shares for successful audit'` thus far to detect
cases of this problem and alert on them. Unfortunately, since this
rarely (hopefully never) happens, influxdb has no data for most of the
auditor instances, and when it has no data for a time series, it returns
no columns either. This makes Redash upset when it tries to perform a
query for an alert and can't find the column whose value it expects to
check.

This change should make it so zero values are reported when the problem
has not happened, and higher values when it has.

Change-Id: I79e5e000f879678b661dac88caae1e2915b39ab1
2020-04-03 17:00:50 +00:00
littleskunk
23e5a0471f
satellite/audit: clean up logging (#3832)
Co-authored-by: Ivan Fraixedes <ivan@fraixed.es>
2020-03-30 12:09:50 -06:00
Bill Thorp
94c11c5212 satellite: remove some unnecessary UTC() calls
Fixes some easy cases of extraneous UTC() calls

Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a
2020-03-13 13:49:44 +00:00
Jennifer Johnson
0d60c1a4b2 satellite/audit: fix checkSegmentAltered to detect segments that have changed during an audit
- Previously, checkSegmentAltered only checked for segments that were replaced
  but we want to detect all changes to a segment that occurred while an audit was being conducted.
- Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified
  segments were not being removed from containment mode.
- Filled in gaps in reverify testing to ensure nodes are properly removed from containment.

Change-Id: Icd96d369278987200fd28581395725438972b292
2020-03-05 19:05:39 +00:00