Commit Graph

147 Commits

Author SHA1 Message Date
Michał Niewrzał
1ed5db1467 satellite/metainfo: simplifying limits code
Its a very simple change to reduct code duplication.

Change-Id: Ia135232e3aefd094f76c6988e82e297be028e174
2021-09-28 06:22:13 +00:00
Yaroslav Vorobiov
469ae72c19 satellite/repair: update audit records during repair
Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0
2021-09-24 00:48:13 +00:00
Yingrong Zhao
0b500a30e4 satellite/audit: move audit metrics out of reporter
Since we are sharing the reporting logic between repair and audit. We
need to remove metric reporting logic in reporter.

Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21
2021-09-16 17:58:56 +00:00
Egon Elbre
1aec831d98 satellite/audit,storage: increase sleep delay in TestMaxVerifyCount
Currently TextMaxVerifyCount flakes in some tests, try increasing the
sleep time to ensure that things are slow enough to trigger the error
condition.

Also pass ctx to all the funcs so we can handle sleep better.

Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00
2021-09-10 15:30:37 +00:00
Michał Niewrzał
c258f4bbac private/testplanet: move Metabase outside Metainfo for satellite
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.

Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
2021-09-09 07:15:51 +00:00
Yaroslav Vorobiov
ee4361fe0d satellite/audit: fix segment stripes length calculation
GetRandomStripe function to randomly select a segment stripe to
audit was using `segment.EncryptedSize/segment.Redundancy.StripeSize()`.
Since integer divsion truncates it leads to skipping last stripe if
its size is less than stripe size. Use `Redundancy.StripeCount` to
get correct stripe count.

Change-Id: Ida09e035be30a21219ab3e1aedd66af8be707d1b
2021-09-01 13:25:20 +03:00
Yingrong Zhao
b64d8084e1 satellite/audit: fix metric reporting when fail to complete an audit
Change-Id: I39df8d4291db35afbba824281cb23438a91c45db
2021-08-31 17:02:30 +00:00
Cameron Ayer
28cb690618 satellite/audit: log error and increment metric if shares cannot be verified
If we encounter an error during the infectious error correction, we just
add it to the errlist to be logged at the worker level.
We want to make sure we know about this if it happens. Give it its own
error log and increment a monkit metric.

Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27
2021-08-27 15:28:16 +00:00
Cameron Ayer
24e02b6352 satellite/{audit,orders}: if not enough nodes for audit order limits, increment metric and wrap error with ErrNotEnoughShares
Increment a metric so we can get alerts. Wrap the error so we can search
the logs for it.

Change-Id: I3827aa306c431009828014d9d9afff8dfc057ee6
2021-08-26 20:14:05 +00:00
Cameron Ayer
5a1a29a62e satellite/audit: fix containment bug where nodes not removed
When a node gets enough timeouts, it is supposed to be removed
from pending_audits and get an audit failure. We would give them
a failure, but we missed the removal. This change fixes it.

Change-Id: I2f7014e28d7d9b01a9d051f5bbb4f67c86c7b36b
2021-08-20 14:48:27 +00:00
Cameron Ayer
70296c5050 satellite/audit: change wording of audit worker error log
"audit failed" is already used when a node fails an audit. That makes
searching for this higher level audit worker error more difficult.
Additionally, the presence of errors from the audit worker doesn't
necessarily mean the audit failed. Reword the error message to
"error(s) during audit"

Change-Id: I0aab12c73c18d4bd962c5d8ac8a17cabcec022e6
2021-08-20 13:27:16 +00:00
Cameron Ayer
a8f125c671 satellite:{audit,repair}: log additional info when we can't download enough pieces
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.

In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.

Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
2021-08-09 22:57:49 +00:00
Yingrong Zhao
58238d850c satellite/{audit, accounting}: use reputation store in tests
Change-Id: I86a8ccf5dcee8d108196a9f67a476fe0ccbd8257
2021-07-28 13:21:55 -04:00
Yingrong Zhao
6c7bf357cd satellite/{reputation,audit,overlay}: replace overlay with reputation
package in audit

This PR implements reputation store and replace overlay in audit service
to use such store for storing node's audit stats.

In order to keep the changeset smaller, most of the changes in this PR is for copying audit logic in overlay to
reputation package. In a following PR, the duplicating code will be
removed from overlay.

Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
2021-07-28 13:10:48 -04:00
Cameron Ayer
adc0fbddfa satellite/audit: don't fail nodes for audit if not enough pieces downloaded
In most situations where we would not get enough shares to complete
an audit, something has probably gone wrong like a forgotten delete,
and nodes should not be failed. We have an alert when this occurs.
Check the logs to see what happened. If we decide the nodes should
get audit failures, we can do it manually.

Change-Id: Ib6e408082048d31197c37ebfd7f9031135fc938f
2021-07-20 20:28:18 +00:00
Michał Niewrzał
70e6cdfd06 satellite/audit: move to segmentloop
Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51
2021-06-28 11:32:00 +00:00
Michał Niewrzał
8ce619706b satellite/audit: migrate to new segment_pending_audit table
Currently, pending audit is finding segment by segment location
(path) because we want to move audit to segmentloop and we will
have only StreamID and Position we need to add columns for those
fields. Altering existing table can cause issues while
migration and deployment. Cleaner choise is to make new table.
This change contains migration with new segment_pending_audit
table that will replace pending_audits table and adjustments
to use new table in the code.

Table pending_audits will be dropped with next release.

Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f
2021-06-28 13:19:49 +02:00
JT Olio
6949dc0bac satellite/metaloop: missing monitoring on observers
Change-Id: I630fbb0448c8d08b426486b3e49abfbca03332a6
2021-06-15 13:39:13 +00:00
Jeff Wendling
d674bc9c52 satellite/audit: include failing segment info in logs
Change-Id: I972fe19a2479f48bccc8a87a282467345a9dc1ec
2021-06-10 13:47:22 +03:00
Jeff Wendling
944bceabcd satellite/audit: fix reservoir sampling bias
Change-Id: Icc522fd86538b8182a1b7d42c1588c32a257acaf
2021-06-10 13:47:22 +03:00
JT Olio
da9ca0c650 testplanet/satellite: reduce the number of places default values need to be configured
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.

As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.

This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.

In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.

Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
2021-06-01 22:14:17 +00:00
Cameron Ayer
53322bb0a7 satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats
Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.

Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
2021-06-01 21:02:44 +00:00
Egon Elbre
10a0216af5 satellite/metainfo: use range for specifying download limit
Previously the object range was not used for calculating order limit.
This meant that even if you were downloading only a small range it would
account bandwidth based on the full segment.

This doesn't fully address the accounting since the lazy segment
downloads do not send their requested range nor requested limit.

Change-Id: Ic811e570c889be87bac4293547d6537a255078da
2021-06-01 09:36:55 +00:00
Egon Elbre
910eec8eee satellite/metainfo: remove MetabaseDB interface
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.

Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.

Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
2021-05-13 13:22:14 +00:00
Egon Elbre
69b149a66f mod: bump uplink
uplink stopped using zap, hence some of the private methods needed to be
changed.

Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5
2021-05-06 14:48:36 +00:00
Cameron Ayer
bb343d9028 satellite/satellitedb: don't remove offline nodes from containment
When audits are being recorded, we automatically add some SQL to remove
the node from the pending audits table in case it exists. They are
removed from pending audits even if the node was offline for the audit.
This is not the correct behavior.

Add statement to record audit results in reverify tests to ensure no
more false positives.

Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17
2021-05-03 16:05:55 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
Egon Elbre
4c9ed64f75 satellite/metabase/metaloop: move loop under metabase
Currently the loop handling is heavily related to the metabase rather
than metainfo.

metainfo over time has become related to the "public API" for accessing
the metabase data.

Currently updates monkit.lock, because monkit monitoring does not handle
ScopeNamed correctly. Needs a followup change to monitoring check.

Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584
2021-04-22 12:58:09 +03:00
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Fadila Khadar
bde367ae73 satellite/gc: check on bloom filter creation date
Check that the bloom filter creation date is earlier than the
metainfo loop system time used for db scanning.

Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4
2021-04-14 16:40:37 +00:00
Egon Elbre
f19ef4afe5 satellite/metainfo/metaloop: move loop to a separate package
Change-Id: I94c931a27c1af6062185ec62688624ec02050f11
2021-03-23 15:37:34 +00:00
Michał Niewrzał
237782813b Merge remote-tracking branch 'origin/multipart-upload'
Change-Id: If6c5a450b238adab55d1e0dea67d01e5f5768a9f
2021-03-23 09:44:49 +01:00
Michał Niewrzał
27ae0d1f15 satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method
At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds add NewRedundancy parameter for UpdateSegmentPieces method to give ability to do that. As a part of change NewPieces are validated against NewRedundancy.

Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576
2021-03-22 08:12:56 +00:00
Cameron Ayer
a04495713d satellite/audit: add missing logs for audit failure conditions
Among other conditions, nodes fail audits by returning incorrect
data and by reaching the max reverify count, but we weren't logging
these events. This commit adds the missing logs.

Change-Id: I80749a7e95e8cb97bc8dd7dac1e523e223114b7f
2021-03-18 17:33:11 +00:00
Michał Niewrzał
67e26aafcd Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I9b183323cb470185be22f7c648bb76917d2e6fca
2021-03-10 08:53:38 +01:00
Cameron Ayer
a44974a2f9 satellite/audit: fix pointless containment deletions
Previously if node was not found in containment, it was
given the status, 'skipped'.

We later try to delete skipped nodes from containment.

To fix this, add a new status called 'remove' to differentiate
nodes which should be skipped and nodes which should be deleted.

Change-Id: Ic09e62dc9723c89d0c9f968ce68c039114a9d74e
2021-03-02 13:40:18 -05:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Egon Elbre
12055e7864 all: minor cleanups
Change-Id: I4248dbe36a62a223b06135254b32851485a2eec1
2020-12-16 10:47:46 +00:00
Stefan Benten
494bd5db81
all: golangci-lint v1.33.0 fixes (#3985) 2020-12-05 17:01:42 +01:00
Kaloyan Raev
53b7fd7b00 satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified
We now have the piece hashes verified for all segments on all production
satellites. We can remove the code that handles the case where piece
hashes are not verified. This would make easier the migration of
services from PointerDB to the new metabase.

For consistency, PieceHashesVerified is still set to true in PointerDB
for new segments.

Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183
2020-11-24 11:09:48 +02:00
Moby von Briesen
db6bc6503d satellite/metainfo: Update metainfo RS config to more easily support multiple RS schemes.
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.

RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).

Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
2020-11-09 22:16:13 +00:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Kaloyan Raev
92a2be2abd satellite/metainfo: get away from using pb.Pointer in Metainfo Loop
As part of the Metainfo Refactoring, we need to make the Metainfo Loop
working with both the current PointerDB and the new Metabase. Thus, the
Metainfo Loop should pass to the Observer interface more specific Object
and Segment types instead of pb.Pointer.

After this change, there are still a couple of use cases that require
access to the pb.Pointer (hence we have it as a field in the
metainfo.Segment type):
1. Expired Deletion Service
2. Repair Service

It would require additional refactoring in these two services before we
are able to clean this.

Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f
2020-10-27 13:06:47 +00:00
Cameron Ayer
bb7be23115 satellite/{audit,overlay,satellitedb}: enable reporting offline audits
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.

Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
2020-10-27 10:44:46 +00:00
Kaloyan Raev
1f386db566
cmd/satellite: remove metainfo commands (#3955) 2020-10-22 13:33:09 +03:00
Kaloyan Raev
1aeb14e65e satellite/audit: do not delete expired segments
A year ago we made the audit service deleting expired segments.
Meanwhile, we introduced an expired deletetion sub-service in the
metainfo service which sole purpose is deleting expired segments.

Therefore, now we are removing this responsibility from the audit
service. It will continue to avoid reporting failures on expired
segments, but it would not delete them anymore.

We do this to cleanup responsibilities in advance of the metainfo
refactoring.

Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4
2020-10-22 08:24:16 +00:00
paul cannon
360ab17869 satellite/audit: use LastIPAndPort preferentially
This preserves the last_ip_and_port field from node lookups through
CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later
calls to (*Verifier).GetShare() can try to use that IP and port. If a
connection to the given IP and port cannot be made, or the connection
cannot be verified and secured with the target node identity, an
attempt is made to connect to the original node address instead.

A similar change is not necessary to the other Create*OrderLimits
functions, because they already replace node addresses with the cached
IP and port as appropriate. We might want to consider making a similar
change to CreateGetRepairOrderLimits(), though.

The audit situation is unique because the ramifications are especially
powerful when we get the address wrong. Failing a single audit can have
a heavy cost to a storage node. We need to make extra effort in order
to avoid imposing that cost unfairly.

Situation 1: If an audit fails because the repair worker failed to make
a DNS query (which might well be the fault on the satellite side), and
we have last_ip_and_port information available for the target node, it
would be unfair not to try connecting to that last_ip_and_port address.

Situation 2: If a node has changed addresses recently and the operator
correctly changed its DNS entry, but we don't bother querying DNS, it
would be unfair to penalize the node for our failure to connect to it.

So the audit worker must try both last_ip_and_port _and_ the node
address as supplied by the SNO.

We elect here to try last_ip_and_port first, on the grounds that (a) it
is expected to work in the large majority of cases, and (b) there
should not be any security concerns with connecting to an out-or-date
address, and (c) avoiding DNS queries on the satellite side helps
alleviate satellite operational load.

Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200
2020-10-21 13:34:40 +00:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
Kaloyan Raev
e7f2ec7ddf satellite/audit: fix sanity check for verify-piece-hashes command
The VerifyPieceHashes method has a sanity check for the number pieces to
be removed from the pointer after the audit for verifying the piece
hashes.

This sanity check failed when we executed the command on the production
satellites because the Verify command removes Fails and PendingAudits
nodes from the audit report if piece_hashes_verified = false.

A new temporary UsedToVerifyPieceHashes flag is added to
audits.Verifier. It is set to true only by the verify-piece-hashes
command. If the flag is true then the Verify method will always include
Fails and PendingAudits nodes in the report.

Test case is added to cover this use case.

Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8
2020-10-07 17:17:48 +03:00
Yingrong Zhao
c085a17a52 bump common and uplink to latest
Change-Id: I717f0214dd9973acd51b7732c5d64587f610c805
2020-10-01 15:38:58 +00:00