Commit Graph

290 Commits

Author SHA1 Message Date
paul cannon
8b494f3740 satellite/audit: use db for auditor queue
As part of the effort of splitting out the auditor workers to their own
process, we are transitioning the communication between the auditor
chore and the verification workers to a queue implemented in the
database, rather than the sequence of in-memory queues we used to use.

This logical database is safely partitionable from the rest of
satelliteDB.

Refs: https://github.com/storj/storj/issues/5251

Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc
2022-11-22 14:04:00 +00:00
Cameron
74ddfab810 satellite/overlay: insert DQ event into node events in overlay.DisqualifyNode
Also, return node email from overlaycache db DisqualifyNode to be used
in node events insertion

Change-Id: I41534cf01351c1690c3966a8055c5fe6fcf0d6a6
2022-11-04 15:18:31 +00:00
Cameron
f06da25c3d satellite/overlay: add nodeevents.DB to satellite overlay service
Add nodeevents.DB to satellite overlay service so we can insert node
events into the nodeevents DB.

Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156
2022-11-02 15:56:37 +00:00
JT Olio
58a9c55f36 mod: bump dependencies
- storj.io/common

Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8
2022-10-19 17:01:53 +00:00
Cameron
a52f766273 satellite/overlay: add email-sending functionality to overlay service
We want to send emails to SNOs. Node status changes go through the
overlay service, so it's a good place to add the mail service.
Add the mailservice.Service, satellite address, and satellite name to
overlay service. Also add feature flag --overlay.send-node-emails

Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24
2022-10-13 18:01:05 +00:00
Egon Elbre
8b70f969b6 all: fix nolint directives
Change-Id: I261c8b12e4961e6401cc4024fa5abc35b1a5efa6
2022-10-11 18:31:20 +00:00
Egon Elbre
ff22fc7ddd all: fix deprecated ioutil commands
Change-Id: I59db35116ec7215a1b8e2ae7dbd319fa099adfac
2022-10-11 15:27:29 +00:00
Michal Niewrzal
5dc5f076c9 satellite/repair/checker: remove monitoring from fast methods
It looks that monikt monitoring can give high CPU overhead for
segments loop observer. With this code we are changing how monitoring
is initialized for observer methods. This optimization affects mainly
path where segment is healthy and doesn't require repair. Benchmark
is also added to show difference between old and new approach.

Benchmark against 'main':
name                                       old time/op    new time/op    delta
RemoteSegment/Cockroach/healthy_segment-8    8.55µs ± 4%    1.37µs ± 6%  -84.03%  (p=0.008 n=5+5)

name                                       old alloc/op   new alloc/op   delta
RemoteSegment/Cockroach/healthy_segment-8    2.63kB ± 0%    0.17kB ± 0%  -93.62%  (p=0.008 n=5+5)

name                                       old allocs/op  new allocs/op  delta
RemoteSegment/Cockroach/healthy_segment-8      54.0 ± 0%       8.0 ± 0%  -85.19%  (p=0.008 n=5+5)

Change-Id: Ie138eab0d59e436395b13f57bdfb11f9871d4c18
2022-10-03 12:15:03 +00:00
Michal Niewrzal
1aecca1e76 satellite/repair/checker: tiny cleanup
* unused slice removed
* variable moved closer to place of use

Change-Id: I86126b8337225d4b31cabf89bc9640add7409398
2022-09-26 11:20:10 +00:00
paul cannon
7f1cad6faf satellite/repair: better handling of piece fetch errors
We have an alert on `repair_too_many_nodes_failed` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the repair framework to act in that way.

Instead, in this change, we change the way that
`repair_too_many_nodes_failed` works. When a repair fails, we collect
piece fetch errors by type and determine from them whether it looks like
we are having network problems (most errors are connection failures,
possibly also some successful connections which subsequently time out)
or whether something else has happened.

We will now only emit `repair_too_many_nodes_failed` when the outcome
does not look like a network failure. In the network failure case, we
will instead emit `repair_suspected_network_problem`.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926
2022-09-23 09:35:06 +00:00
paul cannon
7d0885bbaa satellite/repair: move over audit.Pieces
This structure is entirely unused within the audit module, and is only
used by repair code. Accordingly, this change moves the structure from
audit code to repair code.

Also, we take the opportunity here to rename the structure to something
less generic.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e
2022-09-22 16:43:03 +00:00
Márton Elek
4b1be6bf8e storagenode/satellite: support different piece hash algorithms
Change-Id: I3db321e79f12f3ebaa249e6c32fa37fd9615687e
2022-08-23 18:15:06 +00:00
paul cannon
0dcc0a9ee0 satellite/reputation: reconfigure lambda and alpha
This is in response to community feedback that our existing reputation
calculation is too likely to disqualify storage nodes unfairly with
extreme swings up and down.

For details and analysis, please see the data_loss_vs_dq_chance_sim.py
tool, the "tuning reputation further.ipynb" Jupyter notebook in the
storj/datascience repository, and the discussion at

    https://forum.storj.io/t/tuning-audit-scoring/14084

In brief: changing the lambda and initial-alpha parameters in this way
causes the swings in reputation to be smaller and less likely to put a
node past the disqualification threshold unfairly.

Note: this change will cause a one-time reset of all (non-disqualified)
node reputations, because the new initial alpha value of 1000 is
dramatically different, and the disqualification threshold is going to
be much higher.

Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3
2022-08-17 18:52:53 +00:00
paul cannon
37a4edbaff all: reformat comments as required by gofmt 1.19
I don't know why the go people thought this was a good idea, because
this automatic reformatting is bound to do the wrong thing sometimes,
which is very annoying. But I don't see a way to turn it off, so best to
get this change out of the way.

Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3
2022-08-10 18:24:55 +00:00
Michal Niewrzal
6cc2052f47 satellite: fix segment loop observers metrics
We made optimization for segment loop observers to avoid
heavy monkit initialization on each call. It was applied to very
often executed methods. Unfortunately we used wrong monkit
method to track function times. Instead mon.Task we used
mon.Func().

https://github.com/spacemonkeygo/monkit#how-it-works

Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8
2022-08-10 14:13:16 +00:00
paul cannon
726c95160b satellite/repair: avoid retrying GET_REPAIR incorrectly
We retry a GET_REPAIR operation in one case, and one case only (as far
as I can determine): when we are trying to connect to a node using its
last known working IP and port combination rather than its supplied
hostname, and we think the operation failed the first time because of a
Dial failure.

However, logs collected from storage node operators along with logs
collected from satellites are strongly indicating that we are retrying
GET_REPAIR operations in some cases even when we succeeded in connecting
to the node the first time. This results in the node complaining loudly
about being given a duplicate order limit (as it should), whereupon the
satellite counts that as an unknown error and potentially penalizes the
node.

See discussion at
https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36
.

Investigation into this problem has revealed that
`!piecestore.CloseError.Has(err)` may not be the best way of determining
whether a problem occurred during Dial. In fact, it is probably
downright Wrong. Handling of errors on a stream is somewhat complicated,
but it would appear that there are several paths by which an RPC error
originating on the remote side might show up during the Close() call,
and would thus be labeled as a "CloseError".

This change creates a new error class, repairer.ErrDialFailed, with
which we will now wrap errors that _really definitely_ occurred during
a Dial call. We will use this class to determine whether or not to retry
a GET_REPAIR operation. The error will still also be wrapped with
whatever wrapper classes it used to be wrapped with, so the potential
for breakage here should be minimal.

Refs: https://github.com/storj/storj/issues/4687
Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb
2022-07-18 05:11:56 +00:00
paul cannon
2f20bbf4d8 satellite/reputation: add a reputation write cache
This should lower the amount of database load coming from
reputation updates.

Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949
2022-07-14 21:40:16 +00:00
Egon Elbre
48b0a65fbd satellite/overlay: use ReadCache in Download/UploadSelectionCache
sync2.ReadCache implements preemptive refreshing preventing stalling
while it's being updated.

Change-Id: Iee9ef36049b986f0e426c14a139b2bc9ac17fb53
2022-07-12 13:52:48 +03:00
Erik van Velzen
f23d5eb5a1 satellite/repair: remove superfluous conditional
Change-Id: If80ae0a1a4ee436763ed437fc77b0ed26db17a68
2022-06-30 18:09:17 +00:00
Michał Niewrzał
7a2d2a36ca satellite: use more optimal monkit call for loop observers methods
Recently we applied this optimization to metrics observer and time
used by its method dropped from 12m to 3m for us1 (220m segments).
It looks that it make sense to apply the same code to all observers.

Change-Id: I05898aaacbd9bcdf21babc7be9955da1db57bdf2
2022-05-20 11:03:41 +00:00
Erik van Velzen
db1cc8ca95 satellite/repair/checker: buffer repair queue
Integrate previous changes. Speed up the segment loop by batch inserting
into repair queue.

Change-Id: Ib9f4962d91960d21bad298f7771345b0dd270276
2022-05-12 16:28:05 +00:00
Erik van Velzen
928375a67c satellite/repair/queue: buffer batch insert
Implement a buffer for inserting repair items into the queue in a batch.

Part of https://github.com/storj/storj/issues/4727

Change-Id: I718472b2f2b1f4993c3d6f15c44923776407155a
2022-05-11 09:02:20 +00:00
paul cannon
fd01c6cc25 satellite/{repair,audit}: simplify reputation reporter
Also, make it an interface so that the upcoming write cache can be
dropped in to the same place.

Change-Id: I2c286743825e647c0cef5b6578245391851fa10c
2022-05-10 14:04:43 +00:00
Erik van Velzen
26f495f717 satellite/repair: implementation of batch insert
Part of https://github.com/storj/storj/issues/4727

Change-Id: I44990a7614af26f8ee0be9c7aed496a1dd9e5df7
2022-05-09 12:41:22 +00:00
Erik van Velzen
10d71a8a3c satellite/satellitedb: outline for batch insert
Part of https://github.com/storj/storj/issues/4727

Change-Id: I1a9ad3b009f363e37f5e68e810074eecb7448db3
2022-05-09 11:39:52 +00:00
Fadila Khadar
af1f0aa943 satellite/repair: lighten tests covering excluded countries
TestSegmentInExcludedCountriesRepair and TestSegmentInExcludedCountriesRepairIrreparable are using 20 storage nodes.
This change make them use 7 by adjusting the test redundancy scheme.

Change-Id: I1a44aa8b997d6edcc9a3305fdd0dac57e4d525b5
2022-05-04 07:51:59 +00:00
Yaroslav Vorobiov
3f47d19aa6 satellite/overlay: add disqualification reason
Add disqualification reason to NodeDossier.
Extend DB.DisqualifyNode with disqualification reason.
Extend reputation Service.TestDisqualifyNode with disqualification reason.

Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d
2022-04-20 13:29:31 +00:00
paul cannon
985ccbe721 satellite/repair: in dns redial, don't retry if CloseError
To save load on DNS servers, the repair code first tries to dial the
last known good ip and port for a node, and then falls back to a DNS
lookup only if we fail to connect to the last known good ip and port.

However, it looks like we are seeing errors during the client stream
Close() call (probably due to quic-go code), and those are classified
the same as errors encountered during Dial. The repairer code sees this
error, assumes that we failed to contact the node, and retries- but
since we did actually succeed in connecting the first time around, this
results in submitting the same order limit (with the same serial number)
to the storage node, which (rightfully) rejects it.

So together with change I055c186d5fd4e79560f67763175bc3130b9bc7d2 in
storj/uplink, this should avoid the double submission and avoid dinging
nodes' suspension scores unfairly.

See https://github.com/storj/storj/issues/4687.

Also, moving the testsuite directory check up above check-monkit in the
Jenkins Lint task, so that a non-tidy testsuite/go.mod can be recognized
and handled before everything breaks weirdly and seemingly randomly
later on.

Change-Id: Icb2b05aaff921d0af6aba10e450ac7e0a7bb2655
2022-04-04 17:01:09 +00:00
Egon Elbre
566fc8ee25 satellite/repair: test inmemory/disk difference only once
We don't need to have every single test for both, only one for
each should be sufficient. For all other tests it doesn't matter
which one we use.

Change-Id: I9962206a4ee025d367332c29ea3e6bc9f0f9a1de
2022-03-29 14:08:13 +03:00
Qweder93
8b0988708a satellite/repair: add test that confirms that repairer is ignoring copied segments
Resolves https://github.com/storj/storj/issues/4485

Change-Id: Ic772643520124fe3f7eacf8b3bfbbb38982d4769
2022-03-16 09:00:34 +00:00
Fadila Khadar
29fd36a20e satellite/repairer: handle excluded countries
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.

Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
2022-03-14 10:59:36 -04:00
Márton Elek
b3675c14d4 repairer: log piece id in case of a repair error
Change-Id: Ia8da2da491a6674f669e62148fa42538278119ba
2022-03-09 17:34:14 +00:00
Fadila Khadar
e776c65172 satellite/checker: pieces in excluded countries are not healthy
Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection.

Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.

Another change will handle the logic at the repairer level.

Fixes https://github.com/storj/team-metainfo/issues/95

Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
2022-03-02 09:59:09 +00:00
paul cannon
12b3fb5fb0 cmd/satellite: add fetch-pieces command
The "satellite fetch-pieces" command allows a satellite operator to
fetch as many pieces of a segment as possible, along with their
original order limits and hashes as provided by the storage nodes. The
fetched pieces and associated info will be stored on in a specified
folder as they are, rather than being RS-decoded or decrypted.

It is hoped that this will allow easier debugging of certain one-off
problems we've observed in the wild.

Change-Id: I42ae0e9ef0023538e42473a9be5a2460a3ac0f3a
2022-02-18 00:13:53 +00:00
Michał Niewrzał
bc161794fc satellite/metabase: drop DeleteObjectLatestVersion method
This method was never used, except tests.

Change-Id: Idc1e69b2e2971995b5c4e6cf78a2b5fc69f39ad2
2022-02-02 14:33:48 +00:00
Yingrong Zhao
1f8f7ebf06 satellite/{audit, reputation}: fix potential nodes reputation status
inconsistency

The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.

Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
2022-01-06 21:05:59 +00:00
Yingrong Zhao
336500c04d satellite/repair: only record audit result if segment can be downloaded
If satellite can't find enough nodes to successfully download a segment,
it probably is not the fault of storage nodes.

Change-Id: I681f66056df0bb940da9edb3a7dbb3658c0a56cb
2021-11-17 15:25:43 +00:00
Artur M. Wolff
89639199fe satellite/repair/repairer: remove unused healthyMap
This change removes unused healthyMap from (*SegmentRepairer).Repair.

Change-Id: Ie80eefdb5b7125bf70986cb13462eee737af214c
2021-11-16 16:49:18 +01:00
Yingrong Zhao
35e4a87e60 satellite/repair: ignore expired segments at the beginning of the repair
work

Since we have changed the repair worker to also mark a node as audit
failure if they return a not found error, we should ignore expired
segments when possible

Change-Id: Ie6a677e1d7b234e93965c736d05950440236653c
2021-10-18 18:15:39 +00:00
Michał Niewrzał
1fdb0eaa5b Revert "satellite/metabase: use storj.Nonce instead []byte"
This change introduce problems with server side move so
let's revert it for now. Problem was found when latest
version of storj/storj was used in uplink tests.

This reverts commit 1ef06fae99.

Change-Id: I4d4fad5d1ea04ba15ff9d7bd765f7e078e9187c2
2021-10-12 15:39:54 +02:00
Michał Niewrzał
1ef06fae99 satellite/metabase: use storj.Nonce instead []byte
We were using mixed types for nonce fields. Protobuf
have storj.Nonce, metabase have []byte. This change
is a refactoring to have everywere its possible only
storj.Nonce.

Change-Id: Id54bd8481f30c721cdaf3df79206d25e7cfdab55
2021-10-11 16:13:34 +00:00
Yaroslav Vorobiov
4b79f5ea86 satellite/repair: test if audit scores increases during repair
Update repair tests to check if audit score increases for nodes
that successfully send pieces during successfull and failed repairs.

Change-Id: Ie6abbde6155ab4697d209366c9fa497e731756e9
2021-10-04 19:39:13 +00:00
Yaroslav Vorobiov
469ae72c19 satellite/repair: update audit records during repair
Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0
2021-09-24 00:48:13 +00:00
Michał Niewrzał
c258f4bbac private/testplanet: move Metabase outside Metainfo for satellite
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.

Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
2021-09-09 07:15:51 +00:00
Cameron Ayer
a7cda642a5 satellite/repair: add logging for irreparable segments in checker
If the checker sees an irreparable segment, log out some info
so we can see what the problem is

Change-Id: I76eda5270214205f4fefc646d6c391cc13ddcafd
2021-09-02 12:35:29 -04:00
Cameron Ayer
51fdceafef satellite/repair: increment repair_too_many_nodes_failed with 0 for redash alerting
Change-Id: I990c8df7be30493705278b24954262834a1ed81f
2021-08-27 17:42:11 +00:00
Cameron Ayer
26f839a445 satellite/repair/repairer: if not enough nodes for repair order limits, increment metric and log as irreparable segment
Change-Id: I4bd46f28d64278c8d463e885ad221aafb6ce7cf3
2021-08-27 13:42:28 +00:00
Cameron Ayer
dc69e1b16e satellite/repair: use mutex instead of channel to collect download errors
Change-Id: I3f958e9cc95126a25f73ccd105e614b51089edc5
2021-08-10 15:29:39 +00:00
Cameron Ayer
a8f125c671 satellite:{audit,repair}: log additional info when we can't download enough pieces
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.

In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.

Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
2021-08-09 22:57:49 +00:00
Clement Sam
1f353f3231 segment/{metabase,repair}: change segment created_at column to not accept nulls
This change adds a NOT NULL constraint to the created_at column in the segment table.
All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc)

Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52
2021-08-06 08:16:28 +00:00
Clement Sam
f06e7c5f60 segment/{metabase,repair}: add dedicated methods on metabase.Pieces
This change adds dedicated methods on metabase.Pieces to be able to add, remove pieces and also to check duplicates.

Change-Id: I21aaeff40c017c2ebe1cc85a864ae546754769cc
2021-08-03 15:12:03 +00:00
Michał Niewrzał
0d8e7905c1 satellite/repair/checker: don't return error when joining loop
Error from joining loop should not restart satellite. This will be the
same error like for loop itself. In the same way we are handling joining
error for other services that are using segment loop.

Change-Id: Idf1035ef7f78462927bd23989ed8a4ee5826c49e
2021-08-03 12:56:42 +00:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Michał Niewrzał
a883d7f582 satellite/repair/checker: fix remote_files_checked metric
While metaloop refactoring we missed metric for all
objects processed by repair checker.

Change-Id: I100f10a36c52e2651923ecaa377261752877d673
2021-07-22 14:48:08 +00:00
Cameron Ayer
449c873681 satellite/repair/repairer: attempt repair GETs using nodes' last IP and port first
Sometimes we see timeouts from DNS lookups when trying to do
repair GETs. Solution: try using node's last IP and port first.
If we can't connect, retry with DNS lookup.

Change-Id: I59e223aebb436118779fb18378f6e09d072f12be
2021-07-21 13:13:06 +00:00
Cameron Ayer
373ba8fd27 satellite/repair/repairer: metrics for repair bytes uploaded and downloaded
Change-Id: Icb0850692ecc155f6c8169edf1b045b2b546ff48
2021-07-21 09:23:19 +00:00
Michał Niewrzał
b900f6b4f9 satellite/repair/checker: move checker to segment loop
Change-Id: I04b25e4fa14c822c9524586c25bde89db2a6cad9
2021-07-01 13:51:56 +00:00
Michał Niewrzał
d53aacc058 satellite/repair: migrate to new repair_queue table
We want to use StreamID/Position to identify injured
segment. As it is hard to alter existing injuredsegments
table we are adding a new table that will replace existing
one. Old table will be dropped later.

Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
2021-06-30 17:12:24 +02:00
Michał Niewrzał
a93e47514a satellite: remove irreparabledb
This is part of metaloop refactoring. We plan to remove
irreparable at some point but there was not time for it.
Now instead refatoring it for segmentloop its just easier
to drop it.

Later we still need to drop table with migration step.

Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
2021-06-17 07:20:15 +00:00
JT Olio
da9ca0c650 testplanet/satellite: reduce the number of places default values need to be configured
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.

As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.

This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.

In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.

Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
2021-06-01 22:14:17 +00:00
Egon Elbre
10372afbe4 ci: fix lint errors
Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf
2021-05-17 13:37:31 +00:00
Cameron Ayer
3ea7aa2c7a satellite/repair/repairer: log piece hash verification failures
Piece hash verification failures during repair download are considered
audit failures, but we are not logging these occurrences. Now we log
them.

Change-Id: If456cebcfda6af7a659be3d1fc74448e681fb653
2021-05-14 15:03:15 +00:00
Egon Elbre
910eec8eee satellite/metainfo: remove MetabaseDB interface
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.

Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.

Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
2021-05-13 13:22:14 +00:00
Egon Elbre
69b149a66f mod: bump uplink
uplink stopped using zap, hence some of the private methods needed to be
changed.

Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5
2021-05-06 14:48:36 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
Michał Niewrzał
7944df20d6 storj: use multipart API
Change-Id: I10b401434e3e77468d12ecd225b41689568fd197
2021-04-26 13:15:09 +00:00
Egon Elbre
a2e20c93ae private/dbutil: use dbutil and tagsql from storj.io/private
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.

Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
2021-04-23 14:36:52 +03:00
Egon Elbre
4c9ed64f75 satellite/metabase/metaloop: move loop under metabase
Currently the loop handling is heavily related to the metabase rather
than metainfo.

metainfo over time has become related to the "public API" for accessing
the metabase data.

Currently updates monkit.lock, because monkit monitoring does not handle
ScopeNamed correctly. Needs a followup change to monitoring check.

Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584
2021-04-22 12:58:09 +03:00
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Fadila Khadar
bde367ae73 satellite/gc: check on bloom filter creation date
Check that the bloom filter creation date is earlier than the
metainfo loop system time used for db scanning.

Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4
2021-04-14 16:40:37 +00:00
Michał Niewrzał
a5224e7a6c satellite/metainfo/metaloop: use segment CreatedAt and RepairedAt
Repair checker expects to have information about CreatedAt and RepairedAt fields to calculate segment age metric.

Change-Id: I6b41df880d77133be541e14d10d91cc75759b339
2021-04-02 08:46:54 +00:00
Kaloyan Raev
035c393da0 satellite: update tests to pass etag.Reader to multipart.PutObjectPart
Change-Id: Ibe99357945ae7a91f5b5d4f87b83d425c9fa84a5
2021-03-29 13:18:11 +00:00
Michał Niewrzał
141444f6d6 satellite/repair/repairer: fix segmentAge metric
Change-Id: I146b3163aa1bfab5ee060298e6bf9822ca6820a0
2021-03-29 12:29:47 +00:00
Egon Elbre
86e698f572 pb: use *UnimplementedServer to avoid breaking API changes
Change-Id: I99a34eeb37ac4453411f273511710562a519f57a
2021-03-29 12:26:10 +03:00
Egon Elbre
f19ef4afe5 satellite/metainfo/metaloop: move loop to a separate package
Change-Id: I94c931a27c1af6062185ec62688624ec02050f11
2021-03-23 15:37:34 +00:00
Michał Niewrzał
27ae0d1f15 satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method
At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds add NewRedundancy parameter for UpdateSegmentPieces method to give ability to do that. As a part of change NewPieces are validated against NewRedundancy.

Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576
2021-03-22 08:12:56 +00:00
Egon Elbre
4c0ea717eb satellite/metainfo: remove unneeded dependencies from Loop
metainfo.Loop doesn't require buckets nor pointerdb anymore.

Also:
* fix comments
* update full iterator limit to 2500

Change-Id: I6604402868f5c34079197c407f969ac8015e63c5
2021-02-19 15:11:16 +02:00
Egon Elbre
c860b74a37 satellite/repair/checker: allow for multipart objects
We have multipart objects so we may get multiple inline segments
sequences or no segments at all for objects.

Change-Id: Ie46ee777a2db8f18f7154e3443bb9e07ecb170f7
2021-02-18 20:31:49 +02:00
Michał Niewrzał
908a96ae30 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I075aaff42ca3f5dc538356cedfccd5939c75e791
2021-02-11 11:48:23 +01:00
Cameron Ayer
4a797baa73 satellite/repair/repairer: a new set of rs_scheme tagged metrics
Change-Id: Ibecd9265da881247eeb85ba185ee8877a7243777
2021-02-09 14:19:22 +00:00
Michał Niewrzał
9a60011774 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: Ia90f29be432e207c4125f7f955c912978eabe59a
2021-02-04 09:38:08 +01:00
Kaloyan Raev
8d25c47897 satellite/repair: fix comment in TestRepairExpiredSegment
Change-Id: Ib91e81f6ba0a7f65daed157b78f7a1a108984930
2021-02-03 10:09:49 +02:00
Kaloyan Raev
038bd0a4da satellite/repair/repairer: fix repair for pending objects
https://storjlabs.atlassian.net/browse/PG-160

Change-Id: Ice7a0dcfc591bcde85a355cf95fff1eb3411f508
2021-02-02 19:50:10 +02:00
Kaloyan Raev
6f3d0c4ad5 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/repair/repair_test.go
	satellite/repair/repairer/segments.go

Change-Id: Ie51a56878bee84ad9f2d31135f984881a882e906
2021-02-02 19:19:04 +02:00
Kaloyan Raev
339d1212cd satellite/repair: don't remove expired segments from repair queue
It's impossible to time correctly this check. The segment may expire
just at the time we upload the repaired pieces to new storage nodes.
They will reject this as expired and the repair will fail.

Also, we penalize storage nodes with audit failure only if they fail
piece hash verification, i.e. return incorrect data, but only if they
have already deleted the piece.

So, it would be best if the repair service does not care about object
expiration at all. This is a responsibility of another service.

Removing this check will also simplify how we migrate this code
correctly to the metabase.

Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2
2021-02-02 16:13:01 +00:00
Kaloyan Raev
d0612199f0 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/metainfo/config.go
	satellite/metainfo/metainfo_test.go

Change-Id: I95cf3c1d020a7918795b5eec63f36112fdb86749
2021-02-01 14:32:12 +02:00
Cameron Ayer
89e682b4d7 satellite/repair/checker: add 29/80/130-52 to default repair overrides
Change-Id: I2e5a7538fdf33f3869fcb65fc88f7abb10faad79
2021-01-28 16:55:16 -05:00
Michał Niewrzał
ec88d21a3c Merge 'main' branch.
Change-Id: I6e8162d1a6caf75e89c9f9c9f9522730aebf83ae
2021-01-11 10:26:58 +01:00
Moby von Briesen
a90d6fcad8 satellite/repair/checker: Use segment health on checker insert
Do not insert the number of healthy pieces for segment health anymore.
Rather, insert the segment health calculated by our new priority
function.

Change-Id: Ieee7fb2deee89f4d79ae85bac7f577befa2a0c7f
2021-01-04 11:48:17 -05:00
Michał Niewrzał
ad3e3a38c5 Merge 'main' branch
Change-Id: Ia0db1b1f9ef3e0671d3f2208881b0abc3064e200
2021-01-04 12:13:45 +01:00
paul cannon
7246368ca1 satellite/repair: clamp totalNodes to 100 or higher
Change-Id: I239418ed3671b1cee30b0b1797dc434244e72448
2020-12-30 10:39:14 -06:00
Ethan Adams
6070018021
satellite/overlay: use AS OF SYSTEM TIME with Cockroach
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
2020-12-22 21:07:07 +02:00
Kaloyan Raev
bafc6af992 ci: remove workaround for failing tests
Change-Id: I3eb673fae6c81bee17d7437cb870d5f5ba6978d5
2020-12-21 18:07:40 +02:00
Kaloyan Raev
4d37d14929 satellite/{metrics,repair}: adjust monitoring to new metainfo loop
Change-Id: I87a2145daa5ed49bb2c08d6967baa09c0b14b4c6
2020-12-21 09:05:17 +02:00
Michal Niewrzal
f7a31308db satellite/repair: enable TestRemoveExpiredSegmentFromQueue test
Change adds ability to set `now` time during test for repair.

Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f
2020-12-18 10:58:05 +00:00
Michal Niewrzal
2111740236 Merge 'master' branch
Change-Id: Ib73af0ff3ce0e9a1547b0b9fc55bf88704f6f394
2020-12-18 09:13:24 +01:00
paul cannon
d3604a5e90 satellite/repair: use survivability model for segment health
The chief segment health models we've come up with are the "immediate
danger" model and the "survivability" model. The former calculates the
chance of losing a segment becoming lost in the next time period (using
the CDF of the binomial distribution to estimate the chance of x nodes
failing in that period), while the latter estimates the number of
iterations for which a segment can be expected to survive (using the
mean of the negative binomial distribution). The immediate danger model
was a promising one for comparing segment health across segments with
different RS parameters, as it is more precisely what we want to
prevent, but it turns out that practically all segments in production
have infinite health, as the chance of losing segments with any
reasonable estimate of node failure rate is smaller than DBL_EPSILON,
the smallest possible difference from 1.0 representable in a float64
(about 1e-16).

Leaving aside the wisdom of worrying about the repair of segments that
have less than a 1e-16 chance of being lost, we want to be extremely
conservative and proactive in our repair efforts, and the health of the
segments we have been repairing thus far also evaluates to infinity
under the immediate danger model. Thus, we find ourselves reaching for
an alternative.

Dr. Ben saves the day: the survivability model is a reasonably close
approximation of the immediate danger model, and even better, it is
far simpler to calculate and yields manageable values for real-world
segments. The downside to it is that it requires as input an estimate
of the total number of active nodes.

This change replaces the segment health calculation to use the
survivability model, and reinstates the call to SegmentHealth() where it
was reverted. It gets estimates for the total number of active nodes by
leveraging the reliability cache.

Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
2020-12-17 21:30:17 +00:00
Michal Niewrzal
70ba4deea9 satellite/repair/checker: adjust irreparable part of repair checker
Change-Id: I0732104a97ba18a5359de3966cd692677a0ff790
2020-12-17 14:11:22 +00:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Michal Niewrzal
2381ca2810 Merge 'master' branch
Change-Id: I4a3e45a2a2cdacfd87d16b148cfb4c6671c20b15
2020-12-17 13:17:17 +01:00