Commit Graph

181 Commits

Author SHA1 Message Date
JT Olio
da9ca0c650 testplanet/satellite: reduce the number of places default values need to be configured
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.

As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.

This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.

In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.

Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
2021-06-01 22:14:17 +00:00
Egon Elbre
10372afbe4 ci: fix lint errors
Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf
2021-05-17 13:37:31 +00:00
Cameron Ayer
3ea7aa2c7a satellite/repair/repairer: log piece hash verification failures
Piece hash verification failures during repair download are considered
audit failures, but we are not logging these occurrences. Now we log
them.

Change-Id: If456cebcfda6af7a659be3d1fc74448e681fb653
2021-05-14 15:03:15 +00:00
Egon Elbre
910eec8eee satellite/metainfo: remove MetabaseDB interface
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.

Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.

Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
2021-05-13 13:22:14 +00:00
Egon Elbre
69b149a66f mod: bump uplink
uplink stopped using zap, hence some of the private methods needed to be
changed.

Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5
2021-05-06 14:48:36 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
Michał Niewrzał
7944df20d6 storj: use multipart API
Change-Id: I10b401434e3e77468d12ecd225b41689568fd197
2021-04-26 13:15:09 +00:00
Egon Elbre
a2e20c93ae private/dbutil: use dbutil and tagsql from storj.io/private
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.

Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
2021-04-23 14:36:52 +03:00
Egon Elbre
4c9ed64f75 satellite/metabase/metaloop: move loop under metabase
Currently the loop handling is heavily related to the metabase rather
than metainfo.

metainfo over time has become related to the "public API" for accessing
the metabase data.

Currently updates monkit.lock, because monkit monitoring does not handle
ScopeNamed correctly. Needs a followup change to monitoring check.

Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584
2021-04-22 12:58:09 +03:00
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Fadila Khadar
bde367ae73 satellite/gc: check on bloom filter creation date
Check that the bloom filter creation date is earlier than the
metainfo loop system time used for db scanning.

Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4
2021-04-14 16:40:37 +00:00
Michał Niewrzał
a5224e7a6c satellite/metainfo/metaloop: use segment CreatedAt and RepairedAt
Repair checker expects to have information about CreatedAt and RepairedAt fields to calculate segment age metric.

Change-Id: I6b41df880d77133be541e14d10d91cc75759b339
2021-04-02 08:46:54 +00:00
Kaloyan Raev
035c393da0 satellite: update tests to pass etag.Reader to multipart.PutObjectPart
Change-Id: Ibe99357945ae7a91f5b5d4f87b83d425c9fa84a5
2021-03-29 13:18:11 +00:00
Michał Niewrzał
141444f6d6 satellite/repair/repairer: fix segmentAge metric
Change-Id: I146b3163aa1bfab5ee060298e6bf9822ca6820a0
2021-03-29 12:29:47 +00:00
Egon Elbre
86e698f572 pb: use *UnimplementedServer to avoid breaking API changes
Change-Id: I99a34eeb37ac4453411f273511710562a519f57a
2021-03-29 12:26:10 +03:00
Egon Elbre
f19ef4afe5 satellite/metainfo/metaloop: move loop to a separate package
Change-Id: I94c931a27c1af6062185ec62688624ec02050f11
2021-03-23 15:37:34 +00:00
Michał Niewrzał
27ae0d1f15 satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method
At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds add NewRedundancy parameter for UpdateSegmentPieces method to give ability to do that. As a part of change NewPieces are validated against NewRedundancy.

Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576
2021-03-22 08:12:56 +00:00
Egon Elbre
4c0ea717eb satellite/metainfo: remove unneeded dependencies from Loop
metainfo.Loop doesn't require buckets nor pointerdb anymore.

Also:
* fix comments
* update full iterator limit to 2500

Change-Id: I6604402868f5c34079197c407f969ac8015e63c5
2021-02-19 15:11:16 +02:00
Egon Elbre
c860b74a37 satellite/repair/checker: allow for multipart objects
We have multipart objects so we may get multiple inline segments
sequences or no segments at all for objects.

Change-Id: Ie46ee777a2db8f18f7154e3443bb9e07ecb170f7
2021-02-18 20:31:49 +02:00
Michał Niewrzał
908a96ae30 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: I075aaff42ca3f5dc538356cedfccd5939c75e791
2021-02-11 11:48:23 +01:00
Cameron Ayer
4a797baa73 satellite/repair/repairer: a new set of rs_scheme tagged metrics
Change-Id: Ibecd9265da881247eeb85ba185ee8877a7243777
2021-02-09 14:19:22 +00:00
Michał Niewrzał
9a60011774 Merge remote-tracking branch 'origin/main' into multipart-upload
Change-Id: Ia90f29be432e207c4125f7f955c912978eabe59a
2021-02-04 09:38:08 +01:00
Kaloyan Raev
8d25c47897 satellite/repair: fix comment in TestRepairExpiredSegment
Change-Id: Ib91e81f6ba0a7f65daed157b78f7a1a108984930
2021-02-03 10:09:49 +02:00
Kaloyan Raev
038bd0a4da satellite/repair/repairer: fix repair for pending objects
https://storjlabs.atlassian.net/browse/PG-160

Change-Id: Ice7a0dcfc591bcde85a355cf95fff1eb3411f508
2021-02-02 19:50:10 +02:00
Kaloyan Raev
6f3d0c4ad5 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/repair/repair_test.go
	satellite/repair/repairer/segments.go

Change-Id: Ie51a56878bee84ad9f2d31135f984881a882e906
2021-02-02 19:19:04 +02:00
Kaloyan Raev
339d1212cd satellite/repair: don't remove expired segments from repair queue
It's impossible to time correctly this check. The segment may expire
just at the time we upload the repaired pieces to new storage nodes.
They will reject this as expired and the repair will fail.

Also, we penalize storage nodes with audit failure only if they fail
piece hash verification, i.e. return incorrect data, but only if they
have already deleted the piece.

So, it would be best if the repair service does not care about object
expiration at all. This is a responsibility of another service.

Removing this check will also simplify how we migrate this code
correctly to the metabase.

Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2
2021-02-02 16:13:01 +00:00
Kaloyan Raev
d0612199f0 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/metainfo/config.go
	satellite/metainfo/metainfo_test.go

Change-Id: I95cf3c1d020a7918795b5eec63f36112fdb86749
2021-02-01 14:32:12 +02:00
Cameron Ayer
89e682b4d7 satellite/repair/checker: add 29/80/130-52 to default repair overrides
Change-Id: I2e5a7538fdf33f3869fcb65fc88f7abb10faad79
2021-01-28 16:55:16 -05:00
Michał Niewrzał
ec88d21a3c Merge 'main' branch.
Change-Id: I6e8162d1a6caf75e89c9f9c9f9522730aebf83ae
2021-01-11 10:26:58 +01:00
Moby von Briesen
a90d6fcad8 satellite/repair/checker: Use segment health on checker insert
Do not insert the number of healthy pieces for segment health anymore.
Rather, insert the segment health calculated by our new priority
function.

Change-Id: Ieee7fb2deee89f4d79ae85bac7f577befa2a0c7f
2021-01-04 11:48:17 -05:00
Michał Niewrzał
ad3e3a38c5 Merge 'main' branch
Change-Id: Ia0db1b1f9ef3e0671d3f2208881b0abc3064e200
2021-01-04 12:13:45 +01:00
paul cannon
7246368ca1 satellite/repair: clamp totalNodes to 100 or higher
Change-Id: I239418ed3671b1cee30b0b1797dc434244e72448
2020-12-30 10:39:14 -06:00
Ethan Adams
6070018021
satellite/overlay: use AS OF SYSTEM TIME with Cockroach
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
2020-12-22 21:07:07 +02:00
Kaloyan Raev
bafc6af992 ci: remove workaround for failing tests
Change-Id: I3eb673fae6c81bee17d7437cb870d5f5ba6978d5
2020-12-21 18:07:40 +02:00
Kaloyan Raev
4d37d14929 satellite/{metrics,repair}: adjust monitoring to new metainfo loop
Change-Id: I87a2145daa5ed49bb2c08d6967baa09c0b14b4c6
2020-12-21 09:05:17 +02:00
Michal Niewrzal
f7a31308db satellite/repair: enable TestRemoveExpiredSegmentFromQueue test
Change adds ability to set `now` time during test for repair.

Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f
2020-12-18 10:58:05 +00:00
Michal Niewrzal
2111740236 Merge 'master' branch
Change-Id: Ib73af0ff3ce0e9a1547b0b9fc55bf88704f6f394
2020-12-18 09:13:24 +01:00
paul cannon
d3604a5e90 satellite/repair: use survivability model for segment health
The chief segment health models we've come up with are the "immediate
danger" model and the "survivability" model. The former calculates the
chance of losing a segment becoming lost in the next time period (using
the CDF of the binomial distribution to estimate the chance of x nodes
failing in that period), while the latter estimates the number of
iterations for which a segment can be expected to survive (using the
mean of the negative binomial distribution). The immediate danger model
was a promising one for comparing segment health across segments with
different RS parameters, as it is more precisely what we want to
prevent, but it turns out that practically all segments in production
have infinite health, as the chance of losing segments with any
reasonable estimate of node failure rate is smaller than DBL_EPSILON,
the smallest possible difference from 1.0 representable in a float64
(about 1e-16).

Leaving aside the wisdom of worrying about the repair of segments that
have less than a 1e-16 chance of being lost, we want to be extremely
conservative and proactive in our repair efforts, and the health of the
segments we have been repairing thus far also evaluates to infinity
under the immediate danger model. Thus, we find ourselves reaching for
an alternative.

Dr. Ben saves the day: the survivability model is a reasonably close
approximation of the immediate danger model, and even better, it is
far simpler to calculate and yields manageable values for real-world
segments. The downside to it is that it requires as input an estimate
of the total number of active nodes.

This change replaces the segment health calculation to use the
survivability model, and reinstates the call to SegmentHealth() where it
was reverted. It gets estimates for the total number of active nodes by
leveraging the reliability cache.

Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
2020-12-17 21:30:17 +00:00
Michal Niewrzal
70ba4deea9 satellite/repair/checker: adjust irreparable part of repair checker
Change-Id: I0732104a97ba18a5359de3966cd692677a0ff790
2020-12-17 14:11:22 +00:00
Kaloyan Raev
9aa61245d0 satellite/audits: migrate to metabase
Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2
2020-12-17 14:38:48 +02:00
Michal Niewrzal
2381ca2810 Merge 'master' branch
Change-Id: I4a3e45a2a2cdacfd87d16b148cfb4c6671c20b15
2020-12-17 13:17:17 +01:00
Michal Niewrzal
8d3ea9c251 satellite/repair/repairer: implement SegmentRepairer with metabase
Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c
2020-12-17 10:47:21 +00:00
Cameron Ayer
8c52bb3a18 satellite/checker: use numHealthy as segment health in repair queue
A few weeks ago it was discovered that the segment health function
was not working as expected with production values. As a bandaid,
we decided to insert the number of healthy pieces into the segment
health column. This should have effectively reverted our means of
prioritizing repair to the previous implementation.

However, it turns out that the bandaid was placed into the code which
removes items from the irreparable db and inserts them into the repair
queue.

This change: insert number of healthy pieces into the repair queue in the
method, RemoteSegment

Change-Id: Iabfc7984df0a928066b69e9aecb6f615253f1ad2
2020-12-15 17:16:59 -05:00
Cameron Ayer
2ac72eaf16 satellite/repair/checker: add new monkit stats tagged with rs scheme
There is a new checker field called statsCollector. This contains
a map of stats pointers where the key is a stringified redundancy
scheme. stats contains all tagged monkit metrics. These metrics exist
under the key name, "tagged_repair_stats", which is tagged with the
name of each metric and a corresponding rs scheme.

As the metainfo observer works on a segment, it checks statsCollector
for a stats corresponding to the segment's redundancy scheme. If one
doesn't exist, it is created and chained to the monkit scope. Now we can call
Observe, Inc, etc on the fields just like before, and they have tags!

durabilityStats has also been renamed to aggregateStats.

At the end of the metainfo loop, we insert the aggregateStats totals into the
corresponding stats fields for metric reporting.

Change-Id: I8aa1918351d246a8ef818b9712ed4cb39d1ea9c6
2020-12-15 14:08:01 +00:00
Michal Niewrzal
934ae32ca4 satellite/repair/checker: fix checker tests
Change-Id: I63d3368a07b800fdb10bb93b847eb32927b8c0dc
2020-12-15 10:47:42 +00:00
Michal Niewrzal
57f374af24 Merge 'master' branch
Change-Id: Idf6b10ea7ca94e4d232e6a3b6a38ef2e646ba197
2020-12-15 08:26:53 +01:00
Kaloyan Raev
fc85179a19 satellite/metainfo: refactor SegmentLocation.Index to SegmentPosition
Change-Id: Ic9403c8126712693326dd83d6ba4f3b84be3e0c7
2020-12-14 13:35:53 +02:00
Jessica Grebenschikov
0649d2b930 satellite/repair: improve contention for injuredsegments table on CRDB
We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to rollback to Postgres for the repair queue. A couple things contributed to this problem:
1) crdb doesn't support `FOR UPDATE SKIP LOCKED`
2) the original crdb Select query was doing 2 full table scans and not using any indexes
3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which run up to 5 goroutines which all are trying to select out of the repair queue and this was causing a ton of contention.

The changes in this PR should help to reduce that contention and improve performance on CRDB.
The changes include:
1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB.
- Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/).

2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability.
- While long term it is very important to `ORDER BY segment_health`, the change here is only suppose to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possible in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to rabbit MQ priority queue or something similar..

This PRs improved query uses the index to avoid full scans and also locks the row its going to update and CRDB retries for us if there are any lock errors.

Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23
2020-12-10 09:51:26 -08:00
Stefan Benten
494bd5db81
all: golangci-lint v1.33.0 fixes (#3985) 2020-12-05 17:01:42 +01:00
Egon Elbre
28ea63be92 satellite/repair: avoid TestDBAccess
Change-Id: I34adb58cd67fba5917032f2f328d75b1c4afdbbf
2020-11-30 13:29:08 +02:00