Commit Graph

39 Commits

Author SHA1 Message Date
Egon Elbre
0858c3797a satellite/{metabase,satellitedb}: deduplicate AS OF SYSTEM TIME code
Currently we were duplicating code for AS OF SYSTEM TIME in several
places. This replaces the code with using a method on
dbutil.Implementation.

As a consequence it's more useful to use a shorter name for
implementation - 'impl' should be sufficiently clear in the context.

Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish
between the two modes is useful and slightly shorter without causing
confusion.

Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d
2021-05-11 12:40:36 +03:00
Egon Elbre
a2e20c93ae private/dbutil: use dbutil and tagsql from storj.io/private
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.

Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
2021-04-23 14:36:52 +03:00
Cameron Ayer
28eaae66af satellite/satellitedb: drop num_healthy_pieces column from injuredsegments
This column is no longer used as it has been replaced by the segment_health
column.

Change-Id: I6b4df89cd4f994d8418976f88e8c5f57615f8115
2020-12-17 20:17:08 +00:00
Jessica Grebenschikov
0649d2b930 satellite/repair: improve contention for injuredsegments table on CRDB
We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to rollback to Postgres for the repair queue. A couple things contributed to this problem:
1) crdb doesn't support `FOR UPDATE SKIP LOCKED`
2) the original crdb Select query was doing 2 full table scans and not using any indexes
3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which run up to 5 goroutines which all are trying to select out of the repair queue and this was causing a ton of contention.

The changes in this PR should help to reduce that contention and improve performance on CRDB.
The changes include:
1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB.
- Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/).

2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability.
- While long term it is very important to `ORDER BY segment_health`, the change here is only suppose to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possible in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to rabbit MQ priority queue or something similar..

This PRs improved query uses the index to avoid full scans and also locks the row its going to update and CRDB retries for us if there are any lock errors.

Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23
2020-12-10 09:51:26 -08:00
Egon Elbre
28ea63be92 satellite/repair: avoid TestDBAccess
Change-Id: I34adb58cd67fba5917032f2f328d75b1c4afdbbf
2020-11-30 13:29:08 +02:00
JT Olio
0ba516d405 satellite: support pointing db components at different databases
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.

Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
2020-11-28 18:39:16 +00:00
Moby von Briesen
0ec685b173 satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.

This change just adds the new column in a migration. A separate change
will add the new health function.

Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.

Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.

Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
2020-11-16 21:18:09 +00:00
Egon Elbre
004e610d0f satellite/internalpb: move datarepair.pb to internal
Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d
2020-10-30 13:28:14 +02:00
Egon Elbre
2268cc1df3 all: fix linter complaints
Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95
2020-10-13 15:59:01 +03:00
Cameron Ayer
c2525ba2b5 satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.

solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.

Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
2020-09-29 20:38:22 +00:00
Cameron Ayer
e7c34a053d satellite/satellitedb: add column and index "updated_at" to injuredsegments
Change-Id: I59e9bb2077885f09e17795375fe98ed31bd83d54
2020-09-14 12:53:04 -04:00
stefanbenten
257855b5de all: replace == comparison with errors.Is
Change-Id: I05d9a369c7c6f144b94a4c524e8aea18eb9cb714
2020-07-14 15:50:25 +00:00
Jeff Wendling
254b42ff65 satellite/satellitedb: fix leaked rows from repairQueue.Insert
Change-Id: If5e62c49770f591ebe3f4d2dd4dd2658c229a022
2020-06-03 14:31:21 -06:00
Moby von Briesen
290c006a10 satellite/repair/{checker,queue}: add metric for new segments added to repair queue
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration

Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
2020-05-27 06:23:47 +00:00
Bill Thorp
e99e675fb1 satellite/satellitedb: use time zones with all timestamps
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.

Of the 14 fields that lacked time zones, only the 3 named 'interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding  AT TIME ZONE current_setting('TIMEZONE')
Field with good data are migrated by adding AT TIME ZONE 'UTC'

Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.

https://storjlabs.atlassian.net/browse/SM-104

Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce
2020-03-05 21:11:25 +00:00
Moby von Briesen
4e5a7f13c7 satellite/repair/queue: Prioritize selection of items off repair queue by segment health
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.

The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.

We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from on hour to allow us time to
try a wider variety of segments when the repair queue is very large.

Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
2020-02-26 09:54:16 -05:00
Egon Elbre
76fdb5d863 storage: add configurable lookup limits
Currently storage tests were tied to the default lookup limit.
By increasing the limits, the tests will take longer and sometimes
cause a large number of goroutines to be started.

This change adds configurable lookup limit to all storage backends.

Also remove boltdb.NewShared, since it's not used any more.

Change-Id: I1a052f149da471246fac5745da133c3cfc27582e
2020-01-22 21:35:56 +02:00
Egon Elbre
7d79aab14e satellite/satellitedb: fixes to row handling
Change-Id: I48fae692bcca152143a12f333296c42471538850
2020-01-16 17:07:26 +02:00
paul cannon
4a26fb5bd5 satellite/satellitedb: don't use crdb.ExecuteTx with postgres
crdb.ExecuteTx is great, but I don't think it will work right with
PostgreSQL. It works by way of cockroach savepoints, which allows
it to react to retryable errors, whereas tx.Commit() doesn't. But
I don't think PostgreSQL savepoints work exactly the same way. I'm not
100% sure, but it doesn't seem worth the risk.

So, I'm switching one case here to use the new dbutil.WithTx instead,
which will use crdb.ExecuteTx if appropriate. The other case doesn't
need a transaction at all.

Change-Id: I39283f3b5d8d47596db7aff5048bb74597e5918f
2020-01-06 23:51:35 +00:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
paul cannon
b5ddfc6fa5 satellite/satellitedb: unexport satellitedb.DB
Backstory: I needed a better way to pass around information about the
underlying driver and implementation to all the various db-using things
in satellitedb (at least until some new "cockroach driver" support makes
it to DBX). After hitting a few dead ends, I decided I wanted to have a
type that could act like a *dbx.DB but which would also carry
information about the implementation, etc. Then I could pass around that
type to all the things in satellitedb that previously wanted *dbx.DB.

But then I realized that *satellitedb.DB was, essentially, exactly that
already.

One thing that might have kept *satellitedb.DB from being directly
usable was that embedding a *dbx.DB inside it would make a lot of dbx
methods publicly available on a *satellitedb.DB instance that previously
were nicely encapsulated and hidden. But after a quick look, I realized
that _nothing_ outside of satellite/satellitedb even needs to use
satellitedb.DB at all. It didn't even need to be exported, except for
some trivially-replaceable code in migrate_postgres_test.go. And once
I made it unexported, any concerns about exposing new methods on it were
entirely moot.

So I have here changed the exported *satellitedb.DB type into the
unexported *satellitedb.satelliteDB type, and I have changed all the
places here that wanted raw dbx.DB handles to use this new type instead.
Now they can just take a gander at the implementation member on it and
know all they need to know about the underlying database.

This will make it possible for some other pending code here to
differentiate between postgres and cockroach backends.

Change-Id: I27af99f8ae23b50782333da5277b553b34634edc
2019-12-16 19:09:30 +00:00
Jennifer Johnson
8e9532dbd9 satellitedb/repairqueue: use errs.New() instead of fmt.Errorf() to retain stack trace
Change-Id: I47f1985aaeace556e3e0f7a20d2718410936db17
2019-12-05 15:52:25 +00:00
Jennifer Johnson
7c5f777a4f satellitedb/repairqueue: runs a different implementation of the query within Select() for postgres vs cockroach
Change-Id: Ie34dbdb9d870d7d9f8f269702b6b3bad0c55b98e
2019-12-04 21:32:02 +00:00
Egon Elbre
ee6c1cac8a
private: rename internal to private (#3573) 2019-11-14 21:46:15 +02:00
Egon Elbre
3c438f31bd
satellite/satellitedb: remove sqlite support (#3296) 2019-10-19 00:27:57 +03:00
Alexander Leitner
159ad439b1
Add count to repair queue (#2661)
* Add count to repair queue
2019-07-30 11:21:40 -04:00
Egon Elbre
0cdeae1922 add missing error handling (#2630) 2019-07-25 17:01:44 +02:00
Ivan Fraixedes
3c8f1370d2
[v3 2137] - Add more info to find out repair failures (#2623)
* pkg/datarepair/repairer: Track always time for repair
  Make a minor change in the worker function of the repairer, that when
  successful, always track the metric time for repair independently if the
  time since checker queue metric can be tracked.

* storage/postgreskv: Wrap error in Get func
  Wrap the returned error of the Get function as it is done when the
  query doesn't return any row.

* satellite/metainfo: Move debug msg to the right place
  NewStore function was writing a debug log message when the DB was
  connected, however it was always writing it out despite if an error
  happened when getting the connection.

* pkg/datarepair/repairer: Wrap error before logging it
  Wrap the error returned by process which is executed by the Run method
  of the repairer service to add context to the error log message.

* pkg/datarepair/repairer: Make errors more specific in worker
  Make the error messages of the "worker" method of the Service more
  specific and the logged message for such errors.

* pkg/storage/repair: Improve error reporting Repair
  In order of improving the error reporting by the
  pkg/storage/repair.Repair method, several errors of this method and
  functions/methods which this one relies one have been updated to be
  wrapper into their corresponding classes.

* pkg/storage/segments: Track path param of Repair method
  Track in monkit the path parameter passed to the Repair method.

* satellite/satellitedb: Wrap Error returned by Delete
  Wrap the error returned by repairQueue.Delete method to enhance the
  error with a class and stack and the
  pkg/storage/segments.Repairer.Repair method get a more contextualized
  error from it.
2019-07-23 16:28:06 +02:00
Maximillian von Briesen
b590e53d64
Order by attempted time in injured segments select (#2533) 2019-07-12 13:35:20 -04:00
Michal Niewrzal
268c629ba8
Replace base64 encoding for path segments (#2345) 2019-07-11 13:26:07 -04:00
JT Olio
29d16b4d68 satellite: add monkit task to missing places (#2108) 2019-06-04 13:55:37 +02:00
Bill Thorp
a6c4019288
using DB time only (#2018)
* using DB time only, using UTC
2019-05-21 12:50:55 -04:00
Egon Elbre
f7ed63a119
handle database error checks properly (#1796) 2019-04-23 14:13:57 +03:00
Bill Thorp
17a227e6e9
refactor injuredsegments db so that we can't have duplicates (#1717)
made repairqueue not use a true queue, forbid duplicates
2019-04-16 14:14:09 -04:00
Bill Thorp
665fd33e3c
Repair queue isolation level fix (#1466)
Implemented custom SQLite and Postgres Repairqueue Dequeue handlers
2019-03-14 17:12:47 -04:00
Jennifer Li Johnson
856b98997c
updates copyright 2018 to 2019 (#1133) 2019-01-24 15:15:10 -05:00
Michal Niewrzal
b712fbcbb0
Fix 'empty queue' error when satellite starts (#939) 2019-01-02 17:00:32 +01:00
Egon Elbre
4346cd060f
Implement mutex around satellitedb (#932) 2018-12-27 11:56:25 +02:00
Cameron
f70b826fd4
repair queue masterDB support (#865)
* add injuredsegment model to satellitedb.dbx

* add context to queue.RepairQueue interface

* use queue.RepairQueue interface, use masterdb
2018-12-21 10:11:19 -05:00