storj

Author	SHA1	Message	Date
Michał Niewrzał	d53aacc058	satellite/repair: migrate to new repair_queue table We want to use StreamID/Position to identify injured segment. As it is hard to alter existing injuredsegments table we are adding a new table that will replace existing one. Old table will be dropped later. Change-Id: I0d3b06522645013178b6678c19378ebafe485c49	2021-06-30 17:12:24 +02:00
Egon Elbre	0858c3797a	satellite/{metabase,satellitedb}: deduplicate AS OF SYSTEM TIME code Currently we were duplicating code for AS OF SYSTEM TIME in several places. This replaces the code with using a method on dbutil.Implementation. As a consequence it's more useful to use a shorter name for implementation - 'impl' should be sufficiently clear in the context. Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish between the two modes is useful and slightly shorter without causing confusion. Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d	2021-05-11 12:40:36 +03:00
Egon Elbre	a2e20c93ae	private/dbutil: use dbutil and tagsql from storj.io/private Initially we duplicated the code to avoid large scale changes to the packages. Now we are past metainfo refactor we can remove the duplication. Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef	2021-04-23 14:36:52 +03:00
Cameron Ayer	28eaae66af	satellite/satellitedb: drop num_healthy_pieces column from injuredsegments This column is no longer used as it has been replaced by the segment_health column. Change-Id: I6b4df89cd4f994d8418976f88e8c5f57615f8115	2020-12-17 20:17:08 +00:00
Jessica Grebenschikov	0649d2b930	satellite/repair: improve contention for injuredsegments table on CRDB We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to rollback to Postgres for the repair queue. A couple things contributed to this problem: 1) crdb doesn't support `FOR UPDATE SKIP LOCKED` 2) the original crdb Select query was doing 2 full table scans and not using any indexes 3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which runs up to 5 goroutines which are all trying to select out of the repair queue and this was causing a ton of contention. The changes in this PR should help to reduce that contention and improve performance on CRDB. The changes include: 1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB. - Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/). 2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability. - While long term it is very important to `ORDER BY segment_health`, the change here is only supposed to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term.
However, long term this needs to be fixed with a redesign of the repair workers, possibly in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to a RabbitMQ priority queue or something similar. This PR's improved query uses the index to avoid full scans and also locks the row it's going to update, and CRDB retries for us if there are any lock errors. Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23	2020-12-10 09:51:26 -08:00
Egon Elbre	28ea63be92	satellite/repair: avoid TestDBAccess Change-Id: I34adb58cd67fba5917032f2f328d75b1c4afdbbf	2020-11-30 13:29:08 +02:00
JT Olio	0ba516d405	satellite: support pointing db components at different databases the immediate need is to be able to move the repair queue back out of cockroach if we can't save it. Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326	2020-11-28 18:39:16 +00:00
Moby von Briesen	0ec685b173	satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue We plan to add support for a new Reed-Solomon scheme soon, but our repair queue orders segments by least number of healthy pieces first. With a second RS scheme, fewer healthy pieces will not necessarily correlate to lower health. This change just adds the new column in a migration. A separate change will add the new health function. Right now, since we only support one RS scheme, behavior will not change. Number of healthy pieces is being inserted as "segment health" until the new health function is merged. Segment health is calculated with a new priority function created in commit `3e5640359`. In order to use the function, a new config value is added, called NodeFailureRate, representing the approximate probability of any individual node going down in the duration of one checker run. Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2	2020-11-16 21:18:09 +00:00
Egon Elbre	004e610d0f	satellite/internalpb: move datarepair.pb to internal Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d	2020-10-30 13:28:14 +02:00
Egon Elbre	2268cc1df3	all: fix linter complaints Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95	2020-10-13 15:59:01 +03:00
Cameron Ayer	c2525ba2b5	satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration Repair workers prioritize the most unhealthy segments. This has the consequence that when we finally begin to reach the end of the queue, a good portion of the remaining segments are healthy again as their nodes have come back online. This makes it appear that there are more injured segments than there actually are. solution: Any time the checker observes an injured segment it inserts it into the repair queue or updates it if it already exists. Therefore, we can determine which segments are no longer injured if they were not inserted or updated by the last checker iteration. To do this we add a new column to the injured segments table, updated_at, which is set to the current time when a segment is inserted or updated. At the end of the checker iteration, we can delete any items where updated_at < checker start. Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94	2020-09-29 20:38:22 +00:00
Cameron Ayer	e7c34a053d	satellite/satellitedb: add column and index "updated_at" to injuredsegments Change-Id: I59e9bb2077885f09e17795375fe98ed31bd83d54	2020-09-14 12:53:04 -04:00
stefanbenten	257855b5de	all: replace == comparison with errors.Is Change-Id: I05d9a369c7c6f144b94a4c524e8aea18eb9cb714	2020-07-14 15:50:25 +00:00
Jeff Wendling	254b42ff65	satellite/satellitedb: fix leaked rows from repairQueue.Insert Change-Id: If5e62c49770f591ebe3f4d2dd4dd2658c229a022	2020-06-03 14:31:21 -06:00
Moby von Briesen	290c006a10	satellite/repair/{checker,queue}: add metric for new segments added to repair queue * add monkit stat new_remote_segments_needing_repair, which reports the number of new unhealthy segments in the repair queue since the previous checker iteration Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3	2020-05-27 06:23:47 +00:00
Bill Thorp	e99e675fb1	satellite/satellitedb: use time zones with all timestamps The migration was broken into one migration per table to reduce table locking and reduce the chances of failure due to SQL timeouts. Of the 14 fields that lacked time zones, only the 3 named 'interval_start' seemed to have non-UTC data in them. These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE'). Fields with good data are migrated by adding AT TIME ZONE 'UTC'. Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used. https://storjlabs.atlassian.net/browse/SM-104 Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce	2020-03-05 21:11:25 +00:00
Moby von Briesen	4e5a7f13c7	satellite/repair/queue: Prioritize selection of items off repair queue by segment health Add a column to the repair queue table in the satellite db for healthy piece count. When an item is selected from the repair queue, the least durable segment that has not been attempted in the past hour should be selected first. This prevents our repairer from getting stuck doing work on segments that are close to the repair threshold while allowing segments that are more unhealthy to degrade further. The migration also clears the repair queue so that the migration runs quickly and we can properly account for segment health in future repair work. We do not select items off the repair queue that have been attempted in the past six hours. This was changed from one hour to allow us time to try a wider variety of segments when the repair queue is very large. Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b	2020-02-26 09:54:16 -05:00
Egon Elbre	76fdb5d863	storage: add configurable lookup limits Currently storage tests were tied to the default lookup limit. By increasing the limits, the tests will take longer and sometimes cause a large number of goroutines to be started. This change adds configurable lookup limit to all storage backends. Also remove boltdb.NewShared, since it's not used any more. Change-Id: I1a052f149da471246fac5745da133c3cfc27582e	2020-01-22 21:35:56 +02:00
Egon Elbre	7d79aab14e	satellite/satellitedb: fixes to row handling Change-Id: I48fae692bcca152143a12f333296c42471538850	2020-01-16 17:07:26 +02:00
paul cannon	4a26fb5bd5	satellite/satellitedb: don't use crdb.ExecuteTx with postgres crdb.ExecuteTx is great, but I don't think it will work right with PostgreSQL. It works by way of cockroach savepoints, which allows it to react to retryable errors, whereas tx.Commit() doesn't. But I don't think PostgreSQL savepoints work exactly the same way. I'm not 100% sure, but it doesn't seem worth the risk. So, I'm switching one case here to use the new dbutil.WithTx instead, which will use crdb.ExecuteTx if appropriate. The other case doesn't need a transaction at all. Change-Id: I39283f3b5d8d47596db7aff5048bb74597e5918f	2020-01-06 23:51:35 +00:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
paul cannon	b5ddfc6fa5	satellite/satellitedb: unexport satellitedb.DB Backstory: I needed a better way to pass around information about the underlying driver and implementation to all the various db-using things in satellitedb (at least until some new "cockroach driver" support makes it to DBX). After hitting a few dead ends, I decided I wanted to have a type that could act like a dbx.DB but which would also carry information about the implementation, etc. Then I could pass around that type to all the things in satellitedb that previously wanted dbx.DB. But then I realized that satellitedb.DB was, essentially, exactly that already. One thing that might have kept satellitedb.DB from being directly usable was that embedding a dbx.DB inside it would make a lot of dbx methods publicly available on a satellitedb.DB instance that previously were nicely encapsulated and hidden. But after a quick look, I realized that _nothing_ outside of satellite/satellitedb even needs to use satellitedb.DB at all. It didn't even need to be exported, except for some trivially-replaceable code in migrate_postgres_test.go. And once I made it unexported, any concerns about exposing new methods on it were entirely moot. So I have here changed the exported satellitedb.DB type into the unexported satellitedb.satelliteDB type, and I have changed all the places here that wanted raw dbx.DB handles to use this new type instead. Now they can just take a gander at the implementation member on it and know all they need to know about the underlying database. This will make it possible for some other pending code here to differentiate between postgres and cockroach backends. Change-Id: I27af99f8ae23b50782333da5277b553b34634edc	2019-12-16 19:09:30 +00:00
Jennifer Johnson	8e9532dbd9	satellitedb/repairqueue: use errs.New() instead of fmt.Errorf() to retain stack trace Change-Id: I47f1985aaeace556e3e0f7a20d2718410936db17	2019-12-05 15:52:25 +00:00
Jennifer Johnson	7c5f777a4f	satellitedb/repairqueue: runs a different implementation of the query within Select() for postgres vs cockroach Change-Id: Ie34dbdb9d870d7d9f8f269702b6b3bad0c55b98e	2019-12-04 21:32:02 +00:00
Egon Elbre	ee6c1cac8a	private: rename internal to private (#3573)	2019-11-14 21:46:15 +02:00
Egon Elbre	3c438f31bd	satellite/satellitedb: remove sqlite support (#3296)	2019-10-19 00:27:57 +03:00
Alexander Leitner	159ad439b1	Add count to repair queue (#2661) * Add count to repair queue	2019-07-30 11:21:40 -04:00
Egon Elbre	0cdeae1922	add missing error handling (#2630)	2019-07-25 17:01:44 +02:00
Ivan Fraixedes	3c8f1370d2	[v3 2137] - Add more info to find out repair failures (#2623) * pkg/datarepair/repairer: Track always time for repair Make a minor change in the worker function of the repairer, that when successful, always track the metric time for repair independently if the time since checker queue metric can be tracked. * storage/postgreskv: Wrap error in Get func Wrap the returned error of the Get function as it is done when the query doesn't return any row. * satellite/metainfo: Move debug msg to the right place NewStore function was writing a debug log message when the DB was connected, however it was always writing it out despite if an error happened when getting the connection. * pkg/datarepair/repairer: Wrap error before logging it Wrap the error returned by process which is executed by the Run method of the repairer service to add context to the error log message. * pkg/datarepair/repairer: Make errors more specific in worker Make the error messages of the "worker" method of the Service more specific and the logged message for such errors. * pkg/storage/repair: Improve error reporting Repair In order of improving the error reporting by the pkg/storage/repair.Repair method, several errors of this method and functions/methods which this one relies on have been updated to be wrapped into their corresponding classes. * pkg/storage/segments: Track path param of Repair method Track in monkit the path parameter passed to the Repair method. * satellite/satellitedb: Wrap Error returned by Delete Wrap the error returned by repairQueue.Delete method to enhance the error with a class and stack and the pkg/storage/segments.Repairer.Repair method get a more contextualized error from it.	2019-07-23 16:28:06 +02:00
Maximillian von Briesen	b590e53d64	Order by attempted time in injured segments select (#2533)	2019-07-12 13:35:20 -04:00
Michal Niewrzal	268c629ba8	Replace base64 encoding for path segments (#2345)	2019-07-11 13:26:07 -04:00
JT Olio	29d16b4d68	satellite: add monkit task to missing places (#2108)	2019-06-04 13:55:37 +02:00
Bill Thorp	a6c4019288	using DB time only (#2018) * using DB time only, using UTC	2019-05-21 12:50:55 -04:00
Egon Elbre	f7ed63a119	handle database error checks properly (#1796)	2019-04-23 14:13:57 +03:00
Bill Thorp	17a227e6e9	refactor injuredsegments db so that we can't have duplicates (#1717) made repairqueue not use a true queue, forbid duplicates	2019-04-16 14:14:09 -04:00
Bill Thorp	665fd33e3c	Repair queue isolation level fix (#1466) Implemented custom SQLite and Postgres Repairqueue Dequeue handlers	2019-03-14 17:12:47 -04:00
Jennifer Li Johnson	856b98997c	updates copyright 2018 to 2019 (#1133)	2019-01-24 15:15:10 -05:00
Michal Niewrzal	b712fbcbb0	Fix 'empty queue' error when satellite starts (#939)	2019-01-02 17:00:32 +01:00
Egon Elbre	4346cd060f	Implement mutex around satellitedb (#932)	2018-12-27 11:56:25 +02:00
Cameron	f70b826fd4	repair queue masterDB support (#865) * add injuredsegment model to satellitedb.dbx * add context to queue.RepairQueue interface * use queue.RepairQueue interface, use masterdb	2018-12-21 10:11:19 -05:00

40 Commits