storj

Author	SHA1	Message	Date
Erik van Velzen	db1cc8ca95	satellite/repair/checker: buffer repair queue Integrate previous changes. Speed up the segment loop by batch inserting into repair queue. Change-Id: Ib9f4962d91960d21bad298f7771345b0dd270276	2022-05-12 16:28:05 +00:00
Erik van Velzen	928375a67c	satellite/repair/queue: buffer batch insert Implement a buffer for inserting repair items into the queue in a batch. Part of https://github.com/storj/storj/issues/4727 Change-Id: I718472b2f2b1f4993c3d6f15c44923776407155a	2022-05-11 09:02:20 +00:00
Erik van Velzen	26f495f717	satellite/repair: implementation of batch insert Part of https://github.com/storj/storj/issues/4727 Change-Id: I44990a7614af26f8ee0be9c7aed496a1dd9e5df7	2022-05-09 12:41:22 +00:00
Erik van Velzen	10d71a8a3c	satellite/satellitedb: outline for batch insert Part of https://github.com/storj/storj/issues/4727 Change-Id: I1a9ad3b009f363e37f5e68e810074eecb7448db3	2022-05-09 11:39:52 +00:00
Michał Niewrzał	d53aacc058	satellite/repair: migrate to new repair_queue table We want to use StreamID/Position to identify injured segment. As it is hard to alter existing injuredsegments table we are adding a new table that will replace existing one. Old table will be dropped later. Change-Id: I0d3b06522645013178b6678c19378ebafe485c49	2021-06-30 17:12:24 +02:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Egon Elbre	a2e20c93ae	private/dbutil: use dbutil and tagsql from storj.io/private Initially we duplicated the code to avoid large scale changes to the packages. Now we are past metainfo refactor we can remove the duplication. Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef	2021-04-23 14:36:52 +03:00
Jessica Grebenschikov	0649d2b930	satellite/repair: improve contention for injuredsegments table on CRDB We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to roll back to Postgres for the repair queue. A couple things contributed to this problem: 1) crdb doesn't support `FOR UPDATE SKIP LOCKED` 2) the original crdb Select query was doing 2 full table scans and not using any indexes 3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which runs up to 5 goroutines that are all trying to select out of the repair queue, and this was causing a ton of contention. The changes in this PR should help to reduce that contention and improve performance on CRDB. The changes include: 1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB. - Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/). 2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability. - While long term it is very important to `ORDER BY segment_health`, the change here is only supposed to be a temporary band-aid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possibly in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to a RabbitMQ priority queue or something similar. This PR's improved query uses the index to avoid full scans and also locks the row it is going to update, and CRDB retries for us if there are any lock errors. Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23	2020-12-10 09:51:26 -08:00
Egon Elbre	28ea63be92	satellite/repair: avoid TestDBAccess Change-Id: I34adb58cd67fba5917032f2f328d75b1c4afdbbf	2020-11-30 13:29:08 +02:00
JT Olio	0ba516d405	satellite: support pointing db components at different databases the immediate need is to be able to move the repair queue back out of cockroach if we can't save it. Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326	2020-11-28 18:39:16 +00:00
Moby von Briesen	0ec685b173	satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue We plan to add support for a new Reed-Solomon scheme soon, but our repair queue orders segments by least number of healthy pieces first. With a second RS scheme, fewer healthy pieces will not necessarily correlate to lower health. This change just adds the new column in a migration. A separate change will add the new health function. Right now, since we only support one RS scheme, behavior will not change. Number of healthy pieces is being inserted as "segment health" until the new health function is merged. Segment health is calculated with a new priority function created in commit `3e5640359`. In order to use the function, a new config value is added, called NodeFailureRate, representing the approximate probability of any individual node going down in the duration of one checker run. Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2	2020-11-16 21:18:09 +00:00
Egon Elbre	004e610d0f	satellite/internalpb: move datarepair.pb to internal Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d	2020-10-30 13:28:14 +02:00
Cameron Ayer	c2525ba2b5	satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration Repair workers prioritize the most unhealthy segments. This has the consequence that when we finally begin to reach the end of the queue, a good portion of the remaining segments are healthy again as their nodes have come back online. This makes it appear that there are more injured segments than there actually are. solution: Any time the checker observes an injured segment it inserts it into the repair queue or updates it if it already exists. Therefore, we can determine which segments are no longer injured if they were not inserted or updated by the last checker iteration. To do this we add a new column to the injured segments table, updated_at, which is set to the current time when a segment is inserted or updated. At the end of the checker iteration, we can delete any items where updated_at < checker start. Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94	2020-09-29 20:38:22 +00:00
Egon Elbre	410d897840	satellite: fix string(int) conversions Change-Id: I54c6ca8c2dad3c321175f72271b7536cc2a4df09	2020-06-12 06:41:34 +00:00
Moby von Briesen	290c006a10	satellite/repair/{checker,queue}: add metric for new segments added to repair queue * add monkit stat new_remote_segments_needing_repair, which reports the number of new unhealthy segments in the repair queue since the previous checker iteration Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3	2020-05-27 06:23:47 +00:00
Bill Thorp	94c11c5212	satellite: remove some unnecessary UTC() calls Fixes some easy cases of extraneous UTC() calls Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a	2020-03-13 13:49:44 +00:00
Bill Thorp	e99e675fb1	satellite/satellitedb: use time zones with all timestamps The migration was broken into one migration per table to reduce table locking and reduce the chances of failure due to SQL timeouts. Of the 14 fields that lacked time zones, only the 3 named 'interval_start' seemed to have non-UTC data in them. These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE'). Fields with good data are migrated by adding AT TIME ZONE 'UTC'. Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used. https://storjlabs.atlassian.net/browse/SM-104 Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce	2020-03-05 21:11:25 +00:00
Moby von Briesen	4e5a7f13c7	satellite/repair/queue: Prioritize selection of items off repair queue by segment health Add a column to the repair queue table in the satellite db for healthy piece count. When an item is selected from the repair queue, the least durable segment that has not been attempted in the past hour should be selected first. This prevents our repairer from getting stuck doing work on segments that are close to the repair threshold while allowing segments that are more unhealthy to degrade further. The migration also clears the repair queue so that the migration runs quickly and we can properly account for segment health in future repair work. We do not select items off the repair queue that have been attempted in the past six hours. This was changed from one hour to allow us time to try a wider variety of segments when the repair queue is very large. Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b	2020-02-26 09:54:16 -05:00
Egon Elbre	f3b4bf2b7c	satellite/satellitedb/satellitedbtest: pass ctx as an argument ctx is created in most tests, instead pass in as argument to reduce code duplication. Change-Id: I466c51c008392001129c8b007c9d6b3619935ac4	2020-01-20 16:35:42 +02:00
Jeff Wendling	9da16b1d9e	satellite/satellitedb/dbx: name the package dbx everyone was importing it as dbx anyway. why should it be named satellitedb? so yeah just pass the "-p dbx" flag. Change-Id: I5efa669f4f00f196b38a9acd0d402009475a936f	2020-01-15 15:16:39 -07:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
Cameron Ayer	a4f9865b47	satellite: adds and enables cockroachdb compatibility for tests Change-Id: I85a3ad8c3b9d7e15ea8675b6c55af0002933db57	2019-12-16 22:29:25 +00:00
Jennifer Johnson	ecb960f506	private/dbutil: distinguishes between db drivers and implementations to allow for different implementations of SQL queries. Change-Id: I2dc8d1d371139aa8bc805e92a2b80b71f580fd64	2019-12-04 18:31:26 +00:00
Egon Elbre	ee6c1cac8a	private: rename internal to private (#3573)	2019-11-14 21:46:15 +02:00
Egon Elbre	3c438f31bd	satellite/satellitedb: remove sqlite support (#3296)	2019-10-19 00:27:57 +03:00
Egon Elbre	a801fab66a	all: add archview annotations (#2964)	2019-09-10 16:24:16 +03:00
Egon Elbre	6ff94caf22	satellite/satellitedb: move tests near the interface (#2863)	2019-08-26 13:19:02 +03:00
Alexander Leitner	159ad439b1	Add count to repair queue (#2661)	2019-07-30 11:21:40 -04:00
Egon Elbre	5d0816430f	rename all the things (#2531) * rename pkg/linksharing to linksharing * rename pkg/httpserver to linksharing/httpserver * rename pkg/eestream to uplink/eestream * rename pkg/stream to uplink/stream * rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo * rename pkg/auth/signing to pkg/signing * rename pkg/storage to uplink/storage * rename pkg/accounting to satellite/accounting * rename pkg/audit to satellite/audit * rename pkg/certdb to satellite/certdb * rename pkg/discovery to satellite/discovery * rename pkg/overlay to satellite/overlay * rename pkg/datarepair to satellite/repair	2019-07-28 08:55:36 +03:00

29 Commits