storj

Author	SHA1	Message	Date
Jessica Grebenschikov	0649d2b930	satellite/repair: improve contention for injuredsegments table on CRDB We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but contention on the injuredsegments table was far too high, so we had to roll back to Postgres for the repair queue. A couple of things contributed to this problem: 1) crdb doesn't support `FOR UPDATE SKIP LOCKED` 2) the original crdb Select query was doing 2 full table scans and not using any indexes 3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which runs up to 5 goroutines, all trying to select out of the repair queue, and this was causing a ton of contention. The changes in this PR should help to reduce that contention and improve performance on CRDB. The changes include: 1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB. - Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/). 2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability. - While long term it is very important to `ORDER BY segment_health`, the change here is only supposed to be a temporary band-aid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term.
However, long term this needs to be fixed with a redesign of the repair workers, possibly in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to a RabbitMQ priority queue or something similar. This PR's improved query uses the index to avoid full scans and also locks the row it's going to update, and CRDB retries for us if there are any lock errors. Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23	2020-12-10 09:51:26 -08:00
Michal Niewrzal	c2a97aeb14	satellite/satellitedb: add ListAllBuckets method We need to be able to list all buckets in DB without knowing project ID. This method will be used to list buckets for metainfo loop implementation based on metabase. Change-Id: Iac75af0eee4f31e80a15577575a8249cbca787b2	2020-12-10 14:19:27 +00:00
Stefan Benten	494bd5db81	all: golangci-lint v1.33.0 fixes (#3985 )	2020-12-05 17:01:42 +01:00
Ethan Adams	f90ea10a4a	Allow for DB application names per process. (#3983 )	2020-12-04 11:24:39 +01:00
Moby von Briesen	3fc76f4ffe	satellite/downtime: Remove deprecated downtime tracking service. We are no longer planning on implementing downtime penalization using the method described in docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md. Now, we are implementing the design described in docs/blueprints/storage-node-downtime-tracking-with-audits.md. This change removes the downtime estimation chores from the satellite core as well as the package satellite/downtime. A future change will remove the database table. Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518	2020-12-02 15:16:13 -05:00
JT Olio	1728c3a992	satellite/dbx: standardize on assignment Change-Id: I8f87bc8391e765e4480b0590d92d3601248e1f93	2020-12-01 16:10:18 +00:00
JT Olio	70b91aac54	satellitedb: remove cruft caused by https://review.dev.storj.io/c/storj/storj/+/3223 Change-Id: I198bb2f869cc7177b9ecafdd8932bbf2b58be5b8	2020-12-01 00:16:26 +00:00
Egon Elbre	f456d7ce03	satellite: remove implementation detail from DB interface Which database access and how it internally does migrations is an implementation detail and does not belong in the requirements interface. Change-Id: Ia4a6994f39470063a96a8e5f3a1bd27aa79fe5cd	2020-11-30 13:29:20 +02:00
Egon Elbre	28ea63be92	satellite/repair: avoid TestDBAccess Change-Id: I34adb58cd67fba5917032f2f328d75b1c4afdbbf	2020-11-30 13:29:08 +02:00
JT Olio	71e11b27f3	satellite/dbx: only retry with cockroach Change-Id: Id3630c26dbfda36dcbece2849e2353d5ab2882af	2020-11-29 18:10:07 -07:00
JT Olio	bd23d12bb9	satellite/dbx: add cockroach retries for other QueryContext operations Change-Id: Ia30fbba55c926892702fa96fb9dd01b75347d351	2020-11-29 18:09:56 -07:00
JT Olio	ea2f39ca7f	satellite/dbx: add retries for QueryRowContext-based operations Change-Id: Ie2527b673dd4ce5250cf5c0cbf8f14921262f665	2020-11-29 18:09:46 -07:00
JT Olio	d3b0691bbd	satellite/dbx: import dbx templates these are unchanged from storj.io/dbx. we're importing them because in a later commit we will change them, and it'd be nice to see that diff as a separate commit. Change-Id: I8315130ed6bab397bd65b9a1a90c29d130b8c02d	2020-11-29 18:09:33 -07:00
JT Olio	5d8a67a4f7	satellitedb: retry GetBandwidthSince on cockroach Change-Id: I2bf20f3a19e7f3af97630d8a679410feba70661e	2020-11-29 16:36:15 -07:00
Ethan	5dc013d3bd	satellite/overlay: Add retry to all selects in overlaycache Change-Id: I0356d71a35701f8e0ca04a34b2bb2aea666c1394	2020-11-29 16:46:57 -05:00
JT Olio	6bce907cb0	satellite: try to stream rollups to aggregation function to use less memory this change tries really hard to never have all of the storage node rollups in memory at the same time, up until the rollups are actually getting summed together. Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18	2020-11-29 10:26:32 -07:00
JT Olio	6aae21541f	satellitedb: do saverollup in batches Change-Id: I78278a192cba60541eee2986f54a88d5a479bd3e	2020-11-28 19:26:46 -07:00
JT Olio	0ba516d405	satellite: support pointing db components at different databases the immediate need is to be able to move the repair queue back out of cockroach if we can't save it. Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326	2020-11-28 18:39:16 +00:00
Egon Elbre	55d5e1fd7d	satellite/orders: ensure that expired deletion doesn't stall Add checks to ensure that when somebody uses empty options, the deletion doesn't loop infinitely. Change-Id: I1738fb1e7e1f8efbbb954c491cb6489f7bcdc2db	2020-11-23 14:52:40 +02:00
Ethan	2b92bba563	satellite/satellitedb/orders: Handle serial_numbers deletes in smaller increments on CRDB CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries. Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1	2020-11-20 13:44:52 +00:00
Moby von Briesen	a8b66dce17	satellite/accounting: account for old orders that can be submitted in satellite rollup With the new phase 3 order submission, orders can be added to the storage and bandwidth rollup tables at timestamps before the most recent rollup was run. This change shifts the start time of each new rollup window to account for any unexpired orders that might have been added since the previous rollup. A satellitedb migration is necessary to allow upserts in the accounting_rollups table when entries with identical node_ids and start_times are inserted. Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63	2020-11-18 14:46:00 -05:00
Moby von Briesen	0ec685b173	satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue We plan to add support for a new Reed-Solomon scheme soon, but our repair queue orders segments by least number of healthy pieces first. With a second RS scheme, fewer healthy pieces will not necessarily correlate to lower health. This change just adds the new column in a migration. A separate change will add the new health function. Right now, since we only support one RS scheme, behavior will not change. Number of healthy pieces is being inserted as "segment health" until the new health function is merged. Segment health is calculated with a new priority function created in commit `3e5640359`. In order to use the function, a new config value is added, called NodeFailureRate, representing the approximate probability of any individual node going down in the duration of one checker run. Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2	2020-11-16 21:18:09 +00:00
Jessica Grebenschikov	f558cc825e	satellite/orders: add storagenode_bw_phase2 table and dont delete tallies for longer It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed. This PR makes 2 changes: 1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before). 2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement. These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow. Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60	2020-11-13 17:15:24 +00:00
Cameron Ayer	dc67ce74c9	satellite: remove IsUp field from overlay.UpdateRequest With the new overlay.AuditOutcome type for offline audits, the IsUp field is redundant. If AuditOutcome != AuditOffline, then the node is online. In addition to removing the field itself, other changes needed to be made regarding the relationship between 'uptime' and 'audits'. Previously, uptime and audit outcome were completely separated. For example, it was possible to update a node's stats to give it a successful/failed/unknown audit while simultaneously indicating that the node was offline by setting IsUp to false. This is no longer possible under this changeset. Some tests which did this have been changed slightly in order to pass. Also add new benchmarks for UpdateStats and BatchUpdateStats with different audit outcomes. Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b	2020-11-02 15:34:17 -05:00
Egon Elbre	7183dca6cb	all: fix defers in loop defer should not be called in a loop. Change-Id: Ifa5a25a56402814b974bcdfb0c2fce56df8e7e59	2020-11-02 15:06:38 +02:00
Egon Elbre	11338e9beb	satellite/internalpb: move audithistory.pb Change-Id: I8eee84d49ed90459168ddaf04ae57f790c2a22c4	2020-10-30 15:30:11 +02:00
Egon Elbre	7ce372c686	satellite/internalpb: add inspectors Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a	2020-10-30 13:28:17 +02:00
Egon Elbre	004e610d0f	satellite/internalpb: move datarepair.pb to internal Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d	2020-10-30 13:28:14 +02:00
Egon Elbre	caefde6b32	private/{dbutil,tagsql}: pass ctx to database opening Opening a database usually dials, hence we should pass ctx to it. Change-Id: Iaa2875981570d83e65be3710f841cf30349f807b	2020-10-29 10:51:29 +00:00
Egon Elbre	e3985799a1	storage/{cockroachkv,postgreskv}: add ctx to opening Opening a database usually dials, hence we should pass ctx to it. Change-Id: Iecf41241aaa94d54506cbc80b0e53449848d8819	2020-10-29 10:49:08 +00:00
Egon Elbre	9b2e00a38b	satellite: pass ctx into satellitedb.Open Opening a database requires ctx, this is first step to passing ctx to the appropriate level. Change-Id: Ic303e69f868ef3449ae36377937a29670cf635e2	2020-10-29 06:38:37 +00:00
Cameron Ayer	bb7be23115	satellite/{audit,overlay,satellitedb}: enable reporting offline audits - Remove flag for switching off offline audit reporting. - Change the overlay method used from UpdateUptime to BatchUpdateStats, as this is where the new online scoring is done. - Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same method to record offline audits as success, failure, and unknown, we need to distinguish offline audits from the rest. Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355	2020-10-27 10:44:46 +00:00
Ethan	9a29ec5b3e	Add index to graceful_exit_transfer_queue table This fixes a slow query that was taking up to 4 seconds in production SELECT node_id, path, piece_num, root_piece_id, durability_ratio, queued_at, requested_at, last_failed_at, last_failed_code, failed_count, finished_at, order_limit_send_count FROM graceful_exit_transfer_queue WHERE node_id = '[redacted]' AND finished_at is NULL AND last_failed_at is NULL ORDER BY durability_ratio asc, queued_at asc LIMIT 300 OFFSET 0; Change-Id: Ib89743ca35f1d8d0a1456b20fa08c683ebdc1549	2020-10-26 14:47:48 +00:00
Moby von Briesen	7c3afe164b	satellite/overlay: uncomment dq for offline and disable with feature flag Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27	2020-10-16 12:55:16 +00:00
Yaroslav Vorobiov	139a7ee959	private/migrate: add ability to create dbs during migration Use tagsql.DB pointer as step database, to propagate changes back and forth between actual database and migration. Adds CreateDB operation to the migration step to be able to create new dbs before executing migration action. Adjusts storagenode database migration to use inner tagsql.DB pointer of each database as step.DB. Adjusts satellite database migration, adds proxy migrationDB field to satellite db that wraps itself as tagsql.DB, pointer of which is used as step.DB. Change-Id: Ifed4de5b01a356cf7b37db64d2eaeb7b61982c5c	2020-10-15 15:28:04 +03:00
Stefan Benten	0b43b93259	satellite/satellitedb: make limits per default NULL This change completes the column migration of `5f6fccc6e8` and `2f648fd981`. It resets every users project limits who are below or equal to our current production defaults. Change-Id: Ie041d08bb67b62844f6023190fc00bc2dad5b1cb	2020-10-14 20:28:16 +00:00
Egon Elbre	2268cc1df3	all: fix linter complaints Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95	2020-10-13 15:59:01 +03:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Jeff Wendling	0f0faf0a9f	satellite/orders: do a better job limiting concurrent requests Doing it at the ProcessOrders level was insufficient: the endpoints make multiple database calls. It was a misguided attempt to only have one spot enter the semaphore. By putting it in the endpoint we can not only be sure that the concurrency is correctly limited but it can be configurable easily. Change-Id: I937149dd077adf9eb87fce52a1a17dc0afe96f64	2020-10-09 16:27:15 -04:00
Jeff Wendling	7c303208ff	satellite/satellitedb: emergency temporary order processing semaphore we have thundering herds of order submissions that take all of the database connections causing temporary periodic outages. limit the amount of concurrent order processing to 2. Change-Id: If3f86cdbd21085a4414c2ff17d9ef6d8839a6c2b	2020-10-08 19:16:47 +00:00
Cameron Ayer	b39a99bae6	satellite/{overlay,satellitedb}: always show node's real online score Previously if a node did not have audit history data for each of the windows over the tracking period, we would give them the benefit of the doubt and set their score to 1. This was to prevent nodes from being suspended right out the gate. We need a minimum amount of data to evaluate them. However, a node who is actually failing at being online will have no idea until they have received enough audits and we suspend them. Instead, we will always use their real score, but use a flag to determine whether they are eligible for suspension/dq. Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf	2020-10-02 12:28:11 -04:00
Cameron Ayer	c2525ba2b5	satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration Repair workers prioritize the most unhealthy segments. This has the consequence that when we finally begin to reach the end of the queue, a good portion of the remaining segments are healthy again as their nodes have come back online. This makes it appear that there are more injured segments than there actually are. solution: Any time the checker observes an injured segment it inserts it into the repair queue or updates it if it already exists. Therefore, we can determine which segments are no longer injured if they were not inserted or updated by the last checker iteration. To do this we add a new column to the injured segments table, updated_at, which is set to the current time when a segment is inserted or updated. At the end of the checker iteration, we can delete any items where updated_at < checker start. Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94	2020-09-29 20:38:22 +00:00
Egon Elbre	c23a8e3b81	go.mod: update pgx to v4.9.0 Fix query to use TextArray instead of VarcharArray. Fix queries to use the correct type. Change-Id: Ibb7e55adba277d05778118d81ca697470e72c374	2020-09-29 19:03:08 +00:00
Egon Elbre	2d27bc8787	satellite/satellitedb: separate cockroach for migration tests Currently Cockroach migration test is the most heavy with regards to schema changes. This causes other tests to time out. This adds an alternate cockroach instance that is used for migration tests. Change-Id: I01fe9313527ff002f0bb0914dd52c3645b8eaf6d	2020-09-29 09:31:33 +00:00
Jessica Grebenschikov	4a2c66fa06	satellite/accounting: add cache for getting project storage and bw limits This PR adds the following items: 1) an in-memory read-only cache that stores project limit info for projectIDs. This cache is stored in-memory since this is expected to be a small amount of data. In this implementation we are only storing in the cache projects that have been accessed. Currently for the largest Satellite (eu-west) there are about 4500 total projects. So storing the storage limit (int64) and the bandwidth limit (int64), this would end up being about 200kb (including the 32 byte project ID) if all 4500 projectIDs were in the cache. So this all fits in memory for the time being. At some point it may not as usage grows, but that seems years out. The cache is a read-only cache. When requests come in to upload/download a file, we will read from the cache what the current limits are for that project. If the cache does not contain the projectID, it will get the info from the database (satellitedb project table), then add it to the cache. The only time the values in the cache are modified is when either a) the project ID is not in the cache, or b) the item in the cache has expired (default 10 mins), then the data gets refreshed out of the database. This occurs by default every 10 mins. This means that if we update the usage limits in the database, that change might not show up in the cache for 10 mins, which means it will not be applied to end users uploading/downloading files for that time period. Change-Id: I3fd7056cf963676009834fcbcf9c4a0922ca4a8f	2020-09-25 16:28:49 +00:00
Stefan Benten	38108828ac	satellite/satellitedb: enable multiple projects existing users Change-Id: I2ef77182d5464d72574698c8abfbbfdbda3f5a9e	2020-09-23 18:17:38 +02:00
Stefan Benten	5f6fccc6e8	satellite/satellitedb: makes limits nullable change backwards compatible Our current endpoints bail on us, if the column data is null. Thus we need to take the intermediate step and set the default to a fixed value and reset those with the following release. It sets the default column value to our current config values of 50GB for storage and bandwidth and 100 buckets, while still enabling the field to be nullable. All 0 values are migrated to be the default as well to ensure they can keep using their projects, as with the original change, 0 actually means 0. Change-Id: I797be80ce2d2105091599dc1b3fc76f74336b66b	2020-09-23 17:54:42 +02:00
Stefan Benten	2f648fd981	satellite: make limits be nullable Currently we have no way to actually set one of the following limits to 0 (meaning not usable): - maxBuckets - usageLimit - bandwidthLimit With the field nullable, NULL corresponds to the global default, 0 now actually means 0, and a set value determines a custom limit. Change-Id: I92bb77529dcbd0881ae8368921be9d246eb0919e	2020-09-21 19:34:19 +00:00
Qweder93	8182fdad0b	storagenode: heldamount renamed to payouts, renamed some methods and structs to more meaningful names. grouped estimated payout with payouts satellite: heldamount renamed to SNOpayouts. Change-Id: I244b4d2454e0621f4b8e22d3c0d3e602c0bbcb02	2020-09-16 14:57:35 +00:00
Cameron Ayer	e7c34a053d	satellite/satellitedb: add column and index "updated_at" to injuredsegments Change-Id: I59e9bb2077885f09e17795375fe98ed31bd83d54	2020-09-14 12:53:04 -04:00
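
The project limit cache described in commit 4a2c66fa06 (read from memory, reload from the database on a miss or after a 10 minute expiry) can be sketched roughly as below. This is only an illustration under stated assumptions: `LimitCache`, `Limits`, the injected `now` clock, and the `load` callback are hypothetical names chosen for the example and are not taken from the storj code. For the size estimate in that message, 4500 projects at 32 + 8 + 8 bytes each come to 216,000 bytes, in line with the quoted figure of about 200kb.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Limits holds the two per-project values mentioned in the commit message.
type Limits struct {
	Storage   int64
	Bandwidth int64
}

// entry is a cached value plus the time after which it must be reloaded.
type entry struct {
	limits    Limits
	expiresAt time.Time
}

// LimitCache is a hypothetical read-through cache with per-entry expiry.
type LimitCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time                       // injected clock, for testing
	load    func(projectID string) (Limits, error) // stands in for the database read
	entries map[string]entry
}

func NewLimitCache(ttl time.Duration, now func() time.Time, load func(string) (Limits, error)) *LimitCache {
	return &LimitCache{ttl: ttl, now: now, load: load, entries: make(map[string]entry)}
}

// Get returns cached limits, reloading them only on a miss or after expiry.
func (c *LimitCache) Get(projectID string) (Limits, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[projectID]; ok && c.now().Before(e.expiresAt) {
		return e.limits, nil
	}
	limits, err := c.load(projectID)
	if err != nil {
		return Limits{}, err
	}
	c.entries[projectID] = entry{limits: limits, expiresAt: c.now().Add(c.ttl)}
	return limits, nil
}

// runCacheDemo changes the "database" value between reads to show staleness.
func runCacheDemo() (first, stale, fresh int64, loads int) {
	dbLimit := int64(50)
	clock := time.Unix(0, 0)
	cache := NewLimitCache(10*time.Minute,
		func() time.Time { return clock },
		func(string) (Limits, error) {
			loads++
			return Limits{Storage: dbLimit, Bandwidth: dbLimit}, nil
		})

	a, _ := cache.Get("project-1") // miss: loads 50 from the fake database
	dbLimit = 100                  // limit updated in the fake database
	b, _ := cache.Get("project-1") // still within the TTL: cached 50 is returned
	clock = clock.Add(11 * time.Minute)
	c, _ := cache.Get("project-1") // expired: reloads and sees 100
	return a.Storage, b.Storage, c.Storage, loads
}

func main() {
	fmt.Println(runCacheDemo()) // prints: 50 50 100 2
}
```

The second read shows the trade-off the commit message calls out: a limit changed in the database is not visible until the cached entry expires.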

615 Commits
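
Commit 2b92bba563 limits serial_numbers deletes to 1000 rows at a time so that one large delete does not block other queries. A minimal sketch of that looping pattern follows, assuming a hypothetical `batchDeleter` callback that stands in for a size-limited DELETE statement and reports how many rows it removed. The function names and the in-memory "table" are illustrative only. The guard against a non-positive batch size is an added assumption of this sketch, in the spirit of commit 55d5e1fd7d, which checks that empty options cannot make deletion loop forever.

```go
package main

import "fmt"

// batchDeleter is a hypothetical stand-in for a size-limited DELETE
// statement; it removes at most limit rows and reports how many it removed.
type batchDeleter func(limit int) (int, error)

// deleteInBatches calls del repeatedly until a batch comes back short,
// which signals that no matching rows remain. It returns the total removed.
func deleteInBatches(del batchDeleter, batchSize int) (int, error) {
	if batchSize <= 0 {
		// Without this guard a zero batch size would never make progress.
		return 0, fmt.Errorf("batch size must be positive, got %d", batchSize)
	}
	total := 0
	for {
		n, err := del(batchSize)
		if err != nil {
			return total, err
		}
		total += n
		if n < batchSize {
			return total, nil
		}
	}
}

// runDeleteDemo deletes 2500 fake rows in batches of 1000.
func runDeleteDemo() (total, calls int) {
	rows := 2500 // rows remaining in the fake table
	del := func(limit int) (int, error) {
		calls++
		n := limit
		if rows < n {
			n = rows
		}
		rows -= n
		return n, nil
	}
	total, _ = deleteInBatches(del, 1000)
	return total, calls
}

func main() {
	fmt.Println(runDeleteDemo()) // prints: 2500 3
}
```

Each call corresponds to one short statement, so other queries can run between batches.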