storj

Author	SHA1	Message	Date
Cameron Ayer	a7cda642a5	satellite/repair: add logging for irreparable segments in checker If the checker sees an irreparable segment, log out some info so we can see what the problem is Change-Id: I76eda5270214205f4fefc646d6c391cc13ddcafd	2021-09-02 12:35:29 -04:00
Cameron Ayer	51fdceafef	satellite/repair: increment repair_too_many_nodes_failed with 0 for redash alerting Change-Id: I990c8df7be30493705278b24954262834a1ed81f	2021-08-27 17:42:11 +00:00
Cameron Ayer	26f839a445	satellite/repair/repairer: if not enough nodes for repair order limits, increment metric and log as irreparable segment Change-Id: I4bd46f28d64278c8d463e885ad221aafb6ce7cf3	2021-08-27 13:42:28 +00:00
Cameron Ayer	dc69e1b16e	satellite/repair: use mutex instead of channel to collect download errors Change-Id: I3f958e9cc95126a25f73ccd105e614b51089edc5	2021-08-10 15:29:39 +00:00
Cameron Ayer	a8f125c671	satellite:{audit,repair}: log additional info when we can't download enough pieces When we can't complete an audit or repair, we need more information about what happened during each individual share/piece download. In audit, add the number of offline, unknown, contained, failed nodes to the error log. In repair, combine the errors from each download and add them to the error log. Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba	2021-08-09 22:57:49 +00:00
Clement Sam	1f353f3231	segment/{metabase,repair}: change segment created_at column to not accept nulls This change adds a NOT NULL constraint to the created_at column in the segment table. All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc) Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52	2021-08-06 08:16:28 +00:00
Clement Sam	f06e7c5f60	segment/{metabase,repair}: add dedicated methods on metabase.Pieces This change adds dedicated methods on metabase.Pieces to be able to add, remove pieces and also to check duplicates. Change-Id: I21aaeff40c017c2ebe1cc85a864ae546754769cc	2021-08-03 15:12:03 +00:00
Michał Niewrzał	0d8e7905c1	satellite/repair/checker: don't return error when joining loop An error from joining the loop should not restart the satellite. It will be the same error as for the loop itself. We handle the joining error the same way for other services that use the segment loop. Change-Id: Idf1035ef7f78462927bd23989ed8a4ee5826c49e	2021-08-03 12:56:42 +00:00
Yingrong Zhao	f8914ccce0	satellite/{repair, overlay}: use reputation store in repair Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a	2021-07-28 13:22:05 -04:00
Michał Niewrzał	a883d7f582	satellite/repair/checker: fix `remote_files_checked` metric During the metaloop refactoring we missed the metric for all objects processed by the repair checker. Change-Id: I100f10a36c52e2651923ecaa377261752877d673	2021-07-22 14:48:08 +00:00
Cameron Ayer	449c873681	satellite/repair/repairer: attempt repair GETs using nodes' last IP and port first Sometimes we see timeouts from DNS lookups when trying to do repair GETs. Solution: try using node's last IP and port first. If we can't connect, retry with DNS lookup. Change-Id: I59e223aebb436118779fb18378f6e09d072f12be	2021-07-21 13:13:06 +00:00
Cameron Ayer	373ba8fd27	satellite/repair/repairer: metrics for repair bytes uploaded and downloaded Change-Id: Icb0850692ecc155f6c8169edf1b045b2b546ff48	2021-07-21 09:23:19 +00:00
Michał Niewrzał	b900f6b4f9	satellite/repair/checker: move checker to segment loop Change-Id: I04b25e4fa14c822c9524586c25bde89db2a6cad9	2021-07-01 13:51:56 +00:00
Michał Niewrzał	d53aacc058	satellite/repair: migrate to new repair_queue table We want to use StreamID/Position to identify injured segment. As it is hard to alter existing injuredsegments table we are adding a new table that will replace existing one. Old table will be dropped later. Change-Id: I0d3b06522645013178b6678c19378ebafe485c49	2021-06-30 17:12:24 +02:00
Michał Niewrzał	a93e47514a	satellite: remove irreparabledb This is part of metaloop refactoring. We planned to remove irreparable at some point but there was no time for it. Now, instead of refactoring it for segmentloop, it's just easier to drop it. Later we still need to drop the table with a migration step. Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4	2021-06-17 07:20:15 +00:00
JT Olio	da9ca0c650	testplanet/satellite: reduce the number of places default values need to be configured Satellites set their configuration values to default values using cfgstruct, however, it turns out our tests don't test these values at all! Instead, they have a completely separate definition system that is easy to forget about. As is to be expected, these values have drifted, and it appears in a few cases test planet is testing unreasonable values that we won't see in production, or perhaps worse, features enabled in production were missed and weren't enabled in testplanet. This change makes it so all values are configured the same, systematic way, so it's easy to see when test values are different than dev values or release values, and it's harder to forget to enable features in testplanet. In terms of reviewing, this change should be actually fairly easy to review, considering private/testplanet/satellite.go keeps the current config system and the new one and confirms that they result in identical configurations, so you can be certain that nothing was missed and the config is all correct. You can also check the config lock to see what actual config values changed. Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e	2021-06-01 22:14:17 +00:00
Egon Elbre	10372afbe4	ci: fix lint errors Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf	2021-05-17 13:37:31 +00:00
Cameron Ayer	3ea7aa2c7a	satellite/repair/repairer: log piece hash verification failures Piece hash verification failures during repair download are considered audit failures, but we are not logging these occurrences. Now we log them. Change-Id: If456cebcfda6af7a659be3d1fc74448e681fb653	2021-05-14 15:03:15 +00:00
Egon Elbre	910eec8eee	satellite/metainfo: remove MetabaseDB interface Currently the interface is not useful. When we need to vary the implementation for testing purposes we can introduce a local interface for the service/chore that needs it, rather than using the large api. Unfortunately, this requires adding a cleanup callback for tests, there might be a better solution to this problem. Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682	2021-05-13 13:22:14 +00:00
Egon Elbre	69b149a66f	mod: bump uplink uplink stopped using zap, hence some of the private methods needed to be changed. Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5	2021-05-06 14:48:36 +00:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Michał Niewrzał	7944df20d6	storj: use multipart API Change-Id: I10b401434e3e77468d12ecd225b41689568fd197	2021-04-26 13:15:09 +00:00
Egon Elbre	a2e20c93ae	private/dbutil: use dbutil and tagsql from storj.io/private Initially we duplicated the code to avoid large scale changes to the packages. Now we are past metainfo refactor we can remove the duplication. Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef	2021-04-23 14:36:52 +03:00
Egon Elbre	4c9ed64f75	satellite/metabase/metaloop: move loop under metabase Currently the loop handling is heavily related to the metabase rather than metainfo. metainfo over time has become related to the "public API" for accessing the metabase data. Currently updates monkit.lock, because monkit monitoring does not handle ScopeNamed correctly. Needs a followup change to monitoring check. Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584	2021-04-22 12:58:09 +03:00
Egon Elbre	267506bb20	satellite/metabase: move package one level higher metabase has become a central concept and it's more suitable for it to be directly nested under satellite rather than being part of metainfo. metainfo is going to be the "endpoint" logic for handling requests. Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198	2021-04-21 15:54:22 +03:00
Fadila Khadar	bde367ae73	satellite/gc: check on bloom filter creation date Check that the bloom filter creation date is earlier than the metainfo loop system time used for db scanning. Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4	2021-04-14 16:40:37 +00:00
Michał Niewrzał	a5224e7a6c	satellite/metainfo/metaloop: use segment CreatedAt and RepairedAt Repair checker expects to have information about CreatedAt and RepairedAt fields to calculate segment age metric. Change-Id: I6b41df880d77133be541e14d10d91cc75759b339	2021-04-02 08:46:54 +00:00
Kaloyan Raev	035c393da0	satellite: update tests to pass etag.Reader to multipart.PutObjectPart Change-Id: Ibe99357945ae7a91f5b5d4f87b83d425c9fa84a5	2021-03-29 13:18:11 +00:00
Michał Niewrzał	141444f6d6	satellite/repair/repairer: fix segmentAge metric Change-Id: I146b3163aa1bfab5ee060298e6bf9822ca6820a0	2021-03-29 12:29:47 +00:00
Egon Elbre	86e698f572	pb: use *UnimplementedServer to avoid breaking API changes Change-Id: I99a34eeb37ac4453411f273511710562a519f57a	2021-03-29 12:26:10 +03:00
Egon Elbre	f19ef4afe5	satellite/metainfo/metaloop: move loop to a separate package Change-Id: I94c931a27c1af6062185ec62688624ec02050f11	2021-03-23 15:37:34 +00:00
Michał Niewrzał	27ae0d1f15	satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds a NewRedundancy parameter to the UpdateSegmentPieces method to make that possible. As part of the change, NewPieces are validated against NewRedundancy. Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576	2021-03-22 08:12:56 +00:00
Egon Elbre	4c0ea717eb	satellite/metainfo: remove unneeded dependencies from Loop metainfo.Loop doesn't require buckets nor pointerdb anymore. Also: * fix comments * update full iterator limit to 2500 Change-Id: I6604402868f5c34079197c407f969ac8015e63c5	2021-02-19 15:11:16 +02:00
Egon Elbre	c860b74a37	satellite/repair/checker: allow for multipart objects We have multipart objects so we may get multiple inline segments sequences or no segments at all for objects. Change-Id: Ie46ee777a2db8f18f7154e3443bb9e07ecb170f7	2021-02-18 20:31:49 +02:00
Michał Niewrzał	908a96ae30	Merge remote-tracking branch 'origin/main' into multipart-upload Change-Id: I075aaff42ca3f5dc538356cedfccd5939c75e791	2021-02-11 11:48:23 +01:00
Cameron Ayer	4a797baa73	satellite/repair/repairer: a new set of rs_scheme tagged metrics Change-Id: Ibecd9265da881247eeb85ba185ee8877a7243777	2021-02-09 14:19:22 +00:00
Michał Niewrzał	9a60011774	Merge remote-tracking branch 'origin/main' into multipart-upload Change-Id: Ia90f29be432e207c4125f7f955c912978eabe59a	2021-02-04 09:38:08 +01:00
Kaloyan Raev	8d25c47897	satellite/repair: fix comment in TestRepairExpiredSegment Change-Id: Ib91e81f6ba0a7f65daed157b78f7a1a108984930	2021-02-03 10:09:49 +02:00
Kaloyan Raev	038bd0a4da	satellite/repair/repairer: fix repair for pending objects https://storjlabs.atlassian.net/browse/PG-160 Change-Id: Ice7a0dcfc591bcde85a355cf95fff1eb3411f508	2021-02-02 19:50:10 +02:00
Kaloyan Raev	6f3d0c4ad5	Merge remote-tracking branch 'origin/main' into multipart-upload Conflicts: go.mod go.sum satellite/repair/repair_test.go satellite/repair/repairer/segments.go Change-Id: Ie51a56878bee84ad9f2d31135f984881a882e906	2021-02-02 19:19:04 +02:00
Kaloyan Raev	339d1212cd	satellite/repair: don't remove expired segments from repair queue It's impossible to time this check correctly. The segment may expire just at the time we upload the repaired pieces to new storage nodes. They will reject this as expired and the repair will fail. Also, we penalize storage nodes with audit failure only if they fail piece hash verification, i.e. return incorrect data, but only if they have already deleted the piece. So, it would be best if the repair service does not care about object expiration at all. This is a responsibility of another service. Removing this check will also simplify how we migrate this code correctly to the metabase. Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2	2021-02-02 16:13:01 +00:00
Kaloyan Raev	d0612199f0	Merge remote-tracking branch 'origin/main' into multipart-upload Conflicts: go.mod go.sum satellite/metainfo/config.go satellite/metainfo/metainfo_test.go Change-Id: I95cf3c1d020a7918795b5eec63f36112fdb86749	2021-02-01 14:32:12 +02:00
Cameron Ayer	89e682b4d7	satellite/repair/checker: add 29/80/130-52 to default repair overrides Change-Id: I2e5a7538fdf33f3869fcb65fc88f7abb10faad79	2021-01-28 16:55:16 -05:00
Michał Niewrzał	ec88d21a3c	Merge 'main' branch. Change-Id: I6e8162d1a6caf75e89c9f9c9f9522730aebf83ae	2021-01-11 10:26:58 +01:00
Moby von Briesen	a90d6fcad8	satellite/repair/checker: Use segment health on checker insert Do not insert the number of healthy pieces for segment health anymore. Rather, insert the segment health calculated by our new priority function. Change-Id: Ieee7fb2deee89f4d79ae85bac7f577befa2a0c7f	2021-01-04 11:48:17 -05:00
Michał Niewrzał	ad3e3a38c5	Merge 'main' branch Change-Id: Ia0db1b1f9ef3e0671d3f2208881b0abc3064e200	2021-01-04 12:13:45 +01:00
paul cannon	7246368ca1	satellite/repair: clamp totalNodes to 100 or higher Change-Id: I239418ed3671b1cee30b0b1797dc434244e72448	2020-12-30 10:39:14 -06:00
Ethan Adams	6070018021	satellite/overlay: use AS OF SYSTEM TIME with Cockroach Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads have retry logic so it isn't necessary for the nodes returned to be current.	2020-12-22 21:07:07 +02:00
Kaloyan Raev	bafc6af992	ci: remove workaround for failing tests Change-Id: I3eb673fae6c81bee17d7437cb870d5f5ba6978d5	2020-12-21 18:07:40 +02:00
Kaloyan Raev	4d37d14929	satellite/{metrics,repair}: adjust monitoring to new metainfo loop Change-Id: I87a2145daa5ed49bb2c08d6967baa09c0b14b4c6	2020-12-21 09:05:17 +02:00
Michal Niewrzal	f7a31308db	satellite/repair: enable TestRemoveExpiredSegmentFromQueue test Change adds ability to set `now` time during test for repair. Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f	2020-12-18 10:58:05 +00:00
Michal Niewrzal	2111740236	Merge 'master' branch Change-Id: Ib73af0ff3ce0e9a1547b0b9fc55bf88704f6f394	2020-12-18 09:13:24 +01:00
paul cannon	d3604a5e90	satellite/repair: use survivability model for segment health The chief segment health models we've come up with are the "immediate danger" model and the "survivability" model. The former calculates the chance of a segment becoming lost in the next time period (using the CDF of the binomial distribution to estimate the chance of x nodes failing in that period), while the latter estimates the number of iterations for which a segment can be expected to survive (using the mean of the negative binomial distribution). The immediate danger model was a promising one for comparing segment health across segments with different RS parameters, as it is more precisely what we want to prevent, but it turns out that practically all segments in production have infinite health, as the chance of losing segments with any reasonable estimate of node failure rate is smaller than DBL_EPSILON, the smallest possible difference from 1.0 representable in a float64 (about 1e-16). Leaving aside the wisdom of worrying about the repair of segments that have less than a 1e-16 chance of being lost, we want to be extremely conservative and proactive in our repair efforts, and the health of the segments we have been repairing thus far also evaluates to infinity under the immediate danger model. Thus, we find ourselves reaching for an alternative. Dr. Ben saves the day: the survivability model is a reasonably close approximation of the immediate danger model, and even better, it is far simpler to calculate and yields manageable values for real-world segments. The downside to it is that it requires as input an estimate of the total number of active nodes. This change replaces the segment health calculation to use the survivability model, and reinstates the call to SegmentHealth() where it was reverted. It gets estimates for the total number of active nodes by leveraging the reliability cache. Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc	2020-12-17 21:30:17 +00:00
Michal Niewrzal	70ba4deea9	satellite/repair/checker: adjust irreparable part of repair checker Change-Id: I0732104a97ba18a5359de3966cd692677a0ff790	2020-12-17 14:11:22 +00:00
Kaloyan Raev	9aa61245d0	satellite/audits: migrate to metabase Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2	2020-12-17 14:38:48 +02:00
Michal Niewrzal	2381ca2810	Merge 'master' branch Change-Id: I4a3e45a2a2cdacfd87d16b148cfb4c6671c20b15	2020-12-17 13:17:17 +01:00
Michal Niewrzal	8d3ea9c251	satellite/repair/repairer: implement SegmentRepairer with metabase Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c	2020-12-17 10:47:21 +00:00
Cameron Ayer	8c52bb3a18	satellite/checker: use numHealthy as segment health in repair queue A few weeks ago it was discovered that the segment health function was not working as expected with production values. As a bandaid, we decided to insert the number of healthy pieces into the segment health column. This should have effectively reverted our means of prioritizing repair to the previous implementation. However, it turns out that the bandaid was placed into the code which removes items from the irreparable db and inserts them into the repair queue. This change: insert number of healthy pieces into the repair queue in the method, RemoteSegment Change-Id: Iabfc7984df0a928066b69e9aecb6f615253f1ad2	2020-12-15 17:16:59 -05:00
Cameron Ayer	2ac72eaf16	satellite/repair/checker: add new monkit stats tagged with rs scheme There is a new checker field called statsCollector. This contains a map of stats pointers where the key is a stringified redundancy scheme. stats contains all tagged monkit metrics. These metrics exist under the key name, "tagged_repair_stats", which is tagged with the name of each metric and a corresponding rs scheme. As the metainfo observer works on a segment, it checks statsCollector for a stats corresponding to the segment's redundancy scheme. If one doesn't exist, it is created and chained to the monkit scope. Now we can call Observe, Inc, etc on the fields just like before, and they have tags! durabilityStats has also been renamed to aggregateStats. At the end of the metainfo loop, we insert the aggregateStats totals into the corresponding stats fields for metric reporting. Change-Id: I8aa1918351d246a8ef818b9712ed4cb39d1ea9c6	2020-12-15 14:08:01 +00:00
Michal Niewrzal	934ae32ca4	satellite/repair/checker: fix checker tests Change-Id: I63d3368a07b800fdb10bb93b847eb32927b8c0dc	2020-12-15 10:47:42 +00:00
Michal Niewrzal	57f374af24	Merge 'master' branch Change-Id: Idf6b10ea7ca94e4d232e6a3b6a38ef2e646ba197	2020-12-15 08:26:53 +01:00
Kaloyan Raev	fc85179a19	satellite/metainfo: refactor SegmentLocation.Index to SegmentPosition Change-Id: Ic9403c8126712693326dd83d6ba4f3b84be3e0c7	2020-12-14 13:35:53 +02:00
Jessica Grebenschikov	0649d2b930	satellite/repair: improve contention for injuredsegments table on CRDB We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to rollback to Postgres for the repair queue. A couple things contributed to this problem: 1) crdb doesn't support `FOR UPDATE SKIP LOCKED` 2) the original crdb Select query was doing 2 full table scans and not using any indexes 3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which run up to 5 goroutines which all are trying to select out of the repair queue and this was causing a ton of contention. The changes in this PR should help to reduce that contention and improve performance on CRDB. The changes include: 1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB. - Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/). 2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability. - While long term it is very important to `ORDER BY segment_health`, the change here is only supposed to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possibly in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to rabbit MQ priority queue or something similar. This PR's improved query uses the index to avoid full scans and also locks the row it's going to update, and CRDB retries for us if there are any lock errors. Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23	2020-12-10 09:51:26 -08:00
Stefan Benten	494bd5db81	all: golangci-lint v1.33.0 fixes (#3985)	2020-12-05 17:01:42 +01:00
Egon Elbre	28ea63be92	satellite/repair: avoid TestDBAccess Change-Id: I34adb58cd67fba5917032f2f328d75b1c4afdbbf	2020-11-30 13:29:08 +02:00
JT Olio	0ba516d405	satellite: support pointing db components at different databases the immediate need is to be able to move the repair queue back out of cockroach if we can't save it. Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326	2020-11-28 18:39:16 +00:00
Moby von Briesen	75f0f713a3	satellite/repair/checker/checker.go: Use number of healthy pieces instead of SegmentHealth for injured segments queue. We did not test the SegmentHealth function with actual production values, and it turns out that values such as 52 healthy, 35 minimum result in +Inf segment health - so pretty much all segments put into the repair queue have the same health, which means we effectively aren't sorting by health. This change inserts numHealthy as segment health into the database so the segments are ordered as they were before. We need to refine the SegmentHealth function before we can support multi RS. Change-Id: Ief19bbfee3594c5dfe94ca606bc930f05f85ff74	2020-11-28 12:16:32 -05:00
Moby von Briesen	575f50df84	satellite/repair: Update repair override config to support multiple RS schemes. Rather than having a single repair override value, we will now support repair override values based on a particular segment's RS scheme. The new format for RS override values is "k/o/n-override,k/o/n-override..." Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d	2020-11-23 18:01:15 +00:00
paul cannon	2b59640f18	cmd/satellite: ignore Canceled in exit from repair worker Firstly, this changes the repair functionality to return Canceled errors when a repair is canceled during the Get phase. Previously, because we do not track individual errors per piece, this would just show up as a failure to download enough pieces to repair the segment, which would cause the segment to be added to the IrreparableDB, which is entirely unhelpful. Then, ignore Canceled errors in the return value of the repair worker. Apparently, when the worker returns an error, that makes Cobra exit the program with a nonzero exit code, which causes some piece of our deployment automation to freak out and page people. And when we ask the repair worker to shut down, "canceled" errors are what we _expect_, not an error case. Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435	2020-11-17 21:37:59 +00:00
Moby von Briesen	0ec685b173	satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue We plan to add support for a new Reed-Solomon scheme soon, but our repair queue orders segments by least number of healthy pieces first. With a second RS scheme, fewer healthy pieces will not necessarily correlate to lower health. This change just adds the new column in a migration. A separate change will add the new health function. Right now, since we only support one RS scheme, behavior will not change. Number of healthy pieces is being inserted as "segment health" until the new health function is merged. Segment health is calculated with a new priority function created in commit `3e5640359`. In order to use the function, a new config value is added, called NodeFailureRate, representing the approximate probability of any individual node going down in the duration of one checker run. Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2	2020-11-16 21:18:09 +00:00
paul cannon	3e56403599	satellite/repair: add a repair health function This will be used to rank segments in need of repair for attention by the repair workers. Change-Id: I5b70650cec933696b4c6d73bb7efb97e3efdf24a	2020-11-11 18:48:51 +00:00
Cameron Ayer	da9f1f0611	satellite/repair: add monkit counter for segments below minimum required The current monkit reporting for "remote_segments_lost" is not usable for triggering alerts, as it has reported no data. To allow alerting, two new metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req" will increment by zero on each segment unless it is below the minimum required piece count. The two metrics report what is found by the checker and the repairer respectively. Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7	2020-11-11 12:48:23 +00:00
Moby von Briesen	db6bc6503d	satellite/metainfo: Update metainfo RS config to more easily support multiple RS schemes. Make metainfo.RSConfig a valid pflag config value. This allows us to configure the RSConfig as a string like k/m/o/n-shareSize, which makes having multiple supported RS schemes easier in the future. RS-related config values that are no longer needed have been removed (MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify). Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1	2020-11-09 22:16:13 +00:00
Cameron Ayer	d63b7658e8	satellite/repair: fix lastSeenSegmentKey bug in IrreparableProcess A change was made to use a metabase.SegmentKey (a byte slice alias) as the last seen item to iterate through the irreparable DB in a for loop. However, this SegmentKey was not initialized, thus it was nil. This caused the DB query to return nothing, and healthy segments could not be cleaned out of the irreparable DB. Change-Id: Idb30d6fef6113a30a27158d548f62c7443e65a81	2020-11-09 14:48:15 +00:00
Cameron Ayer	dc67ce74c9	satellite: remove IsUp field from overlay.UpdateRequest With the new overlay.AuditOutcome type for offline audits, the IsUp field is redundant. If AuditOutcome != AuditOffline, then the node is online. In addition to removing the field itself, other changes needed to be made regarding the relationship between 'uptime' and 'audits'. Previously, uptime and audit outcome were completely separated. For example, it was possible to update a node's stats to give it a successful/failed/unknown audit while simultaneously indicating that the node was offline by setting IsUp to false. This is no longer possible under this changeset. Some tests which did this have been changed slightly in order to pass. Also add new benchmarks for UpdateStats and BatchUpdateStats with different audit outcomes. Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b	2020-11-02 15:34:17 -05:00
Egon Elbre	7ce372c686	satellite/internalpb: add inspectors Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a	2020-10-30 13:28:17 +02:00
Egon Elbre	004e610d0f	satellite/internalpb: move datarepair.pb to internal Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d	2020-10-30 13:28:14 +02:00
littleskunk	ed1f6d7973	satellite/config: move repair override from config to default (#3958) Co-authored-by: Igor <38665104+ihaid@users.noreply.github.com>	2020-10-28 17:24:39 +02:00
Kaloyan Raev	92a2be2abd	satellite/metainfo: get away from using pb.Pointer in Metainfo Loop As part of the Metainfo Refactoring, we need to make the Metainfo Loop working with both the current PointerDB and the new Metabase. Thus, the Metainfo Loop should pass to the Observer interface more specific Object and Segment types instead of pb.Pointer. After this change, there are still a couple of use cases that require access to the pb.Pointer (hence we have it as a field in the metainfo.Segment type): 1. Expired Deletion Service 2. Repair Service It would require additional refactoring in these two services before we are able to clean this. Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f	2020-10-27 13:06:47 +00:00
Egon Elbre	2268cc1df3	all: fix linter complaints Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95	2020-10-13 15:59:01 +03:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Cameron Ayer	c2525ba2b5	satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration Repair workers prioritize the most unhealthy segments. This has the consequence that when we finally begin to reach the end of the queue, a good portion of the remaining segments are healthy again as their nodes have come back online. This makes it appear that there are more injured segments than there actually are. solution: Any time the checker observes an injured segment it inserts it into the repair queue or updates it if it already exists. Therefore, we can determine which segments are no longer injured if they were not inserted or updated by the last checker iteration. To do this we add a new column to the injured segments table, updated_at, which is set to the current time when a segment is inserted or updated. At the end of the checker iteration, we can delete any items where updated_at < checker start. Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94	2020-09-29 20:38:22 +00:00
Michal Niewrzal	27a9d14e2a	satellite/repair: use metabase.SegmentKey type in repair package Another change which is a part of refactoring to replace path parameter (string/[]byte) with key parameter (metabase.SegmentKey) Change-Id: I617878442442e5d59bbe5c995f913c3c93c16928	2020-09-08 19:35:20 +00:00
Michal Niewrzal	9202295348	satellite/metainfo: replace ScopedPath with metabase.SegmentLocation Change-Id: I7e89c9e8eaeae58be828a32ad47ed3028501f4c7	2020-09-04 10:06:52 +00:00
Michal Niewrzal	aa47e70f03	satellite/metainfo: use metabase.SegmentKey with metainfo.Service Instead of using string or []byte we will be using dedicated type SegmentKey. Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220	2020-09-03 15:11:32 +00:00
JT Olio	b872fe52a1	satellite/repair: switch to piecestore.UploadReader Change-Id: Ia99ad2cf5422e6ba1d98b32946740f9cadba7b6d	2020-09-01 09:26:54 -06:00
Michal Niewrzal	0604a672c1	satellite/metainfo: use metabase in loop Change-Id: I1bb0c6fe0a762895fde950690b06f7dd9d77e178	2020-09-01 10:06:16 +00:00
Moby von Briesen	5d21e85529	satellite/audit/queue: Separate audit queue into two separate structs. * The audit worker wants to get items from the queue and process them. * The audit chore wants to create new queues and swap them in when the old queue has been processed. This change adds a "Queues" struct which handles the concurrency issues around the worker fetching a queue and the chore swapping a new queue in. It simplifies the logic of the "Queue" struct to its bare bones, so that it behaves like a normal queue with no need to understand the details of swapping and worker/chore interactions. Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b	2020-08-31 20:51:25 +00:00
Egon Elbre	3ca405aa97	satellite/orders: use metabase types as arguments Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef	2020-08-28 15:52:37 +03:00
Moby von Briesen	5dfe27f175	satellite/{repair,overlay}: Use overlay NodeSelectionCache for repair uploads This change removes the overlay function FindStorageNodesForRepair, which skips using the node selection cache and hits the database directly. Otherwise, it is functionally identical to FindStorageNodesForUpload, which checks the node selection cache first. When selecting nodes for PUT_REPAIRs, we now call FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce database load. Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e	2020-08-04 12:50:12 -04:00
Moby von Briesen	76030a8237	satellite/audit/{queue,chore}: Wait for audit queue to be finished before swapping * Do not swap the active audit queue with the pending audit queue until the active audit queue is empty. * Do not begin creating a new pending audit queue until the existing pending audit queue has been swapped to the active queue. Change-Id: I81db5bfa01458edb8cdbe71f5baeebdcb1b94317	2020-07-28 16:56:26 +00:00
Egon Elbre	44f9193404	satellite/orders: make optimal threshold multiplier into an argument It feels weird having a repairer configuration part of order services. Let's have a single source of truth for it. Change-Id: I24f7c897aec80f3293f8af24876cbb6733d85a0b	2020-07-24 16:35:59 +03:00
Cameron Ayer	e14f7a3fb4	satellite/repair: update healthyPieces and unhealthyPieces after CreateGetRepairOrderLimits Inside CreateGetRepairOrderLimits we pass in a list of healthy pieces, but when we query node info from this list we apply the "reliable" filter again. We sometimes end up with nodes which at first were healthy, but then became unhealthy, and thus can be repaired, but we do not update the 'unhealthyPieces' list with these nodes. This causes an error, 'piece to add already exists', as we fail to remove these pieces from the pointer before replacing them with repaired pieces. Change-Id: I6e2445f342ac117ded30351fa7e5e523c9ec26bd	2020-07-23 13:24:46 +00:00
Egon Elbre	d8dcae3075	all: fix error checking Change-Id: Ia0da1bbd6ce695139922f94096c2419281905e32	2020-07-16 19:13:14 +03:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
paul cannon	bbdb351e5e	all: use jackc/pgx in place of lib/pq What: Use the github.com/jackc/pgx postgresql driver in place of github.com/lib/pq. Why: github.com/lib/pq has some problems with error handling and context cancellations (i.e. it might even issue queries or DML statements more than once! see https://github.com/lib/pq/issues/939). The github.com/jackc/pgx library appears not to have these problems, and also appears to be better engineered and implemented (in particular, it doesn't use "exceptions by panic"). It should also give us some performance improvements in some cases, and even more so if we can use it directly instead of going through the database/sql layer. Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044	2020-07-13 15:54:41 +00:00
paul cannon	4997fd55d0	satellite/repair: remove healthy from irreparabledb Change-Id: Ia9d300d0359883f03734d0bdf204d56d6642ce34	2020-06-26 21:26:00 +00:00
Cameron Ayer	3b4b5f45c7	satellite: replace references to Suspended with UnknownAuditSuspended Change-Id: I3d2d00c95954c0546ad077702617895f262926ef	2020-06-23 14:19:22 +00:00
Egon Elbre	410d897840	satellite: fix string(int) conversions Change-Id: I54c6ca8c2dad3c321175f72271b7536cc2a4df09	2020-06-12 06:41:34 +00:00
Moby von Briesen	e7e69f383a	satellite/repair/repairer/ec.go: Add monkit tracing for ec repairer Adds monkit tracing for ecrepairer.downloadAndVerifyPiece and ecrepairer.putPiece so we can get more accurate estimates of node performance during repair. Change-Id: Ic05025bf3c493bb3d6f5d325d090c5b7c9e5465d	2020-05-29 14:00:45 +00:00


246 Commits