storj

Author	SHA1	Message	Date
Yingrong Zhao	35e4a87e60	satellite/repair: ignore expired segments at the beginning of the repair work Since we have changed the repair worker to also mark a node as audit failure if they return a not found error, we should ignore expired segments when possible Change-Id: Ie6a677e1d7b234e93965c736d05950440236653c	2021-10-18 18:15:39 +00:00
Yaroslav Vorobiov	4b79f5ea86	satellite/repair: test if audit scores increases during repair Update repair tests to check if audit score increases for nodes that successfully send pieces during successfull and failed repairs. Change-Id: Ie6abbde6155ab4697d209366c9fa497e731756e9	2021-10-04 19:39:13 +00:00
Yaroslav Vorobiov	469ae72c19	satellite/repair: update audit records during repair Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0	2021-09-24 00:48:13 +00:00
Cameron Ayer	51fdceafef	satellite/repair: increment repair_too_many_nodes_failed with 0 for redash alerting Change-Id: I990c8df7be30493705278b24954262834a1ed81f	2021-08-27 17:42:11 +00:00
Cameron Ayer	26f839a445	satellite/repair/repairer: if not enough nodes for repair order limits, increment metric and log as irreparable segment Change-Id: I4bd46f28d64278c8d463e885ad221aafb6ce7cf3	2021-08-27 13:42:28 +00:00
Cameron Ayer	dc69e1b16e	satellite/repair: use mutex instead of channel to collect download errors Change-Id: I3f958e9cc95126a25f73ccd105e614b51089edc5	2021-08-10 15:29:39 +00:00
Cameron Ayer	a8f125c671	satellite:{audit,repair}: log additional info when we can't download enough pieces When we can't complete an audit or repair, we need more information about what happened during each individual share/piece download. In audit, add the number of offline, unknown, contained, failed nodes to the error log. In repair, combine the errors from each download and add them to the error log. Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba	2021-08-09 22:57:49 +00:00
Clement Sam	1f353f3231	segment/{metabase,repair}: change segment created_at column to not accept nulls This change adds a NOT NULL constraint to the created_at column in the segment table. All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc) Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52	2021-08-06 08:16:28 +00:00
Clement Sam	f06e7c5f60	segment/{metabase,repair}: add dedicated methods on metabase.Pieces This change adds dedicated methods on metabase.Pieces to be able to add, remove pieces and also to check duplicates. Change-Id: I21aaeff40c017c2ebe1cc85a864ae546754769cc	2021-08-03 15:12:03 +00:00
Yingrong Zhao	f8914ccce0	satellite/{repair, overlay}: use reputation store in repair Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a	2021-07-28 13:22:05 -04:00
Cameron Ayer	449c873681	satellite/repair/repairer: attempt repair GETs using nodes' last IP and port first Sometimes we see timeouts from DNS lookups when trying to do repair GETs. Solution: try using node's last IP and port first. If we can't connect, retry with DNS lookup. Change-Id: I59e223aebb436118779fb18378f6e09d072f12be	2021-07-21 13:13:06 +00:00
Cameron Ayer	373ba8fd27	satellite/repair/repairer: metrics for repair bytes uploaded and downloaded Change-Id: Icb0850692ecc155f6c8169edf1b045b2b546ff48	2021-07-21 09:23:19 +00:00
Michał Niewrzał	d53aacc058	satellite/repair: migrate to new repair_queue table We want to use StreamID/Position to identify injured segment. As it is hard to alter existing injuredsegments table we are adding a new table that will replace existing one. Old table will be dropped later. Change-Id: I0d3b06522645013178b6678c19378ebafe485c49	2021-06-30 17:12:24 +02:00
Michał Niewrzał	a93e47514a	satellite: remove irreparabledb This is part of metaloop refactoring. We plan to remove irreparable at some point but there was not time for it. Now instead refatoring it for segmentloop its just easier to drop it. Later we still need to drop table with migration step. Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4	2021-06-17 07:20:15 +00:00
JT Olio	da9ca0c650	testplanet/satellite: reduce the number of places default values need to be configured Satellites set their configuration values to default values using cfgstruct, however, it turns out our tests don't test these values at all! Instead, they have a completely separate definition system that is easy to forget about. As is to be expected, these values have drifted, and it appears in a few cases test planet is testing unreasonable values that we won't see in production, or perhaps worse, features enabled in production were missed and weren't enabled in testplanet. This change makes it so all values are configured the same, systematic way, so it's easy to see when test values are different than dev values or release values, and it's less hard to forget to enable features in testplanet. In terms of reviewing, this change should be actually fairly easy to review, considering private/testplanet/satellite.go keeps the current config system and the new one and confirms that they result in identical configurations, so you can be certain that nothing was missed and the config is all correct. You can also check the config lock to see what actual config values changed. Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e	2021-06-01 22:14:17 +00:00
Egon Elbre	10372afbe4	ci: fix lint errors Change-Id: Ib5893440807811f77175ccd347aa3f8ca9cccbdf	2021-05-17 13:37:31 +00:00
Cameron Ayer	3ea7aa2c7a	satellite/repair/repairer: log piece hash verification failures Piece hash verification failures during repair download are considered audit failures, but we are not logging these occurrences. Now we log them. Change-Id: If456cebcfda6af7a659be3d1fc74448e681fb653	2021-05-14 15:03:15 +00:00
Egon Elbre	910eec8eee	satellite/metainfo: remove MetabaseDB interface Currently the interface is not useful. When we need to vary the implementation for testing purposes we can introduce a local interface for the service/chore that needs it, rather than using the large api. Unfortunately, this requires adding a cleanup callback for tests, there might be a better solution to this problem. Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682	2021-05-13 13:22:14 +00:00
Egon Elbre	69b149a66f	mod: bump uplink uplink stopped using zap, hence some of the private methods needed to be changed. Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5	2021-05-06 14:48:36 +00:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Egon Elbre	267506bb20	satellite/metabase: move package one level higher metabase has become a central concept and it's more suitable for it to be directly nested under satellite rather than being part of metainfo. metainfo is going to be the "endpoint" logic for handling requests. Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198	2021-04-21 15:54:22 +03:00
Michał Niewrzał	141444f6d6	satellite/repair/repairer: fix segmentAge metric Change-Id: I146b3163aa1bfab5ee060298e6bf9822ca6820a0	2021-03-29 12:29:47 +00:00
Michał Niewrzał	27ae0d1f15	satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds add NewRedundancy parameter for UpdateSegmentPieces method to give ability to do that. As a part of change NewPieces are validated against NewRedundancy. Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576	2021-03-22 08:12:56 +00:00
Michał Niewrzał	908a96ae30	Merge remote-tracking branch 'origin/main' into multipart-upload Change-Id: I075aaff42ca3f5dc538356cedfccd5939c75e791	2021-02-11 11:48:23 +01:00
Cameron Ayer	4a797baa73	satellite/repair/repairer: a new set of rs_scheme tagged metrics Change-Id: Ibecd9265da881247eeb85ba185ee8877a7243777	2021-02-09 14:19:22 +00:00
Kaloyan Raev	038bd0a4da	satellite/repair/repairer: fix repair for pending objects https://storjlabs.atlassian.net/browse/PG-160 Change-Id: Ice7a0dcfc591bcde85a355cf95fff1eb3411f508	2021-02-02 19:50:10 +02:00
Kaloyan Raev	6f3d0c4ad5	Merge remote-tracking branch 'origin/main' into multipart-upload Conflicts: go.mod go.sum satellite/repair/repair_test.go satellite/repair/repairer/segments.go Change-Id: Ie51a56878bee84ad9f2d31135f984881a882e906	2021-02-02 19:19:04 +02:00
Kaloyan Raev	339d1212cd	satellite/repair: don't remove expired segments from repair queue It's impossible to time correctly this check. The segment may expire just at the time we upload the repaired pieces to new storage nodes. They will reject this as expired and the repair will fail. Also, we penalize storage nodes with audit failure only if they fail piece hash verification, i.e. return incorrect data, but only if they have already deleted the piece. So, it would be best if the repair service does not care about object expiration at all. This is a responsibility of another service. Removing this check will also simplify how we migrate this code correctly to the metabase. Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2	2021-02-02 16:13:01 +00:00
Kaloyan Raev	bafc6af992	ci: remove workaround for failing tests Change-Id: I3eb673fae6c81bee17d7437cb870d5f5ba6978d5	2020-12-21 18:07:40 +02:00
Michal Niewrzal	f7a31308db	satellite/repair: enable TestRemoveExpiredSegmentFromQueue test Change adds ability to set `now` time during test for repair. Change-Id: Idb8826b7b58b8789b0abc65817b888ecdc752a3f	2020-12-18 10:58:05 +00:00
Michal Niewrzal	8d3ea9c251	satellite/repair/repairer: implement SegmentRepairer with metabase Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c	2020-12-17 10:47:21 +00:00
Stefan Benten	494bd5db81	all: golangci-lint v1.33.0 fixes (#3985 )	2020-12-05 17:01:42 +01:00
Moby von Briesen	575f50df84	satellite/repair: Update repair override config to support multiple RS schemes. Rather than having a single repair override value, we will now support repair override values based on a particular segment's RS scheme. The new format for RS override values is "k/o/n-override,k/o/n-override..." Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d	2020-11-23 18:01:15 +00:00
paul cannon	2b59640f18	cmd/satellite: ignore Canceled in exit from repair worker Firstly, this changes the repair functionality to return Canceled errors when a repair is canceled during the Get phase. Previously, because we do not track individual errors per piece, this would just show up as a failure to download enough pieces to repair the segment, which would cause the segment to be added to the IrreparableDB, which is entirely unhelpful. Then, ignore Canceled errors in the return value of the repair worker. Apparently, when the worker returns an error, that makes Cobra exit the program with a nonzero exit code, which causes some piece of our deployment automation to freak out and page people. And when we ask the repair worker to shut down, "canceled" errors are what we _expect_, not an error case. Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435	2020-11-17 21:37:59 +00:00
Cameron Ayer	da9f1f0611	satellite/repair: add monkit counter for segments below minimum required The current monkit reporting for "remote_segments_lost" is not usable for triggering alerts, as it has reported no data. To allow alerting, two new metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req" will increment by zero on each segment unless it is below the minimum required piece count. The two metrics report what is found by the checker and the repairer respectively. Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7	2020-11-11 12:48:23 +00:00
Cameron Ayer	dc67ce74c9	satellite: remove IsUp field from overlay.UpdateRequest With the new overlay.AuditOutcome type for offline audits, the IsUp field is redundant. If AuditOutcome != AuditOffline, then the node is online. In addition to removing the field itself, other changes needed to be made regarding the relationship between 'uptime' and 'audits'. Previously, uptime and audit outcome were completely separated. For example, it was possible to update a node's stats to give it a successful/failed/unknown audit while simultaneously indicating that the node was offline by setting IsUp to false. This is no longer possible under this changeset. Some test which did this have been changed slightly in order to pass. Also add new benchmarks for UpdateStats and BatchUpdateStats with different audit outcomes. Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b	2020-11-02 15:34:17 -05:00
Egon Elbre	7ce372c686	satellite/internalpb: add inspectors Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a	2020-10-30 13:28:17 +02:00
Egon Elbre	004e610d0f	satellite/internalpb: move datarepair.pb to internal Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d	2020-10-30 13:28:14 +02:00
Egon Elbre	2268cc1df3	all: fix linter complaints Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95	2020-10-13 15:59:01 +03:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Michal Niewrzal	aa47e70f03	satellite/metainfo: use metabase.SegmentKey with metainfo.Service Instead of using string or []byte we will be using dedicated type SegmentKey. Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220	2020-09-03 15:11:32 +00:00
JT Olio	b872fe52a1	satellite/repair: switch to piecestore.UploadReader Change-Id: Ia99ad2cf5422e6ba1d98b32946740f9cadba7b6d	2020-09-01 09:26:54 -06:00
Egon Elbre	3ca405aa97	satellite/orders: use metabase types as arguments Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef	2020-08-28 15:52:37 +03:00
Moby von Briesen	5dfe27f175	satellite/{repair,overlay}: Use overlay NodeSelectionCache for repair uploads This change removes the overlay function FindStorageNodesForRepair, which skips using the node selection cache and hits the database directly. Otherwise, it is functionally identical to FindStorageNodesForUpload, which checks the node selection cache first. When selecting nodes for PUT_REPAIRs, we now call FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce database load. Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e	2020-08-04 12:50:12 -04:00
Egon Elbre	44f9193404	satellite/orders: make optimal threshold multiplier into an argument It feels weird having a repairer configuration part of order services. Let's have a single source of truth for it. Change-Id: I24f7c897aec80f3293f8af24876cbb6733d85a0b	2020-07-24 16:35:59 +03:00
Cameron Ayer	e14f7a3fb4	satellite/repair: update healthyPieces and unhealthyPieces after CreateGetRepairOrderLimits Inside CreateGetRepairOrderLimits we pass in a list of healthy pieces, but when we query node info from this list we apply the "reliable" filter again. We sometimes end up with nodes which at first were healthy, but then became unhealthy, and thus can be repaired, but we do not update the 'unhealthyPieces' list with these nodes. This causes an error, 'piece to add already exists', as we fail to remove these pieces from the pointer before replacing them with repaired pieces. Change-Id: I6e2445f342ac117ded30351fa7e5e523c9ec26bd	2020-07-23 13:24:46 +00:00
Egon Elbre	d8dcae3075	all: fix error checking Change-Id: Ia0da1bbd6ce695139922f94096c2419281905e32	2020-07-16 19:13:14 +03:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
Moby von Briesen	e7e69f383a	satellite/repair/repairer/ec.go: Add monkit tracing for ec repairer Adds monkit tracing for ecrepairer.downloadAndVerifyPiece and ecrepairer.putPiece so we can get more accurate estimates of node performance during repair. Change-Id: Ic05025bf3c493bb3d6f5d325d090c5b7c9e5465d	2020-05-29 14:00:45 +00:00
Moby von Briesen	acf8b72cd0	satellite/repair/repairer: cut off long tail when minimum number of required uploads is met This will speed up the Put step of repair by not waiting to time out for a handful of slow nodes, at the expense of a slightly less durable pointer. It will still repair to the optimal threshold, but not every node that is selected will end up in the pointer. Change-Id: I02a0658e3fe6fc0383f26af0f50a065b8b11a651	2020-05-28 16:25:28 -04:00

1 2

95 Commits