storj

Author	SHA1	Message	Date
paul cannon	1b8bd6c082	satellite/repair: unify repair logic The repair checker and repair worker both need to determine which pieces are healthy, which are retrievable, and which should be replaced, but they have been doing it in different ways in different code, which has been the cause of bugs. The same term could have very similar but subtly different meanings between the two, causing much confusion. With this change, the piece- and node-classification logic is consolidated into one place within the satellite/repair package, so that both subsystems can use it. This ought to make decision-making code more concise and more readable. The consolidated classification logic has been expanded to create more sets, so that the decision-making code does not need to do as much precalculation. It should now be clearer in comments and code that a piece can belong to multiple sets arbitrarily (except where the definition of the sets makes this logically impossible), and what the precise meaning of each set is. These sets include Missing, Suspended, Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair, UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy. Some other side effects of this change: * CreatePutRepairOrderLimits no longer needs to special-case excluded countries; it can just create as many order limits as requested (by way of len(newNodes)). * The repair checker will now queue a segment for repair when there are any pieces out of placement. The code calls this "forcing a repair". * The checker.ReliabilityCache is now accessed by way of a GetNodes() function similar to the one on the overlay. The classification methods like MissingPieces(), OutOfPlacementPieces(), and PiecesNodesLastNetsInOrder() are removed in favor of the classification logic in satellite/repair/classification.go. This means the reliability cache no longer needs access to the placement rules or excluded countries list. Change-Id: I105109fb94ee126952f07d747c6e11131164fadb	2023-09-25 09:42:08 -05:00
paul cannon	355ea2133b	satellite/audit: remove pieces when audits fail When pieces fail an audit (hard fail, meaning the node acknowledged it did not have the piece or the piece was corrupted), we will now remove those pieces from the segment. Previously, we did not do this, and some node operators were seeing the same missing piece audited over and over again and losing reputation every time. This change will include both verification and reverification audits. It will also apply to pieces found to be bad during repair, if repair-to-reputation reporting is enabled. Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7	2023-06-22 14:19:00 +00:00
JT Olio	36038af3d1	go.mod: bump storj.io/common Change-Id: I18d4d491cdca99f6694a20f3ead5fba3f75bb658	2023-06-02 11:23:02 -04:00
paul cannon	c2710cc78d	satellite/audit: improve error handling * Don't use rpcstatus.Unknown as an indicator of dial failure; instead, GetShare now indicates with a per-share field where a failure happened (DialFailure, RequestFailure, NoFailure). Use that information in Verify() to determine how to treat the source node. * Add a test that replaces a storage node with a black hole, so that connections there will always time out. Make sure we handle that case correctly. Refs: https://github.com/storj/storj/issues/5632 Change-Id: I513a53520fb48b7187d4c4d7e14e01e2cfc0a721	2023-05-11 22:55:26 +00:00
Michal Niewrzal	36e046375c	satellite/repair/checker: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I8583549973cd36aa0e0c482c20d7a75cb7568ab3	2023-05-08 12:19:13 +00:00
Michal Niewrzal	1aa24b9f0d	satellite/audit: remove segments loop parts We are switching completely to ranged loop. https://github.com/storj/storj/issues/5368 Change-Id: I9cec0ac454f40f19d52c078a8b1870c4d192bd7a	2023-04-24 15:52:11 +00:00
Egon Elbre	8fba740332	storagenode/blobstore/testblobs: don't error checks in BadDB We automatically start a chore to check whether the blobstore is writeable and readable, however, we don't want to fail the tests due to that reason. Usually we want to test some other failure. There probably should be some nicer way to achieve this, but this is an easier fix. Change-Id: I77ada75329f88d3ea52edd2022e811e337c5255a	2023-04-11 16:14:31 +03:00
Egon Elbre	f5020de57c	storagenode/blobstore: move blob store logic The blobstore implementation is entirely related to storagenode, so the rightful place is together with the storagenode implementation. Fixes https://github.com/storj/storj/issues/5754 Change-Id: Ie6637b0262cf37af6c3e558556c7604d9dc3613d	2023-04-05 18:06:20 +00:00
paul cannon	ab2e793555	satellite/audit: test delay before Reverify We are supposed to wait for some amount of time after a timed-out audit before retrying the audit on the contained node. We are also supposed to wait for some amount of time before subsequent retries, if they are necessary. The test added here tries to assure that those delays happen, as far as it is possible to assure that a delay will happen in computer code. The previous behavior of the system was, in fact, to carry out Reverifies as soon as a worker could retrieve the job from the reverification queue. That's not a very major problem, as subsequent retries do have a delay and the node does get several retries. Still, it was not ideal, and this test exposed that mismatch with expectations, so this commit includes a minor change to effect that pause between verify and the first reverify. Refs: https://github.com/storj/storj/issues/5499 Change-Id: I83bb79c166a458ba59a2db2d17c85eca43ca90f0	2023-02-15 23:16:23 +00:00
paul cannon	362d73f783	satellite/audit: add tests for concurrent audits This commit introduces tests that perform multiple concurrent audits against the same storage node, to make sure that doing so does not create incorrect outcomes. Refs: https://github.com/storj/storj/issues/5495 Change-Id: Iaae49e042306bfa59bdf04c1a1540667488e51e5	2023-02-09 19:58:50 +00:00
Andrew Harding	590d44301c	satellite/audit: implement rangedloop observer This change implements the ranged loop observer to replace the audit chore that builds the audit queue. The strategy employed by this change is to use a collector for each segment range to build separate per-node segment reservoirs that are then merge them during the join step. In previous observer migrations, there were only a handful of tests so the strategy was to duplicate them. In this package, there are dozens of tests that utilize the chore. To reduce code churn and maintenance burden until the chore is removed, this change introduces a helper that runs tests under both the chore and observer, providing a pair of functions that can be used to pause or run the queueing function. https://github.com/storj/storj/issues/5232 Change-Id: I8bb4b4e55cf98b1aac9f26307e3a9a355cb3f506	2023-01-03 08:52:01 -07:00
paul cannon	fc905a15f7	satellite/audit: newContainment->containment Now that all the reverification changes have been made and the old code is out of the way, this commit renames the new things back to the old names. Mostly, this involves renaming "newContainment" to "containment" or "NewContainment" to "Containment", but there are a few other renames that have been promised and are carried out here. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I34e2b857ea338acbb8421cdac18b17f2974f233c	2022-12-16 17:59:52 +00:00
paul cannon	0342ca1aa6	satellite/audit: delete now-unused code Now that we are doing scalable piecewise reverifications, the code for handling the old way of doing things (containment, pending audits, reporting, testing) can now be removed. Refs: https://github.com/storj/storj/issues/5230 Change-Id: Ief1a75f423eff682e8f3d57804e343b3409a6631	2022-12-16 14:53:39 +00:00
paul cannon	a66503b444	satellite/audit: Begin using piecewise reverifications This commit pulls the big switch! We have been setting up piecewise reverifications (the workers for which can be scaled independently of the core) for several commits now, and this commit actually begins making use of them. The core of this commit is fairly small, but it requires changing the semantics in all the tests that relate to reverifications, so it ends up being a large change. The changes to the tests are mostly mechanical and repetitive, though, so reviewers needn't worry much. Refs: https://github.com/storj/storj/issues/5230 Change-Id: Ibb421cc021664fd6e0096ffdf5b402a69b2d6f18	2022-12-16 14:21:13 +00:00
paul cannon	47b9134f76	satellite/audit: add IdentifyContainedNodes This method on the Verifier allows the caller to find, out of the nodes holding pieces in a given segment, which ones are contained. This method is not yet being used. It will be in a future commit. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I242cd999913ca4dabbe8a62767ed4869b31fca04	2022-12-13 20:46:43 +00:00
paul cannon	378b8915c4	satellite/{satellitedb,audit}: add NewContainment NewContainment will replace Containment later in this commit chain, but for now it is not yet being used. NewContainment will allow a node to be contained for multiple pending reverify jobs at a time. It is implemented by way of the reverify queue. Refs: https://github.com/storj/storj/issues/5231 Change-Id: I126eda0b3dfc4710a88fe4a5f41780618ec19101	2022-12-07 18:03:37 +00:00
paul cannon	8b494f3740	satellite/audit: use db for auditor queue As part of the effort of splitting out the auditor workers to their own process, we are transitioning the communication between the auditor chore and the verification workers to a queue implemented in the database, rather than the sequence of in-memory queues we used to use. This logical database is safely partitionable from the rest of satelliteDB. Refs: https://github.com/storj/storj/issues/5251 Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc	2022-11-22 14:04:00 +00:00
Egon Elbre	8b70f969b6	all: fix nolint directives Change-Id: I261c8b12e4961e6401cc4024fa5abc35b1a5efa6	2022-10-11 18:31:20 +00:00
paul cannon	2f20bbf4d8	satellite/reputation: add a reputation write cache This should lower the amount of database load coming from reputation updates. Change-Id: Iaacfb81480075261da77c5cc93e08b24f69f8949	2022-07-14 21:40:16 +00:00
Egon Elbre	b8006e192b	satellite/audit: use a larger delay in the test The MinDownloadTimeout 950ms and delay of 1s were quiet close, possibly causing flaky behavior in TestVerifierSlowDownload. Change-Id: I4f6c1554a118b21427357642abe39986fd0af38d	2022-06-28 17:03:23 +03:00
Fadila Khadar	29fd36a20e	satellite/repairer: handle excluded countries For nodes in excluded areas, we don't necessarily want to remove them from the pointer, but we do want to increase the number of pieces in the segment in case those excluded area nodes go down. To do that, we increase the number of pieces repaired by the number of pieces in excluded areas. Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a	2022-03-14 10:59:36 -04:00
Mya	05a17ef42d	deps: upgrade storj.io/common In addition to upgrading the storj.io/common library, this change moves off the TCPConnector in favor of the HybridConnector per the deprecation warning. Change-Id: I7e7e1e7568e8b95e4a99ad9caa158a799e68e1e3	2022-02-16 18:59:19 +00:00
Yingrong Zhao	1f8f7ebf06	satellite/{audit, reputation}: fix potential nodes reputation status inconsistency The original design had a flaw which can potentially cause discrepancy for nodes reputation status between reputations table and nodes table. In the event of a failure(network issue, db failure, satellite failure, etc.) happens between update to reputations table and update to nodes table, data can be out of sync. This PR tries to fix above issue by passing through node's reputation from the beginning of an audit/repair(this data is from nodes table) to the next update in reputation service. If the updated reputation status from the service is different from the existing node status, the service will try to update nodes table. In the case of a failure, the service will be able to try update nodes table again since it can see the discrepancy of the data. This will allow both tables to be in-sync eventually. Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122	2022-01-06 21:05:59 +00:00
Yingrong Zhao	0b500a30e4	satellite/audit: move audit metrics out of reporter Since we are sharing the reporting logic between repair and audit. We need to remove metric reporting logic in reporter. Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21	2021-09-16 17:58:56 +00:00
Michał Niewrzał	c258f4bbac	private/testplanet: move Metabase outside Metainfo for satellite At some point we moved metabase package outside Metainfo but we didn't do that for satellite structure. This change refactors only tests. When uplink will be adjusted we can remove old entries in Metainfo struct. Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb	2021-09-09 07:15:51 +00:00
Yingrong Zhao	b64d8084e1	satellite/audit: fix metric reporting when fail to complete an audit Change-Id: I39df8d4291db35afbba824281cb23438a91c45db	2021-08-31 17:02:30 +00:00
Cameron Ayer	adc0fbddfa	satellite/audit: don't fail nodes for audit if not enough pieces downloaded In most situations where we would not get enough shares to complete an audit, something has probably gone wrong like a forgotten delete, and nodes should not be failed. We have an alert when this occurs. Check the logs to see what happened. If we decide the nodes should get audit failures, we can do it manually. Change-Id: Ib6e408082048d31197c37ebfd7f9031135fc938f	2021-07-20 20:28:18 +00:00
Michał Niewrzał	70e6cdfd06	satellite/audit: move to segmentloop Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51	2021-06-28 11:32:00 +00:00
Egon Elbre	267506bb20	satellite/metabase: move package one level higher metabase has become a central concept and it's more suitable for it to be directly nested under satellite rather than being part of metainfo. metainfo is going to be the "endpoint" logic for handling requests. Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198	2021-04-21 15:54:22 +03:00
Michał Niewrzał	27ae0d1f15	satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds add NewRedundancy parameter for UpdateSegmentPieces method to give ability to do that. As a part of change NewPieces are validated against NewRedundancy. Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576	2021-03-22 08:12:56 +00:00
Kaloyan Raev	9aa61245d0	satellite/audits: migrate to metabase Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2	2020-12-17 14:38:48 +02:00
Kaloyan Raev	53b7fd7b00	satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified We now have the piece hashes verified for all segments on all production satellites. We can remove the code that handles the case where piece hashes are not verified. This would make easier the migration of services from PointerDB to the new metabase. For consistency, PieceHashesVerified is still set to true in PointerDB for new segments. Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183	2020-11-24 11:09:48 +02:00
Moby von Briesen	db6bc6503d	satellite/metainfo: Update metainfo RS config to more easily support multiple RS schemes. Make metainfo.RSConfig a valid pflag config value. This allows us to configure the RSConfig as a string like k/m/o/n-shareSize, which makes having multiple supported RS schemes easier in the future. RS-related config values that are no longer needed have been removed (MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify). Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1	2020-11-09 22:16:13 +00:00
Kaloyan Raev	1f386db566	cmd/satellite: remove metainfo commands (#3955 )	2020-10-22 13:33:09 +03:00
Kaloyan Raev	1aeb14e65e	satellite/audit: do not delete expired segments A year ago we made the audit service deleting expired segments. Meanwhile, we introduced an expired deletetion sub-service in the metainfo service which sole purpose is deleting expired segments. Therefore, now we are removing this responsibility from the audit service. It will continue to avoid reporting failures on expired segments, but it would not delete them anymore. We do this to cleanup responsibilities in advance of the metainfo refactoring. Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4	2020-10-22 08:24:16 +00:00
paul cannon	360ab17869	satellite/audit: use LastIPAndPort preferentially This preserves the last_ip_and_port field from node lookups through CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later calls to (Verifier).GetShare() can try to use that IP and port. If a connection to the given IP and port cannot be made, or the connection cannot be verified and secured with the target node identity, an attempt is made to connect to the original node address instead. A similar change is not necessary to the other CreateOrderLimits functions, because they already replace node addresses with the cached IP and port as appropriate. We might want to consider making a similar change to CreateGetRepairOrderLimits(), though. The audit situation is unique because the ramifications are especially powerful when we get the address wrong. Failing a single audit can have a heavy cost to a storage node. We need to make extra effort in order to avoid imposing that cost unfairly. Situation 1: If an audit fails because the repair worker failed to make a DNS query (which might well be the fault on the satellite side), and we have last_ip_and_port information available for the target node, it would be unfair not to try connecting to that last_ip_and_port address. Situation 2: If a node has changed addresses recently and the operator correctly changed its DNS entry, but we don't bother querying DNS, it would be unfair to penalize the node for our failure to connect to it. So the audit worker must try both last_ip_and_port _and_ the node address as supplied by the SNO. We elect here to try last_ip_and_port first, on the grounds that (a) it is expected to work in the large majority of cases, and (b) there should not be any security concerns with connecting to an out-or-date address, and (c) avoiding DNS queries on the satellite side helps alleviate satellite operational load. Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200	2020-10-21 13:34:40 +00:00
Kaloyan Raev	e7f2ec7ddf	satellite/audit: fix sanity check for verify-piece-hashes command The VerifyPieceHashes method has a sanity check for the number pieces to be removed from the pointer after the audit for verifying the piece hashes. This sanity check failed when we executed the command on the production satellites because the Verify command removes Fails and PendingAudits nodes from the audit report if piece_hashes_verified = false. A new temporary UsedToVerifyPieceHashes flag is added to audits.Verifier. It is set to true only by the verify-piece-hashes command. If the flag is true then the Verify method will always include Fails and PendingAudits nodes in the report. Test case is added to cover this use case. Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8	2020-10-07 17:17:48 +03:00
Yingrong Zhao	c085a17a52	bump common and uplink to latest Change-Id: I717f0214dd9973acd51b7732c5d64587f610c805	2020-10-01 15:38:58 +00:00
Kaloyan Raev	b409b53f7f	cmd/satellite: command for verifying piece hashes Jira: https://storjlabs.atlassian.net/browse/PG-69 There are a number of segments with piece_hashes_verified = false in their metadata) on US-Central-1, Europe-West-1, and Asia-East-1 satellites. Most probably, this happened due to a bug we had in the past. We want to verify them before executing the main migration to metabase. This would simplify the main migration to metabase with one less issue to think about. Change-Id: I8831af1a254c560d45bb87d7104e49abd8242236	2020-09-29 10:58:24 +00:00
Michal Niewrzal	aa47e70f03	satellite/metainfo: use metabase.SegmentKey with metainfo.Service Instead of using string or []byte we will be using dedicated type SegmentKey. Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220	2020-09-03 15:11:32 +00:00
Moby von Briesen	5d21e85529	satellite/audit/queue: Separate audit queue into two separate structs. * The audit worker wants to get items from the queue and process them. * The audit chore wants to create new queues and swap them in when the old queue has been processed. This change adds a "Queues" struct which handles the concurrency issues around the worker fetching a queue and the chore swapping a new queue in. It simplifies the logic of the "Queue" struct to its bare bones, so that it behaves like a normal queue with no need to understand the details of swapping and worker/chore interactions. Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b	2020-08-31 20:51:25 +00:00
Egon Elbre	c86c732fc0	satellite: simplify tests satellite.DB.Console().Projects().GetAll database query can be replaced with planet.Uplinks[0].Projects[0].ID Change-Id: I73b82b91afb2dde7b690917345b798f9d81f6831	2020-08-28 22:28:04 +00:00
Egon Elbre	3ca405aa97	satellite/orders: use metabase types as arguments Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef	2020-08-28 15:52:37 +03:00
Moby von Briesen	4f28bf0720	satellite/audit: Do not return errors from Verify or Reverify on segment modified, expired, or deleted If a segment is deleted, is modified, or expires during an audit, this is not problematic, so we should not return errors. Functionally, nothing changes, but our metrics around audit success rate will be improved after this change. Change-Id: Ic11df056b2c73894b67a55894bd4d58c00470606	2020-08-26 13:24:00 +00:00
Moby von Briesen	76030a8237	satellite/audit/{queue,chore}: Wait for audit queue to be finished before swapping * Do not swap the active audit queue with the pending audit queue until the active audit queue is empty. * Do not begin creating a new pending audit queue until the existing pending audit queue has been swapped to the active queue. Change-Id: I81db5bfa01458edb8cdbe71f5baeebdcb1b94317	2020-07-28 16:56:26 +00:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
Egon Elbre	bcd93ee375	private/testplanet: add StopNodeAndUpdate This was commonly used and code with it can be simplified. Change-Id: I2f2b91f7de54269aee6ef027f97f9e8a7d222e39	2020-05-08 13:02:19 +00:00
Isaac Hess	5a85e8d749	satellite/audit: Fix flaky TestVerifierSlowDownload TestVerifierSlowDownload would sometimes not have enough nodes finish in the allotted deadline period. This increases the deadline and also does not assert that exactly 3 have finished. Instead, in keeping with the purpose of the test, it asserts that the slow download is never counted as a success and is always counted as a pending audit in the final report. Change-Id: I180734fcc4a499420c75164bad6253ed155d87de	2020-04-30 15:58:47 -06:00
Egon Elbre	11a44cdd88	all: don't depend on gogo/proto directly Change-Id: I8822dea0d1b7b99e0b828e0373a0308a42dde2be	2020-04-08 17:32:15 +00:00
Egon Elbre	e1a443b04a	private/testplanet: allow modifying created database Instead of providing the database from outside to testplanet create it inside and then allow wrapping and modifying it. This is more convenient to use. Change-Id: I9b8f69e6e0a19ff984b4e2bfe927c9100c77bc6c	2020-03-27 19:14:48 +00:00

1 2

68 Commits