storj

Author	SHA1	Message	Date
paul cannon	355ea2133b	satellite/audit: remove pieces when audits fail When pieces fail an audit (hard fail, meaning the node acknowledged it did not have the piece or the piece was corrupted), we will now remove those pieces from the segment. Previously, we did not do this, and some node operators were seeing the same missing piece audited over and over again and losing reputation every time. This change will include both verification and reverification audits. It will also apply to pieces found to be bad during repair, if repair-to-reputation reporting is enabled. Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7	2023-06-22 14:19:00 +00:00
paul cannon	c2710cc78d	satellite/audit: improve error handling * Don't use rpcstatus.Unknown as an indicator of dial failure; instead, GetShare now indicates with a per-share field where a failure happened (DialFailure, RequestFailure, NoFailure). Use that information in Verify() to determine how to treat the source node. * Add a test that replaces a storage node with a black hole, so that connections there will always time out. Make sure we handle that case correctly. Refs: https://github.com/storj/storj/issues/5632 Change-Id: I513a53520fb48b7187d4c4d7e14e01e2cfc0a721	2023-05-11 22:55:26 +00:00
JT Olio	4362761fc7	satellite/audit: fix go1.19 dial timeouts and log more Change-Id: Ide17c1b8e0ca8c86f305bea1b4ae553cc4cb60d0	2023-02-28 17:09:47 +00:00
paul cannon	fc905a15f7	satellite/audit: newContainment->containment Now that all the reverification changes have been made and the old code is out of the way, this commit renames the new things back to the old names. Mostly, this involves renaming "newContainment" to "containment" or "NewContainment" to "Containment", but there are a few other renames that have been promised and are carried out here. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I34e2b857ea338acbb8421cdac18b17f2974f233c	2022-12-16 17:59:52 +00:00
paul cannon	0342ca1aa6	satellite/audit: delete now-unused code Now that we are doing scalable piecewise reverifications, the code for handling the old way of doing things (containment, pending audits, reporting, testing) can now be removed. Refs: https://github.com/storj/storj/issues/5230 Change-Id: Ief1a75f423eff682e8f3d57804e343b3409a6631	2022-12-16 14:53:39 +00:00
paul cannon	a66503b444	satellite/audit: Begin using piecewise reverifications This commit pulls the big switch! We have been setting up piecewise reverifications (the workers for which can be scaled independently of the core) for several commits now, and this commit actually begins making use of them. The core of this commit is fairly small, but it requires changing the semantics in all the tests that relate to reverifications, so it ends up being a large change. The changes to the tests are mostly mechanical and repetitive, though, so reviewers needn't worry much. Refs: https://github.com/storj/storj/issues/5230 Change-Id: Ibb421cc021664fd6e0096ffdf5b402a69b2d6f18	2022-12-16 14:21:13 +00:00
paul cannon	47b9134f76	satellite/audit: add IdentifyContainedNodes This method on the Verifier allows the caller to find, out of the nodes holding pieces in a given segment, which ones are contained. This method is not yet being used. It will be in a future commit. Refs: https://github.com/storj/storj/issues/5230 Change-Id: I242cd999913ca4dabbe8a62767ed4869b31fca04	2022-12-13 20:46:43 +00:00
paul cannon	378b8915c4	satellite/{satellitedb,audit}: add NewContainment NewContainment will replace Containment later in this commit chain, but for now it is not yet being used. NewContainment will allow a node to be contained for multiple pending reverify jobs at a time. It is implemented by way of the reverify queue. Refs: https://github.com/storj/storj/issues/5231 Change-Id: I126eda0b3dfc4710a88fe4a5f41780618ec19101	2022-12-07 18:03:37 +00:00
JT Olio	58a9c55f36	mod: bump dependencies - storj.io/common Change-Id: Ib78154acc253a13683495abfdd96d702625fdce8	2022-10-19 17:01:53 +00:00
paul cannon	802ff18bd8	satellite/audit: better handling of piece fetch errors We have an alert on `not_enough_shares_for_audit` which fires too frequently. Every time so far, it has been because of a network blip of some nature on the satellite side. Satellite operators are expected to have other means in place for alerting on network problems and fixing them, so it's not necessary for the audit framework to act in that way. Instead, in this change, we add three new metrics, `audit_not_enough_nodes_online`, `audit_not_enough_shares_acquired`, and `audit_suspected_network_problem`. When an audit fails, and emits `not_enough_shares_for_audit`, we will now determine whether it looks like we are having network problems (most errors are connection failures, possibly also some successful connections which subsequently time out) or whether something else has happened. After this is deployed, we can remove the alert on `not_enough_shares_for_audit` and add new alerts on `audit_not_enough_nodes_online` and `audit_not_enough_shares_acquired`. `audit_suspected_network_problem` does not need an alert. Refs: https://github.com/storj/storj/issues/4669 Change-Id: Ibb256bc19d2578904f71f5229111ac98e5212fcb	2022-09-28 17:02:06 +00:00
Yingrong Zhao	1f8f7ebf06	satellite/{audit, reputation}: fix potential nodes reputation status inconsistency The original design had a flaw which can potentially cause discrepancy for nodes reputation status between reputations table and nodes table. In the event of a failure(network issue, db failure, satellite failure, etc.) happens between update to reputations table and update to nodes table, data can be out of sync. This PR tries to fix above issue by passing through node's reputation from the beginning of an audit/repair(this data is from nodes table) to the next update in reputation service. If the updated reputation status from the service is different from the existing node status, the service will try to update nodes table. In the case of a failure, the service will be able to try update nodes table again since it can see the discrepancy of the data. This will allow both tables to be in-sync eventually. Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122	2022-01-06 21:05:59 +00:00
Yingrong Zhao	0b500a30e4	satellite/audit: move audit metrics out of reporter Since we are sharing the reporting logic between repair and audit. We need to remove metric reporting logic in reporter. Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21	2021-09-16 17:58:56 +00:00
Yaroslav Vorobiov	ee4361fe0d	satellite/audit: fix segment stripes length calculation GetRandomStripe function to randomly select a segment stripe to audit was using `segment.EncryptedSize/segment.Redundancy.StripeSize()`. Since integer division truncates it leads to skipping last stripe if its size is less than stripe size. Use `Redundancy.StripeCount` to get correct stripe count. Change-Id: Ida09e035be30a21219ab3e1aedd66af8be707d1b	2021-09-01 13:25:20 +03:00
Yingrong Zhao	b64d8084e1	satellite/audit: fix metric reporting when fail to complete an audit Change-Id: I39df8d4291db35afbba824281cb23438a91c45db	2021-08-31 17:02:30 +00:00
Cameron Ayer	28cb690618	satellite/audit: log error and increment metric if shares cannot be verified If we encounter an error during the infectious error correction, we just add it to the errlist to be logged at the worker level. We want to make sure we know about this if it happens. Give it its own error log and increment a monkit metric. Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27	2021-08-27 15:28:16 +00:00
Cameron Ayer	24e02b6352	satellite/{audit,orders}: if not enough nodes for audit order limits, increment metric and wrap error with ErrNotEnoughShares Increment a metric so we can get alerts. Wrap the error so we can search the logs for it. Change-Id: I3827aa306c431009828014d9d9afff8dfc057ee6	2021-08-26 20:14:05 +00:00
Cameron Ayer	a8f125c671	satellite:{audit,repair}: log additional info when we can't download enough pieces When we can't complete an audit or repair, we need more information about what happened during each individual share/piece download. In audit, add the number of offline, unknown, contained, failed nodes to the error log. In repair, combine the errors from each download and add them to the error log. Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba	2021-08-09 22:57:49 +00:00
Cameron Ayer	adc0fbddfa	satellite/audit: don't fail nodes for audit if not enough pieces downloaded In most situations where we would not get enough shares to complete an audit, something has probably gone wrong like a forgotten delete, and nodes should not be failed. We have an alert when this occurs. Check the logs to see what happened. If we decide the nodes should get audit failures, we can do it manually. Change-Id: Ib6e408082048d31197c37ebfd7f9031135fc938f	2021-07-20 20:28:18 +00:00
Michał Niewrzał	70e6cdfd06	satellite/audit: move to segmentloop Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51	2021-06-28 11:32:00 +00:00
Michał Niewrzał	8ce619706b	satellite/audit: migrate to new segment_pending_audit table Currently, pending audit is finding segment by segment location (path) because we want to move audit to segmentloop and we will have only StreamID and Position we need to add columns for those fields. Altering existing table can cause issues while migration and deployment. Cleaner choice is to make new table. This change contains migration with new segment_pending_audit table that will replace pending_audits table and adjustments to use new table in the code. Table pending_audits will be dropped with next release. Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f	2021-06-28 13:19:49 +02:00
Jeff Wendling	d674bc9c52	satellite/audit: include failing segment info in logs Change-Id: I972fe19a2479f48bccc8a87a282467345a9dc1ec	2021-06-10 13:47:22 +03:00
Cameron Ayer	53322bb0a7	satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats Until now, whenever audits were recorded we would try to delete the node from containment just in case it exists. Since we now want to treat segment repair downloads as audits, this would erroneously remove nodes from containment, as repair does not go through a Reverify step. With this changeset, (Batch)UpdateStats will not remove nodes from containment. The Reverify method will remove all necessary nodes from containment. Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f	2021-06-01 21:02:44 +00:00
Egon Elbre	910eec8eee	satellite/metainfo: remove MetabaseDB interface Currently the interface is not useful. When we need to vary the implementation for testing purposes we can introduce a local interface for the service/chore that needs it, rather than using the large api. Unfortunately, this requires adding a cleanup callback for tests, there might be a better solution to this problem. Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682	2021-05-13 13:22:14 +00:00
Egon Elbre	69b149a66f	mod: bump uplink uplink stopped using zap, hence some of the private methods needed to be changed. Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5	2021-05-06 14:48:36 +00:00
Egon Elbre	267506bb20	satellite/metabase: move package one level higher metabase has become a central concept and it's more suitable for it to be directly nested under satellite rather than being part of metainfo. metainfo is going to be the "endpoint" logic for handling requests. Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198	2021-04-21 15:54:22 +03:00
Michał Niewrzał	237782813b	Merge remote-tracking branch 'origin/multipart-upload' Change-Id: If6c5a450b238adab55d1e0dea67d01e5f5768a9f	2021-03-23 09:44:49 +01:00
Cameron Ayer	a04495713d	satellite/audit: add missing logs for audit failure conditions Among other conditions, nodes fail audits by returning incorrect data and by reaching the max reverify count, but we weren't logging these events. This commit adds the missing logs. Change-Id: I80749a7e95e8cb97bc8dd7dac1e523e223114b7f	2021-03-18 17:33:11 +00:00
Michał Niewrzał	67e26aafcd	Merge remote-tracking branch 'origin/main' into multipart-upload Change-Id: I9b183323cb470185be22f7c648bb76917d2e6fca	2021-03-10 08:53:38 +01:00
Cameron Ayer	a44974a2f9	satellite/audit: fix pointless containment deletions Previously if node was not found in containment, it was given the status, 'skipped'. We later try to delete skipped nodes from containment. To fix this, add a new status called 'remove' to differentiate nodes which should be skipped and nodes which should be deleted. Change-Id: Ic09e62dc9723c89d0c9f968ce68c039114a9d74e	2021-03-02 13:40:18 -05:00
Kaloyan Raev	9aa61245d0	satellite/audits: migrate to metabase Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2	2020-12-17 14:38:48 +02:00
Egon Elbre	12055e7864	all: minor cleanups Change-Id: I4248dbe36a62a223b06135254b32851485a2eec1	2020-12-16 10:47:46 +00:00
Stefan Benten	494bd5db81	all: golangci-lint v1.33.0 fixes (#3985)	2020-12-05 17:01:42 +01:00
Kaloyan Raev	53b7fd7b00	satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified We now have the piece hashes verified for all segments on all production satellites. We can remove the code that handles the case where piece hashes are not verified. This would make easier the migration of services from PointerDB to the new metabase. For consistency, PieceHashesVerified is still set to true in PointerDB for new segments. Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183	2020-11-24 11:09:48 +02:00
Kaloyan Raev	1f386db566	cmd/satellite: remove metainfo commands (#3955)	2020-10-22 13:33:09 +03:00
Kaloyan Raev	1aeb14e65e	satellite/audit: do not delete expired segments A year ago we made the audit service deleting expired segments. Meanwhile, we introduced an expired deletion sub-service in the metainfo service which sole purpose is deleting expired segments. Therefore, now we are removing this responsibility from the audit service. It will continue to avoid reporting failures on expired segments, but it would not delete them anymore. We do this to cleanup responsibilities in advance of the metainfo refactoring. Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4	2020-10-22 08:24:16 +00:00
paul cannon	360ab17869	satellite/audit: use LastIPAndPort preferentially This preserves the last_ip_and_port field from node lookups through CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later calls to (Verifier).GetShare() can try to use that IP and port. If a connection to the given IP and port cannot be made, or the connection cannot be verified and secured with the target node identity, an attempt is made to connect to the original node address instead. A similar change is not necessary to the other CreateOrderLimits functions, because they already replace node addresses with the cached IP and port as appropriate. We might want to consider making a similar change to CreateGetRepairOrderLimits(), though. The audit situation is unique because the ramifications are especially powerful when we get the address wrong. Failing a single audit can have a heavy cost to a storage node. We need to make extra effort in order to avoid imposing that cost unfairly. Situation 1: If an audit fails because the repair worker failed to make a DNS query (which might well be the fault on the satellite side), and we have last_ip_and_port information available for the target node, it would be unfair not to try connecting to that last_ip_and_port address. Situation 2: If a node has changed addresses recently and the operator correctly changed its DNS entry, but we don't bother querying DNS, it would be unfair to penalize the node for our failure to connect to it. So the audit worker must try both last_ip_and_port _and_ the node address as supplied by the SNO. We elect here to try last_ip_and_port first, on the grounds that (a) it is expected to work in the large majority of cases, and (b) there should not be any security concerns with connecting to an out-of-date address, and (c) avoiding DNS queries on the satellite side helps alleviate satellite operational load. Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200	2020-10-21 13:34:40 +00:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Kaloyan Raev	e7f2ec7ddf	satellite/audit: fix sanity check for verify-piece-hashes command The VerifyPieceHashes method has a sanity check for the number pieces to be removed from the pointer after the audit for verifying the piece hashes. This sanity check failed when we executed the command on the production satellites because the Verify command removes Fails and PendingAudits nodes from the audit report if piece_hashes_verified = false. A new temporary UsedToVerifyPieceHashes flag is added to audits.Verifier. It is set to true only by the verify-piece-hashes command. If the flag is true then the Verify method will always include Fails and PendingAudits nodes in the report. Test case is added to cover this use case. Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8	2020-10-07 17:17:48 +03:00
Kaloyan Raev	b409b53f7f	cmd/satellite: command for verifying piece hashes Jira: https://storjlabs.atlassian.net/browse/PG-69 There are a number of segments with piece_hashes_verified = false in their metadata on US-Central-1, Europe-West-1, and Asia-East-1 satellites. Most probably, this happened due to a bug we had in the past. We want to verify them before executing the main migration to metabase. This would simplify the main migration to metabase with one less issue to think about. Change-Id: I8831af1a254c560d45bb87d7104e49abd8242236	2020-09-29 10:58:24 +00:00
Michal Niewrzal	aa47e70f03	satellite/metainfo: use metabase.SegmentKey with metainfo.Service Instead of using string or []byte we will be using dedicated type SegmentKey. Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220	2020-09-03 15:11:32 +00:00
Egon Elbre	3ca405aa97	satellite/orders: use metabase types as arguments Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef	2020-08-28 15:52:37 +03:00
Moby von Briesen	4f28bf0720	satellite/audit: Do not return errors from Verify or Reverify on segment modified, expired, or deleted If a segment is deleted, is modified, or expires during an audit, this is not problematic, so we should not return errors. Functionally, nothing changes, but our metrics around audit success rate will be improved after this change. Change-Id: Ic11df056b2c73894b67a55894bd4d58c00470606	2020-08-26 13:24:00 +00:00
Qweder93	01bb2bd17d	satellite/audit: verifier checks if node made success GE before auditing Change-Id: Ia6cde4e9fcf11020a5301d38065f7159f276eb80	2020-08-17 23:37:57 +03:00
Egon Elbre	94a09ce20b	all: add missing dots Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a	2020-08-11 17:50:01 +03:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
Egon Elbre	ed627144ed	all: use DialNodeURL throughout the codebase Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c	2020-05-20 10:36:30 +00:00
paul cannon	0c8c11b251	satellite/audit: add not_enough_shares_for_audit counter We have been using the SQL expression `name='(*Verifier).Verify' AND error_name='not enough shares for successful audit'` thus far to detect cases of this problem and alert on them. Unfortunately, since this rarely (hopefully never) happens, influxdb has no data for most of the auditor instances, and when it has no data for a time series, it returns no columns either. This makes Redash upset when it tries to perform a query for an alert and can't find the column whose value it expects to check. This change should make it so zero values are reported when the problem has not happened, and higher values when it has. Change-Id: I79e5e000f879678b661dac88caae1e2915b39ab1	2020-04-03 17:00:50 +00:00
littleskunk	23e5a0471f	satellite/audit: clean up logging (#3832) Co-authored-by: Ivan Fraixedes <ivan@fraixed.es>	2020-03-30 12:09:50 -06:00
Bill Thorp	94c11c5212	satellite: remove some unnecessary UTC() calls Fixes some easy cases of extraneous UTC() calls Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a	2020-03-13 13:49:44 +00:00
Jennifer Johnson	0d60c1a4b2	satellite/audit: fix checkSegmentAltered to detect segments that have changed during an audit - Previously, checkSegmentAltered only checked for segments that were replaced but we want to detect all changes to a segment that occurred while an audit was being conducted. - Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified segments were not being removed from containment mode. - Filled in gaps in reverify testing to ensure nodes are properly removed from containment. Change-Id: Icd96d369278987200fd28581395725438972b292	2020-03-05 19:05:39 +00:00

77 Commits