storj

Author	SHA1	Message	Date
Egon Elbre	b8006e192b	satellite/audit: use a larger delay in the test The MinDownloadTimeout of 950ms and the delay of 1s were quite close, possibly causing flaky behavior in TestVerifierSlowDownload. Change-Id: I4f6c1554a118b21427357642abe39986fd0af38d	2022-06-28 17:03:23 +03:00
paul cannon	737d7c7dfc	satellite/reputation: new ApplyUpdates() method The ApplyUpdates() method on the reputation.DB interface acts like the similar Update() method, but can allow for applying the changes from multiple audit events, instead of only one. This will be necessary for the reputation write cache, which will batch up changes to each node's reputation in order to flush them periodically. Refs: https://github.com/storj/storj/issues/4601 Change-Id: I44cc47767ea2d9423166bb8fed080c8a11182041	2022-06-07 15:22:25 +00:00
paul cannon	fd01c6cc25	satellite/{repair,audit}: simplify reputation reporter Also, make it an interface so that the upcoming write cache can be dropped in to the same place. Change-Id: I2c286743825e647c0cef5b6578245391851fa10c	2022-05-10 14:04:43 +00:00
Michał Niewrzał	307295977d	satellite/{audit,metrics}: optimize loop methods What was applied: * avoid extra map lookups * reorganize monkit for less cpu usage Change-Id: I70575f404f717f7905b27d43888cbd7489f0176d	2022-05-05 15:10:56 +00:00
Yaroslav Vorobiov	3f47d19aa6	satellite/overlay: add disqualification reason Add disqualification reason to NodeDossier. Extend DB.DisqualifyNode with disqualification reason. Extend reputation Service.TestDisqualifyNode with disqualification reason. Change-Id: I8611b6340c7f42ac1bb8bd0fd7f0648ad650ab2d	2022-04-20 13:29:31 +00:00
Erik van Velzen	86d742f7c6	satellite/audit: verify auditing of copies Check that audit works in the face of copies. Closes https://github.com/storj/storj/issues/4695 Change-Id: I1ee79a73c28e3f4842eebe8c4e4cd9ecf2e51e57	2022-04-12 15:24:54 +00:00
Fadila Khadar	29fd36a20e	satellite/repairer: handle excluded countries For nodes in excluded areas, we don't necessarily want to remove them from the pointer, but we do want to increase the number of pieces in the segment in case those excluded area nodes go down. To do that, we increase the number of pieces repaired by the number of pieces in excluded areas. Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a	2022-03-14 10:59:36 -04:00
Mya	05a17ef42d	deps: upgrade storj.io/common In addition to upgrading the storj.io/common library, this change moves off the TCPConnector in favor of the HybridConnector per the deprecation warning. Change-Id: I7e7e1e7568e8b95e4a99ad9caa158a799e68e1e3	2022-02-16 18:59:19 +00:00
Yingrong Zhao	1f8f7ebf06	satellite/{audit, reputation}: fix potential nodes reputation status inconsistency The original design had a flaw which can potentially cause a discrepancy in a node's reputation status between the reputations table and the nodes table. If a failure (network issue, db failure, satellite failure, etc.) happens between the update to the reputations table and the update to the nodes table, data can be out of sync. This PR tries to fix the above issue by passing through the node's reputation from the beginning of an audit/repair (this data is from the nodes table) to the next update in the reputation service. If the updated reputation status from the service is different from the existing node status, the service will try to update the nodes table. In the case of a failure, the service will be able to try to update the nodes table again since it can see the discrepancy in the data. This will allow both tables to be in sync eventually. Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122	2022-01-06 21:05:59 +00:00
dlamarmorgan	b3cea3d1b6	satellite/audit: account for piece size during audit reservoir sampling Treat the piece size as a weight, and perform weighted reservoir sampling as given in Algorithm A-Chao (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-Chao) Change-Id: I299d0026d9e02d03b3d2130b0f32192928e6e326	2021-12-01 18:17:52 +00:00
Egon Elbre	8eebbf3d7d	satellite/audit: fix TestReverify timeouts Currently the slow db was sleeping for 1s and the timeout for audit was 1s. There's a slight chance that the timeout won't trigger on such a small difference. Increase the slow node sleep to 10x of the timeout. Hopefully fixes #4268 Change-Id: Ifdab45141b3fc7c62bde11813dbc534b3255fe59	2021-11-09 13:16:29 +00:00
Cameron Ayer	56fe636123	satellite/{reputation/satellitedb}: remove references to contained column in reputations table We don't use this column for anything. If you want to know if a node is contained, you can check the pending_audits table. Change-Id: I5671722a5fc6e1749d3a49e187a56556000ff941	2021-10-14 19:59:03 +00:00
Cameron Ayer	bb21551a9c	satellite/satellitedb: remove references to contained column in nodes table We don't use this column for anything. If you want to know if a node is contained, you can check the pending_audits table. Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04	2021-10-14 19:17:46 +00:00
Michał Niewrzał	1ed5db1467	satellite/metainfo: simplifying limits code It's a very simple change to reduce code duplication. Change-Id: Ia135232e3aefd094f76c6988e82e297be028e174	2021-09-28 06:22:13 +00:00
Yaroslav Vorobiov	469ae72c19	satellite/repair: update audit records during repair Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0	2021-09-24 00:48:13 +00:00
Yingrong Zhao	0b500a30e4	satellite/audit: move audit metrics out of reporter Since we are sharing the reporting logic between repair and audit, we need to remove the metric reporting logic from the reporter. Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21	2021-09-16 17:58:56 +00:00
Egon Elbre	1aec831d98	satellite/audit,storage: increase sleep delay in TestMaxVerifyCount Currently TestMaxVerifyCount flakes in some tests; try increasing the sleep time to ensure that things are slow enough to trigger the error condition. Also pass ctx to all the funcs so we can handle sleep better. Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00	2021-09-10 15:30:37 +00:00
Michał Niewrzał	c258f4bbac	private/testplanet: move Metabase outside Metainfo for satellite At some point we moved metabase package outside Metainfo but we didn't do that for satellite structure. This change refactors only tests. When uplink will be adjusted we can remove old entries in Metainfo struct. Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb	2021-09-09 07:15:51 +00:00
Yaroslav Vorobiov	ee4361fe0d	satellite/audit: fix segment stripes length calculation The GetRandomStripe function, which randomly selects a segment stripe to audit, was using `segment.EncryptedSize/segment.Redundancy.StripeSize()`. Since integer division truncates, this leads to skipping the last stripe if its size is less than the stripe size. Use `Redundancy.StripeCount` to get the correct stripe count. Change-Id: Ida09e035be30a21219ab3e1aedd66af8be707d1b	2021-09-01 13:25:20 +03:00
Yingrong Zhao	b64d8084e1	satellite/audit: fix metric reporting when fail to complete an audit Change-Id: I39df8d4291db35afbba824281cb23438a91c45db	2021-08-31 17:02:30 +00:00
Cameron Ayer	28cb690618	satellite/audit: log error and increment metric if shares cannot be verified If we encounter an error during the infectious error correction, we just add it to the errlist to be logged at the worker level. We want to make sure we know about this if it happens. Give it its own error log and increment a monkit metric. Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27	2021-08-27 15:28:16 +00:00
Cameron Ayer	24e02b6352	satellite/{audit,orders}: if not enough nodes for audit order limits, increment metric and wrap error with ErrNotEnoughShares Increment a metric so we can get alerts. Wrap the error so we can search the logs for it. Change-Id: I3827aa306c431009828014d9d9afff8dfc057ee6	2021-08-26 20:14:05 +00:00
Cameron Ayer	5a1a29a62e	satellite/audit: fix containment bug where nodes not removed When a node gets enough timeouts, it is supposed to be removed from pending_audits and get an audit failure. We would give them a failure, but we missed the removal. This change fixes it. Change-Id: I2f7014e28d7d9b01a9d051f5bbb4f67c86c7b36b	2021-08-20 14:48:27 +00:00
Cameron Ayer	70296c5050	satellite/audit: change wording of audit worker error log "audit failed" is already used when a node fails an audit. That makes searching for this higher level audit worker error more difficult. Additionally, the presence of errors from the audit worker doesn't necessarily mean the audit failed. Reword the error message to "error(s) during audit" Change-Id: I0aab12c73c18d4bd962c5d8ac8a17cabcec022e6	2021-08-20 13:27:16 +00:00
Cameron Ayer	a8f125c671	satellite:{audit,repair}: log additional info when we can't download enough pieces When we can't complete an audit or repair, we need more information about what happened during each individual share/piece download. In audit, add the number of offline, unknown, contained, failed nodes to the error log. In repair, combine the errors from each download and add them to the error log. Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba	2021-08-09 22:57:49 +00:00
Yingrong Zhao	58238d850c	satellite/{audit, accounting}: use reputation store in tests Change-Id: I86a8ccf5dcee8d108196a9f67a476fe0ccbd8257	2021-07-28 13:21:55 -04:00
Yingrong Zhao	6c7bf357cd	satellite/{reputation,audit,overlay}: replace overlay with reputation package in audit This PR implements the reputation store and replaces overlay in the audit service with that store for storing a node's audit stats. In order to keep the changeset smaller, most of the changes in this PR are for copying audit logic from overlay to the reputation package. In a following PR, the duplicated code will be removed from overlay. Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1	2021-07-28 13:10:48 -04:00
Cameron Ayer	adc0fbddfa	satellite/audit: don't fail nodes for audit if not enough pieces downloaded In most situations where we would not get enough shares to complete an audit, something has probably gone wrong like a forgotten delete, and nodes should not be failed. We have an alert when this occurs. Check the logs to see what happened. If we decide the nodes should get audit failures, we can do it manually. Change-Id: Ib6e408082048d31197c37ebfd7f9031135fc938f	2021-07-20 20:28:18 +00:00
Michał Niewrzał	70e6cdfd06	satellite/audit: move to segmentloop Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51	2021-06-28 11:32:00 +00:00
Michał Niewrzał	8ce619706b	satellite/audit: migrate to new segment_pending_audit table Currently, pending audit finds a segment by segment location (path). Because we want to move audit to segmentloop, where we will have only StreamID and Position, we need to add columns for those fields. Altering the existing table can cause issues during migration and deployment. The cleaner choice is to make a new table. This change contains a migration with a new segment_pending_audit table that will replace the pending_audits table, and adjustments to use the new table in the code. Table pending_audits will be dropped with the next release. Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f	2021-06-28 13:19:49 +02:00
JT Olio	6949dc0bac	satellite/metaloop: missing monitoring on observers Change-Id: I630fbb0448c8d08b426486b3e49abfbca03332a6	2021-06-15 13:39:13 +00:00
Jeff Wendling	d674bc9c52	satellite/audit: include failing segment info in logs Change-Id: I972fe19a2479f48bccc8a87a282467345a9dc1ec	2021-06-10 13:47:22 +03:00
Jeff Wendling	944bceabcd	satellite/audit: fix reservoir sampling bias Change-Id: Icc522fd86538b8182a1b7d42c1588c32a257acaf	2021-06-10 13:47:22 +03:00
JT Olio	da9ca0c650	testplanet/satellite: reduce the number of places default values need to be configured Satellites set their configuration values to default values using cfgstruct, however, it turns out our tests don't test these values at all! Instead, they have a completely separate definition system that is easy to forget about. As is to be expected, these values have drifted, and it appears in a few cases test planet is testing unreasonable values that we won't see in production, or perhaps worse, features enabled in production were missed and weren't enabled in testplanet. This change makes it so all values are configured the same, systematic way, so it's easy to see when test values are different than dev values or release values, and it's harder to forget to enable features in testplanet. In terms of reviewing, this change should be actually fairly easy to review, considering private/testplanet/satellite.go keeps the current config system and the new one and confirms that they result in identical configurations, so you can be certain that nothing was missed and the config is all correct. You can also check the config lock to see what actual config values changed. Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e	2021-06-01 22:14:17 +00:00
Cameron Ayer	53322bb0a7	satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats Until now, whenever audits were recorded we would try to delete the node from containment just in case it exists. Since we now want to treat segment repair downloads as audits, this would erroneously remove nodes from containment, as repair does not go through a Reverify step. With this changeset, (Batch)UpdateStats will not remove nodes from containment. The Reverify method will remove all necessary nodes from containment. Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f	2021-06-01 21:02:44 +00:00
Egon Elbre	10a0216af5	satellite/metainfo: use range for specifying download limit Previously the object range was not used for calculating order limit. This meant that even if you were downloading only a small range it would account bandwidth based on the full segment. This doesn't fully address the accounting since the lazy segment downloads do not send their requested range nor requested limit. Change-Id: Ic811e570c889be87bac4293547d6537a255078da	2021-06-01 09:36:55 +00:00
Egon Elbre	910eec8eee	satellite/metainfo: remove MetabaseDB interface Currently the interface is not useful. When we need to vary the implementation for testing purposes we can introduce a local interface for the service/chore that needs it, rather than using the large api. Unfortunately, this requires adding a cleanup callback for tests, there might be a better solution to this problem. Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682	2021-05-13 13:22:14 +00:00
Egon Elbre	69b149a66f	mod: bump uplink uplink stopped using zap, hence some of the private methods needed to be changed. Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5	2021-05-06 14:48:36 +00:00
Cameron Ayer	bb343d9028	satellite/satellitedb: don't remove offline nodes from containment When audits are being recorded, we automatically add some SQL to remove the node from the pending audits table in case it exists. They are removed from pending audits even if the node was offline for the audit. This is not the correct behavior. Add statement to record audit results in reverify tests to ensure no more false positives. Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17	2021-05-03 16:05:55 +00:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Egon Elbre	4c9ed64f75	satellite/metabase/metaloop: move loop under metabase Currently the loop handling is heavily related to the metabase rather than metainfo. metainfo over time has become related to the "public API" for accessing the metabase data. Currently updates monkit.lock, because monkit monitoring does not handle ScopeNamed correctly. Needs a followup change to monitoring check. Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584	2021-04-22 12:58:09 +03:00
Egon Elbre	267506bb20	satellite/metabase: move package one level higher metabase has become a central concept and it's more suitable for it to be directly nested under satellite rather than being part of metainfo. metainfo is going to be the "endpoint" logic for handling requests. Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198	2021-04-21 15:54:22 +03:00
Fadila Khadar	bde367ae73	satellite/gc: check on bloom filter creation date Check that the bloom filter creation date is earlier than the metainfo loop system time used for db scanning. Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4	2021-04-14 16:40:37 +00:00
Egon Elbre	f19ef4afe5	satellite/metainfo/metaloop: move loop to a separate package Change-Id: I94c931a27c1af6062185ec62688624ec02050f11	2021-03-23 15:37:34 +00:00
Michał Niewrzał	237782813b	Merge remote-tracking branch 'origin/multipart-upload' Change-Id: If6c5a450b238adab55d1e0dea67d01e5f5768a9f	2021-03-23 09:44:49 +01:00
Michał Niewrzał	27ae0d1f15	satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method At some point we might try to change the original segment RS values and set Pieces according to the new values. This change adds a NewRedundancy parameter to the UpdateSegmentPieces method to make that possible. As part of the change, NewPieces are validated against NewRedundancy. Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576	2021-03-22 08:12:56 +00:00
Cameron Ayer	a04495713d	satellite/audit: add missing logs for audit failure conditions Among other conditions, nodes fail audits by returning incorrect data and by reaching the max reverify count, but we weren't logging these events. This commit adds the missing logs. Change-Id: I80749a7e95e8cb97bc8dd7dac1e523e223114b7f	2021-03-18 17:33:11 +00:00
Michał Niewrzał	67e26aafcd	Merge remote-tracking branch 'origin/main' into multipart-upload Change-Id: I9b183323cb470185be22f7c648bb76917d2e6fca	2021-03-10 08:53:38 +01:00
Cameron Ayer	a44974a2f9	satellite/audit: fix pointless containment deletions Previously if node was not found in containment, it was given the status, 'skipped'. We later try to delete skipped nodes from containment. To fix this, add a new status called 'remove' to differentiate nodes which should be skipped and nodes which should be deleted. Change-Id: Ic09e62dc9723c89d0c9f968ce68c039114a9d74e	2021-03-02 13:40:18 -05:00
Kaloyan Raev	9aa61245d0	satellite/audits: migrate to metabase Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2	2020-12-17 14:38:48 +02:00
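
Commit ee4361fe0d in the log above describes how truncating integer division can skip a short final stripe. The sketch below illustrates that arithmetic only; the function names and the sizes used are hypothetical and are not taken from the storj codebase, where the fix uses `Redundancy.StripeCount`.

```go
package main

import "fmt"

// stripeCountTruncated mirrors the calculation described as buggy:
// plain integer division drops a final, partially filled stripe.
// (Hypothetical helper, for illustration only.)
func stripeCountTruncated(encryptedSize, stripeSize int32) int32 {
	return encryptedSize / stripeSize
}

// stripeCountCeil rounds up, so a short last stripe is still counted.
// (Hypothetical helper; assumes stripeSize > 0 and encryptedSize >= 0.)
func stripeCountCeil(encryptedSize, stripeSize int32) int32 {
	return (encryptedSize + stripeSize - 1) / stripeSize
}

func main() {
	// 1000 bytes with 256-byte stripes: three full stripes plus 232 bytes.
	fmt.Println(stripeCountTruncated(1000, 256)) // 3
	fmt.Println(stripeCountCeil(1000, 256))      // 4
}
```

With the truncated count, a random stripe index drawn from the range [0, 3) could never select the final partial stripe, so it would never be audited.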

160 Commits