storj

Author	SHA1	Message	Date
Michał Niewrzał	1ed5db1467	satellite/metainfo: simplifying limits code It's a very simple change to reduce code duplication. Change-Id: Ia135232e3aefd094f76c6988e82e297be028e174	2021-09-28 06:22:13 +00:00
Yaroslav Vorobiov	469ae72c19	satellite/repair: update audit records during repair Change-Id: I788b2096968f043601aba6502a2e4e784f1f02a0	2021-09-24 00:48:13 +00:00
Yingrong Zhao	0b500a30e4	satellite/audit: move audit metrics out of reporter Since we are sharing the reporting logic between repair and audit. We need to remove metric reporting logic in reporter. Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21	2021-09-16 17:58:56 +00:00
Egon Elbre	1aec831d98	satellite/audit,storage: increase sleep delay in TestMaxVerifyCount Currently TestMaxVerifyCount flakes in some tests, try increasing the sleep time to ensure that things are slow enough to trigger the error condition. Also pass ctx to all the funcs so we can handle sleep better. Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00	2021-09-10 15:30:37 +00:00
Michał Niewrzał	c258f4bbac	private/testplanet: move Metabase outside Metainfo for satellite At some point we moved metabase package outside Metainfo but we didn't do that for satellite structure. This change refactors only tests. When uplink will be adjusted we can remove old entries in Metainfo struct. Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb	2021-09-09 07:15:51 +00:00
Yaroslav Vorobiov	ee4361fe0d	satellite/audit: fix segment stripes length calculation GetRandomStripe function to randomly select a segment stripe to audit was using `segment.EncryptedSize/segment.Redundancy.StripeSize()`. Since integer division truncates, it leads to skipping the last stripe if its size is less than the stripe size. Use `Redundancy.StripeCount` to get the correct stripe count. Change-Id: Ida09e035be30a21219ab3e1aedd66af8be707d1b	2021-09-01 13:25:20 +03:00
Yingrong Zhao	b64d8084e1	satellite/audit: fix metric reporting when fail to complete an audit Change-Id: I39df8d4291db35afbba824281cb23438a91c45db	2021-08-31 17:02:30 +00:00
Cameron Ayer	28cb690618	satellite/audit: log error and increment metric if shares cannot be verified If we encounter an error during the infectious error correction, we just add it to the errlist to be logged at the worker level. We want to make sure we know about this if it happens. Give it its own error log and increment a monkit metric. Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27	2021-08-27 15:28:16 +00:00
Cameron Ayer	24e02b6352	satellite/{audit,orders}: if not enough nodes for audit order limits, increment metric and wrap error with ErrNotEnoughShares Increment a metric so we can get alerts. Wrap the error so we can search the logs for it. Change-Id: I3827aa306c431009828014d9d9afff8dfc057ee6	2021-08-26 20:14:05 +00:00
Cameron Ayer	5a1a29a62e	satellite/audit: fix containment bug where nodes not removed When a node gets enough timeouts, it is supposed to be removed from pending_audits and get an audit failure. We would give them a failure, but we missed the removal. This change fixes it. Change-Id: I2f7014e28d7d9b01a9d051f5bbb4f67c86c7b36b	2021-08-20 14:48:27 +00:00
Cameron Ayer	70296c5050	satellite/audit: change wording of audit worker error log "audit failed" is already used when a node fails an audit. That makes searching for this higher level audit worker error more difficult. Additionally, the presence of errors from the audit worker doesn't necessarily mean the audit failed. Reword the error message to "error(s) during audit" Change-Id: I0aab12c73c18d4bd962c5d8ac8a17cabcec022e6	2021-08-20 13:27:16 +00:00
Cameron Ayer	a8f125c671	satellite/{audit,repair}: log additional info when we can't download enough pieces When we can't complete an audit or repair, we need more information about what happened during each individual share/piece download. In audit, add the number of offline, unknown, contained, failed nodes to the error log. In repair, combine the errors from each download and add them to the error log. Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba	2021-08-09 22:57:49 +00:00
Yingrong Zhao	58238d850c	satellite/{audit, accounting}: use reputation store in tests Change-Id: I86a8ccf5dcee8d108196a9f67a476fe0ccbd8257	2021-07-28 13:21:55 -04:00
Yingrong Zhao	6c7bf357cd	satellite/{reputation,audit,overlay}: replace overlay with reputation package in audit This PR implements a reputation store and replaces overlay in the audit service with that store for storing a node's audit stats. In order to keep the changeset smaller, most of the changes in this PR are for copying audit logic in overlay to the reputation package. In a following PR, the duplicated code will be removed from overlay. Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1	2021-07-28 13:10:48 -04:00
Cameron Ayer	adc0fbddfa	satellite/audit: don't fail nodes for audit if not enough pieces downloaded In most situations where we would not get enough shares to complete an audit, something has probably gone wrong like a forgotten delete, and nodes should not be failed. We have an alert when this occurs. Check the logs to see what happened. If we decide the nodes should get audit failures, we can do it manually. Change-Id: Ib6e408082048d31197c37ebfd7f9031135fc938f	2021-07-20 20:28:18 +00:00
Michał Niewrzał	70e6cdfd06	satellite/audit: move to segmentloop Change-Id: I10e63a1e4b6b62f5cd3098f5922ad3de1ec5af51	2021-06-28 11:32:00 +00:00
Michał Niewrzał	8ce619706b	satellite/audit: migrate to new segment_pending_audit table Currently, pending audit finds a segment by segment location (path). Because we want to move audit to segmentloop, where we will have only StreamID and Position, we need to add columns for those fields. Altering the existing table can cause issues during migration and deployment. The cleaner choice is to make a new table. This change contains a migration with a new segment_pending_audit table that will replace the pending_audits table, and adjustments to use the new table in the code. Table pending_audits will be dropped with the next release. Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f	2021-06-28 13:19:49 +02:00
JT Olio	6949dc0bac	satellite/metaloop: missing monitoring on observers Change-Id: I630fbb0448c8d08b426486b3e49abfbca03332a6	2021-06-15 13:39:13 +00:00
Jeff Wendling	d674bc9c52	satellite/audit: include failing segment info in logs Change-Id: I972fe19a2479f48bccc8a87a282467345a9dc1ec	2021-06-10 13:47:22 +03:00
Jeff Wendling	944bceabcd	satellite/audit: fix reservoir sampling bias Change-Id: Icc522fd86538b8182a1b7d42c1588c32a257acaf	2021-06-10 13:47:22 +03:00
JT Olio	da9ca0c650	testplanet/satellite: reduce the number of places default values need to be configured Satellites set their configuration values to default values using cfgstruct, however, it turns out our tests don't test these values at all! Instead, they have a completely separate definition system that is easy to forget about. As is to be expected, these values have drifted, and it appears in a few cases test planet is testing unreasonable values that we won't see in production, or perhaps worse, features enabled in production were missed and weren't enabled in testplanet. This change makes it so all values are configured the same, systematic way, so it's easy to see when test values are different than dev values or release values, and it's harder to forget to enable features in testplanet. In terms of reviewing, this change should actually be fairly easy to review, considering private/testplanet/satellite.go keeps the current config system and the new one and confirms that they result in identical configurations, so you can be certain that nothing was missed and the config is all correct. You can also check the config lock to see what actual config values changed. Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e	2021-06-01 22:14:17 +00:00
Cameron Ayer	53322bb0a7	satellite/{audit,satellitedb}: release nodes from containment in Reverify rather than (Batch)UpdateStats Until now, whenever audits were recorded we would try to delete the node from containment just in case it exists. Since we now want to treat segment repair downloads as audits, this would erroneously remove nodes from containment, as repair does not go through a Reverify step. With this changeset, (Batch)UpdateStats will not remove nodes from containment. The Reverify method will remove all necessary nodes from containment. Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f	2021-06-01 21:02:44 +00:00
Egon Elbre	10a0216af5	satellite/metainfo: use range for specifying download limit Previously the object range was not used for calculating order limit. This meant that even if you were downloading only a small range it would account bandwidth based on the full segment. This doesn't fully address the accounting since the lazy segment downloads do not send their requested range nor requested limit. Change-Id: Ic811e570c889be87bac4293547d6537a255078da	2021-06-01 09:36:55 +00:00
Egon Elbre	910eec8eee	satellite/metainfo: remove MetabaseDB interface Currently the interface is not useful. When we need to vary the implementation for testing purposes we can introduce a local interface for the service/chore that needs it, rather than using the large api. Unfortunately, this requires adding a cleanup callback for tests, there might be a better solution to this problem. Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682	2021-05-13 13:22:14 +00:00
Egon Elbre	69b149a66f	mod: bump uplink uplink stopped using zap, hence some of the private methods needed to be changed. Change-Id: Iac1fae45a40cd3f1649b9f672bf8c250344986d5	2021-05-06 14:48:36 +00:00
Cameron Ayer	bb343d9028	satellite/satellitedb: don't remove offline nodes from containment When audits are being recorded, we automatically add some SQL to remove the node from the pending audits table in case it exists. They are removed from pending audits even if the node was offline for the audit. This is not the correct behavior. Add statement to record audit results in reverify tests to ensure no more false positives. Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17	2021-05-03 16:05:55 +00:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Egon Elbre	4c9ed64f75	satellite/metabase/metaloop: move loop under metabase Currently the loop handling is heavily related to the metabase rather than metainfo. metainfo over time has become related to the "public API" for accessing the metabase data. Currently updates monkit.lock, because monkit monitoring does not handle ScopeNamed correctly. Needs a followup change to monitoring check. Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584	2021-04-22 12:58:09 +03:00
Egon Elbre	267506bb20	satellite/metabase: move package one level higher metabase has become a central concept and it's more suitable for it to be directly nested under satellite rather than being part of metainfo. metainfo is going to be the "endpoint" logic for handling requests. Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198	2021-04-21 15:54:22 +03:00
Fadila Khadar	bde367ae73	satellite/gc: check on bloom filter creation date Check that the bloom filter creation date is earlier than the metainfo loop system time used for db scanning. Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4	2021-04-14 16:40:37 +00:00
Egon Elbre	f19ef4afe5	satellite/metainfo/metaloop: move loop to a separate package Change-Id: I94c931a27c1af6062185ec62688624ec02050f11	2021-03-23 15:37:34 +00:00
Michał Niewrzał	237782813b	Merge remote-tracking branch 'origin/multipart-upload' Change-Id: If6c5a450b238adab55d1e0dea67d01e5f5768a9f	2021-03-23 09:44:49 +01:00
Michał Niewrzał	27ae0d1f15	satellite/metainfo/metabase: add NewRedundancy parameter for UpdateSegmentPieces method At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds a NewRedundancy parameter to the UpdateSegmentPieces method to make that possible. As part of the change, NewPieces are validated against NewRedundancy. Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576	2021-03-22 08:12:56 +00:00
Cameron Ayer	a04495713d	satellite/audit: add missing logs for audit failure conditions Among other conditions, nodes fail audits by returning incorrect data and by reaching the max reverify count, but we weren't logging these events. This commit adds the missing logs. Change-Id: I80749a7e95e8cb97bc8dd7dac1e523e223114b7f	2021-03-18 17:33:11 +00:00
Michał Niewrzał	67e26aafcd	Merge remote-tracking branch 'origin/main' into multipart-upload Change-Id: I9b183323cb470185be22f7c648bb76917d2e6fca	2021-03-10 08:53:38 +01:00
Cameron Ayer	a44974a2f9	satellite/audit: fix pointless containment deletions Previously if node was not found in containment, it was given the status, 'skipped'. We later try to delete skipped nodes from containment. To fix this, add a new status called 'remove' to differentiate nodes which should be skipped and nodes which should be deleted. Change-Id: Ic09e62dc9723c89d0c9f968ce68c039114a9d74e	2021-03-02 13:40:18 -05:00
Kaloyan Raev	9aa61245d0	satellite/audits: migrate to metabase Change-Id: I480c941820c5b0bd3af0539d92b548189211acb2	2020-12-17 14:38:48 +02:00
Egon Elbre	12055e7864	all: minor cleanups Change-Id: I4248dbe36a62a223b06135254b32851485a2eec1	2020-12-16 10:47:46 +00:00
Stefan Benten	494bd5db81	all: golangci-lint v1.33.0 fixes (#3985)	2020-12-05 17:01:42 +01:00
Kaloyan Raev	53b7fd7b00	satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified We now have the piece hashes verified for all segments on all production satellites. We can remove the code that handles the case where piece hashes are not verified. This would make easier the migration of services from PointerDB to the new metabase. For consistency, PieceHashesVerified is still set to true in PointerDB for new segments. Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183	2020-11-24 11:09:48 +02:00
Moby von Briesen	db6bc6503d	satellite/metainfo: Update metainfo RS config to more easily support multiple RS schemes. Make metainfo.RSConfig a valid pflag config value. This allows us to configure the RSConfig as a string like k/m/o/n-shareSize, which makes having multiple supported RS schemes easier in the future. RS-related config values that are no longer needed have been removed (MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify). Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1	2020-11-09 22:16:13 +00:00
Cameron Ayer	dc67ce74c9	satellite: remove IsUp field from overlay.UpdateRequest With the new overlay.AuditOutcome type for offline audits, the IsUp field is redundant. If AuditOutcome != AuditOffline, then the node is online. In addition to removing the field itself, other changes needed to be made regarding the relationship between 'uptime' and 'audits'. Previously, uptime and audit outcome were completely separated. For example, it was possible to update a node's stats to give it a successful/failed/unknown audit while simultaneously indicating that the node was offline by setting IsUp to false. This is no longer possible under this changeset. Some tests which did this have been changed slightly in order to pass. Also add new benchmarks for UpdateStats and BatchUpdateStats with different audit outcomes. Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b	2020-11-02 15:34:17 -05:00
Kaloyan Raev	92a2be2abd	satellite/metainfo: get away from using pb.Pointer in Metainfo Loop As part of the Metainfo Refactoring, we need to make the Metainfo Loop working with both the current PointerDB and the new Metabase. Thus, the Metainfo Loop should pass to the Observer interface more specific Object and Segment types instead of pb.Pointer. After this change, there are still a couple of use cases that require access to the pb.Pointer (hence we have it as a field in the metainfo.Segment type): 1. Expired Deletion Service 2. Repair Service It would require additional refactoring in these two services before we are able to clean this. Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f	2020-10-27 13:06:47 +00:00
Cameron Ayer	bb7be23115	satellite/{audit,overlay,satellitedb}: enable reporting offline audits - Remove flag for switching off offline audit reporting. - Change the overlay method used from UpdateUptime to BatchUpdateStats, as this is where the new online scoring is done. - Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same method to record offline audits as success, failure, and unknown, we need to distinguish offline audits from the rest. Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355	2020-10-27 10:44:46 +00:00
Kaloyan Raev	1f386db566	cmd/satellite: remove metainfo commands (#3955)	2020-10-22 13:33:09 +03:00
Kaloyan Raev	1aeb14e65e	satellite/audit: do not delete expired segments A year ago we made the audit service delete expired segments. Meanwhile, we introduced an expired deletion sub-service in the metainfo service whose sole purpose is deleting expired segments. Therefore, we are now removing this responsibility from the audit service. It will continue to avoid reporting failures on expired segments, but it will not delete them anymore. We do this to clean up responsibilities in advance of the metainfo refactoring. Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4	2020-10-22 08:24:16 +00:00
paul cannon	360ab17869	satellite/audit: use LastIPAndPort preferentially This preserves the last_ip_and_port field from node lookups through CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later calls to (Verifier).GetShare() can try to use that IP and port. If a connection to the given IP and port cannot be made, or the connection cannot be verified and secured with the target node identity, an attempt is made to connect to the original node address instead. A similar change is not necessary to the other CreateOrderLimits functions, because they already replace node addresses with the cached IP and port as appropriate. We might want to consider making a similar change to CreateGetRepairOrderLimits(), though. The audit situation is unique because the ramifications are especially powerful when we get the address wrong. Failing a single audit can have a heavy cost to a storage node. We need to make extra effort in order to avoid imposing that cost unfairly. Situation 1: If an audit fails because the repair worker failed to make a DNS query (which might well be the fault on the satellite side), and we have last_ip_and_port information available for the target node, it would be unfair not to try connecting to that last_ip_and_port address. Situation 2: If a node has changed addresses recently and the operator correctly changed its DNS entry, but we don't bother querying DNS, it would be unfair to penalize the node for our failure to connect to it. So the audit worker must try both last_ip_and_port _and_ the node address as supplied by the SNO. We elect here to try last_ip_and_port first, on the grounds that (a) it is expected to work in the large majority of cases, (b) there should not be any security concerns with connecting to an out-of-date address, and (c) avoiding DNS queries on the satellite side helps alleviate satellite operational load. Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200	2020-10-21 13:34:40 +00:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Kaloyan Raev	e7f2ec7ddf	satellite/audit: fix sanity check for verify-piece-hashes command The VerifyPieceHashes method has a sanity check for the number pieces to be removed from the pointer after the audit for verifying the piece hashes. This sanity check failed when we executed the command on the production satellites because the Verify command removes Fails and PendingAudits nodes from the audit report if piece_hashes_verified = false. A new temporary UsedToVerifyPieceHashes flag is added to audits.Verifier. It is set to true only by the verify-piece-hashes command. If the flag is true then the Verify method will always include Fails and PendingAudits nodes in the report. Test case is added to cover this use case. Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8	2020-10-07 17:17:48 +03:00
Yingrong Zhao	c085a17a52	bump common and uplink to latest Change-Id: I717f0214dd9973acd51b7732c5d64587f610c805	2020-10-01 15:38:58 +00:00
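
Note on commit ee4361fe0d (stripe count): the message says that dividing the encrypted size by the stripe size truncates and so drops a final, partial stripe. The sketch below only illustrates that arithmetic. The function names and the sample sizes are hypothetical, and rounding up is an assumption about what a correct count must do, not a copy of `Redundancy.StripeCount`.

```go
package main

import "fmt"

// stripeCountTruncated mirrors the calculation the commit describes as buggy:
// plain integer division, which discards a trailing partial stripe.
// (Hypothetical helper, not the storj implementation.)
func stripeCountTruncated(encryptedSize, stripeSize int32) int32 {
	return encryptedSize / stripeSize
}

// stripeCountCeil rounds up, so a trailing partial stripe is counted too.
// (Hypothetical helper; assumes stripeSize > 0.)
func stripeCountCeil(encryptedSize, stripeSize int32) int32 {
	return (encryptedSize + stripeSize - 1) / stripeSize
}

func main() {
	// Illustrative sizes only: 1000 bytes split into 256-byte stripes
	// leaves a final stripe of 232 bytes.
	const encryptedSize, stripeSize = 1000, 256

	fmt.Println("truncated:", stripeCountTruncated(encryptedSize, stripeSize)) // 3
	fmt.Println("rounded up:", stripeCountCeil(encryptedSize, stripeSize))     // 4
}
```

With the truncated count, a random index drawn from `[0, count)` can never select the last stripe, which is the skipped-stripe behaviour the commit message describes.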


147 Commits
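
Note on commit db6bc6503d (RS config as a string): the message says an RS scheme can be configured as a string of the form k/m/o/n-shareSize. A minimal sketch of parsing such a string follows. The type `rsScheme`, the function `parseRSScheme`, the sample value, and the choice to read the share size as a plain integer are all assumptions made for illustration; the real `metainfo.RSConfig` may name its fields differently and accept other size notations.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rsScheme holds the five numbers of a "k/m/o/n-shareSize" string.
// Field names follow the letters used in the commit message (hypothetical type).
type rsScheme struct {
	K, M, O, N int
	ShareSize  int // assumed here to be a plain integer
}

// parseRSScheme splits the string on "-" and "/" and converts each part.
func parseRSScheme(s string) (rsScheme, error) {
	halves := strings.SplitN(s, "-", 2)
	if len(halves) != 2 {
		return rsScheme{}, fmt.Errorf("rs scheme %q: missing \"-shareSize\" suffix", s)
	}
	fields := strings.Split(halves[0], "/")
	if len(fields) != 4 {
		return rsScheme{}, fmt.Errorf("rs scheme %q: want 4 slash-separated counts, got %d", s, len(fields))
	}
	var nums [4]int
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return rsScheme{}, fmt.Errorf("rs scheme %q: %w", s, err)
		}
		nums[i] = n
	}
	size, err := strconv.Atoi(halves[1])
	if err != nil {
		return rsScheme{}, fmt.Errorf("rs scheme %q: %w", s, err)
	}
	return rsScheme{K: nums[0], M: nums[1], O: nums[2], N: nums[3], ShareSize: size}, nil
}

func main() {
	// The value below is illustrative, not a documented default.
	scheme, err := parseRSScheme("29/35/80/110-256")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", scheme)
}
```

Keeping the whole scheme in one string means a future list of supported schemes could be expressed as several such strings, which is the motivation the commit message gives.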