storj

Author	SHA1	Message	Date
Kaloyan Raev	53b7fd7b00	satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified We now have the piece hashes verified for all segments on all production satellites. We can remove the code that handles the case where piece hashes are not verified. This would make easier the migration of services from PointerDB to the new metabase. For consistency, PieceHashesVerified is still set to true in PointerDB for new segments. Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183	2020-11-24 11:09:48 +02:00
Moby von Briesen	db6bc6503d	satellite/metainfo: Update metainfo RS config to more easily support multiple RS schemes. Make metainfo.RSConfig a valid pflag config value. This allows us to configure the RSConfig as a string like k/m/o/n-shareSize, which makes having multiple supported RS schemes easier in the future. RS-related config values that are no longer needed have been removed (MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify). Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1	2020-11-09 22:16:13 +00:00
Cameron Ayer	dc67ce74c9	satellite: remove IsUp field from overlay.UpdateRequest With the new overlay.AuditOutcome type for offline audits, the IsUp field is redundant. If AuditOutcome != AuditOffline, then the node is online. In addition to removing the field itself, other changes needed to be made regarding the relationship between 'uptime' and 'audits'. Previously, uptime and audit outcome were completely separated. For example, it was possible to update a node's stats to give it a successful/failed/unknown audit while simultaneously indicating that the node was offline by setting IsUp to false. This is no longer possible under this changeset. Some test which did this have been changed slightly in order to pass. Also add new benchmarks for UpdateStats and BatchUpdateStats with different audit outcomes. Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b	2020-11-02 15:34:17 -05:00
Kaloyan Raev	92a2be2abd	satellite/metainfo: get away from using pb.Pointer in Metainfo Loop As part of the Metainfo Refactoring, we need to make the Metainfo Loop working with both the current PointerDB and the new Metabase. Thus, the Metainfo Loop should pass to the Observer interface more specific Object and Segment types instead of pb.Pointer. After this change, there are still a couple of use cases that require access to the pb.Pointer (hence we have it as a field in the metainfo.Segment type): 1. Expired Deletion Service 2. Repair Service It would require additional refactoring in these two services before we are able to clean this. Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f	2020-10-27 13:06:47 +00:00
Cameron Ayer	bb7be23115	satellite/{audit,overlay,satellitedb}: enable reporting offline audits - Remove flag for switching off offline audit reporting. - Change the overlay method used from UpdateUptime to BatchUpdateStats, as this is where the new online scoring is done. - Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same method to record offline audits as success, failure, and unknown, we need to distinguish offline audits from the rest. Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355	2020-10-27 10:44:46 +00:00
Kaloyan Raev	1f386db566	cmd/satellite: remove metainfo commands (#3955 )	2020-10-22 13:33:09 +03:00
Kaloyan Raev	1aeb14e65e	satellite/audit: do not delete expired segments A year ago we made the audit service deleting expired segments. Meanwhile, we introduced an expired deletetion sub-service in the metainfo service which sole purpose is deleting expired segments. Therefore, now we are removing this responsibility from the audit service. It will continue to avoid reporting failures on expired segments, but it would not delete them anymore. We do this to cleanup responsibilities in advance of the metainfo refactoring. Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4	2020-10-22 08:24:16 +00:00
paul cannon	360ab17869	satellite/audit: use LastIPAndPort preferentially This preserves the last_ip_and_port field from node lookups through CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later calls to (Verifier).GetShare() can try to use that IP and port. If a connection to the given IP and port cannot be made, or the connection cannot be verified and secured with the target node identity, an attempt is made to connect to the original node address instead. A similar change is not necessary to the other CreateOrderLimits functions, because they already replace node addresses with the cached IP and port as appropriate. We might want to consider making a similar change to CreateGetRepairOrderLimits(), though. The audit situation is unique because the ramifications are especially powerful when we get the address wrong. Failing a single audit can have a heavy cost to a storage node. We need to make extra effort in order to avoid imposing that cost unfairly. Situation 1: If an audit fails because the repair worker failed to make a DNS query (which might well be the fault on the satellite side), and we have last_ip_and_port information available for the target node, it would be unfair not to try connecting to that last_ip_and_port address. Situation 2: If a node has changed addresses recently and the operator correctly changed its DNS entry, but we don't bother querying DNS, it would be unfair to penalize the node for our failure to connect to it. So the audit worker must try both last_ip_and_port _and_ the node address as supplied by the SNO. We elect here to try last_ip_and_port first, on the grounds that (a) it is expected to work in the large majority of cases, and (b) there should not be any security concerns with connecting to an out-or-date address, and (c) avoiding DNS queries on the satellite side helps alleviate satellite operational load. Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200	2020-10-21 13:34:40 +00:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Kaloyan Raev	e7f2ec7ddf	satellite/audit: fix sanity check for verify-piece-hashes command The VerifyPieceHashes method has a sanity check for the number pieces to be removed from the pointer after the audit for verifying the piece hashes. This sanity check failed when we executed the command on the production satellites because the Verify command removes Fails and PendingAudits nodes from the audit report if piece_hashes_verified = false. A new temporary UsedToVerifyPieceHashes flag is added to audits.Verifier. It is set to true only by the verify-piece-hashes command. If the flag is true then the Verify method will always include Fails and PendingAudits nodes in the report. Test case is added to cover this use case. Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8	2020-10-07 17:17:48 +03:00
Yingrong Zhao	c085a17a52	bump common and uplink to latest Change-Id: I717f0214dd9973acd51b7732c5d64587f610c805	2020-10-01 15:38:58 +00:00
Kaloyan Raev	b409b53f7f	cmd/satellite: command for verifying piece hashes Jira: https://storjlabs.atlassian.net/browse/PG-69 There are a number of segments with piece_hashes_verified = false in their metadata) on US-Central-1, Europe-West-1, and Asia-East-1 satellites. Most probably, this happened due to a bug we had in the past. We want to verify them before executing the main migration to metabase. This would simplify the main migration to metabase with one less issue to think about. Change-Id: I8831af1a254c560d45bb87d7104e49abd8242236	2020-09-29 10:58:24 +00:00
Michal Niewrzal	9202295348	satellite/metainfo: replace ScopedPath with metabase.SegmentLocation Change-Id: I7e89c9e8eaeae58be828a32ad47ed3028501f4c7	2020-09-04 10:06:52 +00:00
Michal Niewrzal	aa47e70f03	satellite/metainfo: use metabase.SegmentKey with metainfo.Service Instead of using string or []byte we will be using dedicated type SegmentKey. Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220	2020-09-03 15:11:32 +00:00
Moby von Briesen	5d21e85529	satellite/audit/queue: Separate audit queue into two separate structs. * The audit worker wants to get items from the queue and process them. * The audit chore wants to create new queues and swap them in when the old queue has been processed. This change adds a "Queues" struct which handles the concurrency issues around the worker fetching a queue and the chore swapping a new queue in. It simplifies the logic of the "Queue" struct to its bare bones, so that it behaves like a normal queue with no need to understand the details of swapping and worker/chore interactions. Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b	2020-08-31 20:51:25 +00:00
Egon Elbre	c86c732fc0	satellite: simplify tests satellite.DB.Console().Projects().GetAll database query can be replaced with planet.Uplinks[0].Projects[0].ID Change-Id: I73b82b91afb2dde7b690917345b798f9d81f6831	2020-08-28 22:28:04 +00:00
Egon Elbre	3ca405aa97	satellite/orders: use metabase types as arguments Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef	2020-08-28 15:52:37 +03:00
Moby von Briesen	4f28bf0720	satellite/audit: Do not return errors from Verify or Reverify on segment modified, expired, or deleted If a segment is deleted, is modified, or expires during an audit, this is not problematic, so we should not return errors. Functionally, nothing changes, but our metrics around audit success rate will be improved after this change. Change-Id: Ic11df056b2c73894b67a55894bd4d58c00470606	2020-08-26 13:24:00 +00:00
Qweder93	01bb2bd17d	satellite/audit: verifier checks if node made sucess GE before auditing Change-Id: Ia6cde4e9fcf11020a5301d38065f7159f276eb80	2020-08-17 23:37:57 +03:00
Egon Elbre	94a09ce20b	all: add missing dots Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a	2020-08-11 17:50:01 +03:00
Moby von Briesen	76030a8237	satellite/audit/{queue,chore}: Wait for audit queue to be finished before swapping * Do not swap the active audit queue with the pending audit queue until the active audit queue is empty. * Do not begin creating a new pending audit queue until the existing pending audit queue has been swapped to the active queue. Change-Id: I81db5bfa01458edb8cdbe71f5baeebdcb1b94317	2020-07-28 16:56:26 +00:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
Cameron Ayer	cadb435d25	{satellite/audit, private/testplanet}: remove ErrAlreadyExists, run 2 audit workers in testplanet Since we increased the number of concurrent audit workers to two, there are going to be instances of a single node being audited simultaneously for different segments. If the node times out for both, we will try to write them both to the pending audits table, and the second will return an error since the path is not the same as what already exists. Since with concurrent workers this is expected, we will log the occurrence rather than return an error. Since the release default audit concurrency is 2, update testplanet default to run with concurrent workers as well. Change-Id: I4e657693fa3e825713a219af3835ae287bb062cb	2020-06-30 18:00:07 +00:00
Cameron Ayer	3b4b5f45c7	satellite: replace references to Suspended with UnknownAuditSuspended Change-Id: I3d2d00c95954c0546ad077702617895f262926ef	2020-06-23 14:19:22 +00:00
Egon Elbre	410d897840	satellite: fix string(int) conversions Change-Id: I54c6ca8c2dad3c321175f72271b7536cc2a4df09	2020-06-12 06:41:34 +00:00
Cameron Ayer	26fce54b11	satellite/audit: increase MinDownloadTimeout in TestReverifySlowDownload This test failed due to a timeout on a download which is supposed to succeed. The testplanet default for the value is 5 seconds, but here it is 500 milliseconds. It looks like this is due to the fact that later in the test we need to wait for a slow node to timeout, so we cut the timeout shorter to reduce test time. This PR increases the timeout to 1 second. Still not too long to wait, but gives us twice as much time to download, decreasing the likelihood that we see the timeout error. Change-Id: I504db39ab5dc4d3c505520337b258265d6da7020	2020-06-10 17:43:50 +00:00
Jeff Wendling	943eb872d3	satellite/audit: depend less on details of some error message the error message changed when we removed spacemonkeygo/errors Change-Id: I4904a072cfd84e4c39c881b58669325bcf51df46	2020-06-05 10:39:05 -06:00
Michal Niewrzal	84892631c8	private/testplanet: remove old libuplink from testplanet Change-Id: Ib1553f84d0b3ae12a5b00382f0f53357b6a273e2	2020-05-28 13:50:23 +00:00
Jennifer Johnson	03e5f922c3	satellite/overlay: updates node with a vetted_at timestamp if they meet the vetting criteria What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count are greater than the configured thresholds), we set vetted_at to the current timestamp. Why: We may want to use this timestamp in future development to select new vs vetted nodes. It also allows flexibility in node vetting experiments and allows for better metrics around vetting times. Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and commits another variable to the db (vetted_at), but this should be negligible. Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d	2020-05-20 16:30:26 -04:00
Egon Elbre	ed627144ed	all: use DialNodeURL throughout the codebase Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c	2020-05-20 10:36:30 +00:00
Egon Elbre	bcd93ee375	private/testplanet: add StopNodeAndUpdate This was commonly used and code with it can be simplified. Change-Id: I2f2b91f7de54269aee6ef027f97f9e8a7d222e39	2020-05-08 13:02:19 +00:00
Egon Elbre	678b859172	satellite/overlay: remove MinimumRequiredNodes In non-test code we were only using RequestedCount, not need to have MinimumRequiredNodes. Change-Id: I40736f4b028b41e94abfdeb221bce5aa86a5cb82	2020-05-07 15:41:23 +00:00
Egon Elbre	4e94da3fda	satellite/overlay: add feature flag for node selection cache Also distinguish the purpose for selecting nodes to avoid potential confusion, what should allow caching and what shouldn't. Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61	2020-05-06 16:13:47 +03:00
Jennifer Johnson	18078bf7ee	satellite/audit: increases audit worker concurrency to 2 Change-Id: Ibe3e3801b79accffbcfe9e2e02c96fc963894a7f	2020-05-05 11:31:55 +00:00
Moby von Briesen	8f60cfc4fb	satellite/overlay: Add flag for enabling/disabling disqualification from suspension mode Add a flag that allows us to easily switch disqualification from suspension mode on or off. A node will only be disqualified from suspension mode if it has been suspended for longer than the grace period AND the SuspensionDQEnabled flag is true. Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792	2020-05-04 17:25:09 +00:00
Isaac Hess	5a85e8d749	satellite/audit: Fix flaky TestVerifierSlowDownload TestVerifierSlowDownload would sometimes not have enough nodes finish in the allotted deadline period. This increases the deadline and also does not assert that exactly 3 have finished. Instead, in keeping with the purpose of the test, it asserts that the slow download is never counted as a success and is always counted as a pending audit in the final report. Change-Id: I180734fcc4a499420c75164bad6253ed155d87de	2020-04-30 15:58:47 -06:00
Moby von Briesen	de366537a8	satellite/satellitedb/overlaycache: fix behavior around gracefully exited nodes Sometimes nodes who have gracefully exited will still be holding pieces according to the satellite. This has some unintended side effects currently, such as nodes getting disqualified after having successfully exited. * When the audit reporter attempts to update node stats, do not update stats (alpha, beta, suspension, disqualification) if the node has finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated) * Treat gracefully exited nodes as "not reputable" so that the repairer and checker do not count them as healthy (overlay/statdb_test.go TestKnownUnreliableOrOffline, repair/repair_test.go TestRepairGracefullyExited) Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272	2020-04-28 23:58:43 +00:00
Jess G	825226c98e	satellite/overlay: use node selection cache for uploads (#3859 ) * satellite/overlay: use node selection cache for uploads Change-Id: Ibd16cccee979d0544f2f4a01749af9f36f02a6ad * fix config lock Change-Id: Idd307e4dee8ab92749f1ec3f996419ea0af829fd * start fixing tests Change-Id: I207d373a3b2a2d9312c9e72fe9bd0b01e06ad6cf * fix test, add some more Change-Id: I82b99c2004fca2510965f9b389f87dd4474bc722 * change config name Change-Id: I0c0f7fc726b2565dc3828cb723f5459a940f2a0b * add benchmarks Change-Id: I05fa25bff8d5b65f94d918556855b95163d002e9 * revert bench to put in different PR Change-Id: I0f6942296895594768f19614bd7b2e3b9b106ade * add staleness to benchmark Change-Id: Ia80a310623d5a342afa6d835402170b531b0f870 * add cache config to testplanet Change-Id: I39abdab8cc442694da543115a9e470b2a8a25dff * have repair select old way Change-Id: I25a938457d7d1bcf89fd15130cb6b0ac19585252 * lower testplante config time Change-Id: Ib56a2ed086c06bc6061388d15a10a2526a663af7 * fix test Change-Id: I3868e9cacde2dfbf9c407afab04dc5fc2f286f69	2020-04-24 09:11:04 -07:00
Moby von Briesen	72b93f3120	satellite/satellitedb: disqualify suspended nodes when the grace period passes If a node is suspended and receives an unknown or failing audit, disqualify them if the grace period (default 1w in production) has passed. Migrate the nodes table so any node that is currently suspended gets unsuspended when the satellite starts up. Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82	2020-04-22 15:45:00 -04:00
Moby von Briesen	d7794a4851	satellite/overlay: hardcode default values for audit alpha/beta Alpha=1 and beta=0 are the expected first values for any alpha/beta reputation system we are using in the codebase. So we are removing the configurability of these values. Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3	2020-04-14 19:12:40 +00:00
Cameron Ayer	02613407ae	satellite/satellitedb: only suspend node if not already suspended Whenever the node's reputation is updated, if its unknown audit reputation is below the suspension threshold, its suspension field is set to the current time. This could overwrite the previous "suspendedAt" value resulting a node that never reaches the end of its suspension. Also log whenever a node is disqualified or its suspension status changes Change-Id: I5e8c8f1c46f66d79cb279b5b16a84fe03f533deb	2020-04-10 09:37:37 +00:00
Egon Elbre	11a44cdd88	all: don't depend on gogo/proto directly Change-Id: I8822dea0d1b7b99e0b828e0373a0308a42dde2be	2020-04-08 17:32:15 +00:00
paul cannon	0c8c11b251	satellite/audit: add not_enough_shares_for_audit counter We have been using the SQL expression `name='(*Verifier).Verify' AND error_name='not enough shares for successful audit'` thus far to detect cases of this problem and alert on them. Unfortunately, since this rarely (hopefully never) happens, influxdb has no data for most of the auditor instances, and when it has no data for a time series, it returns no columns either. This makes Redash upset when it tries to perform a query for an alert and can't find the column whose value it expects to check. This change should make it so zero values are reported when the problem has not happened, and higher values when it has. Change-Id: I79e5e000f879678b661dac88caae1e2915b39ab1	2020-04-03 17:00:50 +00:00
littleskunk	23e5a0471f	satellite/audit: clean up logging (#3832 ) Co-authored-by: Ivan Fraixedes <ivan@fraixed.es>	2020-03-30 12:09:50 -06:00
Egon Elbre	cb781d66c7	satellite/overlay: optimize FindStorageNodes Reduce the number of fields returned from the query. Benchmark results in `satellite/overlay`: benchstat before.txt after2.txt name old time/op new time/op delta SelectStorageNodes-32 7.85ms ± 1% 6.27ms ± 1% -20.18% (p=0.002 n=10+4) SelectNewStorageNodes-32 8.21ms ± 1% 6.61ms ± 0% -19.53% (p=0.002 n=10+4) SelectStorageNodesExclusion-32 17.2ms ± 1% 15.9ms ± 1% -7.55% (p=0.002 n=10+4) SelectNewStorageNodesExclusion-32 17.8ms ± 2% 16.1ms ± 0% -9.38% (p=0.002 n=10+4) FindStorageNodes-32 48.4ms ± 1% 45.1ms ± 0% -6.69% (p=0.002 n=10+4) FindStorageNodesExclusion-32 79.2ms ± 1% 76.1ms ± 1% -3.89% (p=0.002 n=10+4) Benchmark results from `satellite/overlay` after making them parallel: benchstat before-parallel.txt after2-parallel.txt name old time/op new time/op delta SelectStorageNodes-32 548µs ± 1% 353µs ± 1% -35.60% (p=0.029 n=4+4) SelectNewStorageNodes-32 562µs ± 0% 368µs ± 0% -34.51% (p=0.029 n=4+4) SelectStorageNodesExclusion-32 1.02ms ± 1% 0.84ms ± 0% -18.08% (p=0.029 n=4+4) SelectNewStorageNodesExclusion-32 1.03ms ± 1% 0.86ms ± 2% -16.22% (p=0.029 n=4+4) FindStorageNodes-32 3.11ms ± 0% 2.79ms ± 1% -10.27% (p=0.029 n=4+4) FindStorageNodesExclusion-32 4.75ms ± 0% 4.43ms ± 1% -6.56% (p=0.029 n=4+4) Change-Id: I1d85e2764eb270f4c2b1998303ccfc1179d65b26	2020-03-30 18:36:23 +03:00
Egon Elbre	e1a443b04a	private/testplanet: allow modifying created database Instead of providing the database from outside to testplanet create it inside and then allow wrapping and modifying it. This is more convenient to use. Change-Id: I9b8f69e6e0a19ff984b4e2bfe927c9100c77bc6c	2020-03-27 19:14:48 +00:00
Egon Elbre	e8f18a2cfe	private/testplanet: expose storagenode and satellite Config Change-Id: I80fe7ed8ef7356948879afcc6ecb984c5d1a6b9d	2020-03-27 17:01:25 +02:00
Moby von Briesen	2f991b6c56	satellite/{overlay, satellitedb}: account for `suspended` field in overlay cache Make sure that suspended nodes are treated appropriately by the overlay cache. This means we should expect the following behavior: * suspended nodes (vetted or not) should not be selected for uploading new segments * suspended nodes should be treated by the checker and repairer as "unhealthy", and should be removed upon successful repair This commit also removes unused overlay functionality. Fixes a bug with commit `8b72181a1f` where the audit reporter was automatically suspending nodes regardless of audit outcome (see test added). Tests: * updates repair tests to ensure that a suspended node is treated as unhealthy and will be removed from the pointer on successful repair * updates overlay tests for KnownUnreliableOrOffline and KnownReliable to expect suspended nodes to be considered "unreliable" * adds satellitedb test that ensures overlay.SelectStorageNodes and overlay.SelectNewStorageNodes do not include suspended nodes * adds audit reporter test to ensure that different audit outcomes result in the correct suspended/disqualified states Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561	2020-03-17 17:14:56 +00:00
Moby von Briesen	8b72181a1f	satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension * change overlay.UpdateStats to allow a third audit outcome. Now it can handle successful, failed, and unknown audits. * when "unknown audit reputation" (unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the DQ threshold, put node into suspension. * when unknown audit reputation goes above the DQ threshold, remove node from suspension. * record unknown audits from audit reporter. * add basic tests around unknown audits and suspension. Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1	2020-03-16 20:29:26 +00:00
Bill Thorp	94c11c5212	satellite: remove some unnecessary UTC() calls Fixes some easy cases of extraneous UTC() calls Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a	2020-03-13 13:49:44 +00:00
Jess G	39cb821196	satellite/overlay: rm combinedcache, fix IP naming to be network (#3798 ) * rn combinedcache, rm dns node lookup Change-Id: I239f07211764b097d851230d8c81900a47756e9e * excludeIPs -> excludedNetworks Change-Id: Ifa6f44ab17457cdd5aff4cd5694296867c18b179 * use lowercase var name Change-Id: I825aad2b718c71f455e747be18f8cabd02aabe55 * update Getnetwork name Change-Id: I002a1b7bc6b4ef40159c0cd2b0ef209f80a9c503 * fix comments Change-Id: Ibddf5b9ffa9d685af6c392d893db063ef18e45fa * update comments with ipv6 Change-Id: I31758b7d4979e7c27d014668f4fb532ad838cda2 Co-authored-by: Stefan Benten <mail@stefan-benten.de>	2020-03-12 11:37:57 -07:00
Jennifer Johnson	0d60c1a4b2	satellite/audit: fix checkSegmentAltered to detect segments that have changed during an audit - Previously, checkSegmentAltered only checked for segments that were replaced but we want to detect all changes to a segment that occurred while an audit was being conducted. - Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified segments were not being removed from containment mode. - Filled in gaps in reverify testing to ensure nodes are properly removed from containment. Change-Id: Icd96d369278987200fd28581395725438972b292	2020-03-05 19:05:39 +00:00
Jennifer Johnson	1c1750e6be	removes bandwidth limiting On satellite, remove all references to free_bandwidth column in nodes table. On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated. Protobuf message, NodeCapacity, is left intact for backwards compatibility. Once this is released to all satellites, we can drop the column from the DB. Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838	2020-03-04 14:04:00 +00:00
Moby von Briesen	6043d01c90	satellite/audit/verifier: add metric for number of successfully downloaded shares Change-Id: Ia4f1dc6e088db802e340aaecf80cc7ef6dc237a4	2020-02-27 14:33:59 +00:00
Egon Elbre	5342dd9fe6	go.mod: update uplink Change-Id: I867a6a1eef8aa5d60bb676e5112b98c4192ce811	2020-02-21 16:08:12 +02:00
Cameron Ayer	b22bf16b35	satellite/overlay: add config flag for node selection free disk requirement Currently SNs report their free disk space once per hour. If a node becomes full, it has to wait until the next contact cycle begins to report; all the while receiving and failing upload requests. By increasing the minimum required disk space, we can give the storage nodes more time to report their space before the completely fill up. This change goes hand-in-hand with another change we want to implement: trigger capacity report on SN immediately upon falling below threshold. Change-Id: I12f778286c6c3f582438b0e2949765ac43325e27	2020-02-11 18:08:25 +00:00
Michal Niewrzal	426c8eb31a	private/testplanet: add DeleteBucket method for uplink New method added to be able to delete easily bucket during tests. Change-Id: Iaae89618cc676ddbbbd4b0df2eeacd143ea6f3c2	2020-02-11 15:58:13 +00:00
Jeff Wendling	7999d24f81	all: use monkit v3 this commit updates our monkit dependency to the v3 version where it outputs in an influx style. this makes discovery much easier as many tools are built to look at it this way. graphite and rothko will suffer some due to no longer being a tree based on dots. hopefully time will exist to update rothko to index based on the new metric format. it adds an influx output for the statreceiver so that we can write to influxdb v1 or v2 directly. Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff	2020-02-05 23:53:17 +00:00
Egon Elbre	8dea4f52db	satellite: add control panel Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b	2020-01-30 08:06:43 -05:00
Michal Niewrzal	6502454947	satellite/metainfo: move RS configuration to satellite With this change RS configuration will be set on satellite. Uplink with get RS values with BeginObject request and will use it. For backward compatibility and to avoid super large change redundancy scheme stored with bucket is not touched. This can be done in future. Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff	2020-01-22 09:33:53 +00:00
Yingrong Zhao	76ee8a1b4c	satellite: remove UptimeReputation configs from codebase With the new storage node downtime tracking feature, we need remove current uptime reputation configs: UptimeReputationAlpha, UptimeReputationBeta, and UptimeReputationDQ. This is the first step of removing the uptime reputation columns from satellitedb Change-Id: Ie8fab13295dbf545e33aeda0c4306cda4ba54e36	2020-01-08 18:54:15 +00:00
Egon Elbre	082ec81714	uplink: move to storj.io/uplink (#3746 )	2020-01-08 15:40:19 +02:00
Egon Elbre	f41d440944	all: reduce number of log messages Remove starting up messages from peers. We expect all of them to start, if they don't, then they should return an error why they don't start. The only informative message is when a service is disabled. When doing initial database setup then each migration step isn't informative, hence print only a single line with the final version. Also use shorter log scopes. Change-Id: Ic8b61411df2eeae2a36d600a0c2fbc97a84a5b93	2020-01-06 19:03:46 +00:00
Egon Elbre	2680bae88c	private/testplanet: remove dependency to uplink Remove direct dependency on uplink.RSConfig, this simplifies moving the config file without introducing weird dependencies. Change-Id: I7fd2a145401e0205d7047631df9d2810241efeec	2020-01-02 09:40:46 +00:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
paul cannon	af24581ac0	satellite/audit: do not report offline to overlay (#3547 )	2019-12-18 04:51:24 -06:00
Moby von Briesen	ab777e823e	do not update pointer for failed audits Change-Id: If88dce8928db28d6f53c3dc771e14ea97aae9661	2019-12-16 10:50:54 -05:00
Egon Elbre	72d407559e	satellite/metainfo: don't leak error implementation detail (#3722 ) * satellite/metainfo: don't leak implementation detail * add missing wrap	2019-12-10 15:21:30 -05:00
Jeff Wendling	17b057b33e	satellite/audit: monitor worker function Change-Id: I94d1161deffe4ea9782abee1afbb5735f18aab44	2019-11-25 17:58:13 +00:00
Maximillian von Briesen	8653dda2b1	satellite/audit: do not contain nodes for unknown errors (#3592 ) * skip unknown errors (wip) * add tests to make sure nodes that time out are added to containment * add bad blobs store * call "Skipped" "Unknown" * add tests to ensure unknown errors do not trigger containment * add monkit stats to lockfile * typo * add periods to end of bad blobs comments	2019-11-19 17:30:28 +01:00
littleskunk	8b3444e088	satellite/nodeselection: don't select nodes that haven't checked in for a while (#3567 ) * satellite/nodeselection: dont select nodes that havent checked in for a while * change testplanet online window to one minute * remove satellite reconfigure online window = 0 in repair tests * pass timestamp into UpdateCheckIn * change timestamp to timestamptz * edit tests to set last_contact_success to 4 hours ago * fix syntax error * remove check for last_contact_success > last_contact_failure in IsOnline	2019-11-15 23:43:06 +01:00
Egon Elbre	ee6c1cac8a	private: rename internal to private (#3573 )	2019-11-14 21:46:15 +02:00
Egon Elbre	cc032d3151	satellite/metainfo: fix some uses of metainfo.Delete (#3513 ) * satellite/metainfo: rename Delete to UnsynchronizedDelete * fix deletes * make db private * fix typos * also verify on commit object	2019-11-06 18:02:14 +01:00
Maximillian von Briesen	7cdc1b351a	satellite/audit: do not audit expired segments (#3497 ) * during audit Verify, return error and delete segment if segment is expired * delete "main" reverify segment and return error if expired * delete contained nodes and pointers when pointers to audit are expired * update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now * Revert "update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now" This reverts commit e9066151cf84afbff0929a6007e641711a56b6e5. * do not count ExpirationDate=time.Time{} as expired	2019-11-05 20:41:48 +01:00
littleskunk	def3dcbaa9	satellite/audit: increase timeout to 5 minutes (#3480 ) * satellite/audit: increase timeout to 5 minutes * fix lint error	2019-11-05 11:21:25 +01:00
JT Olio	2c6fa3c5f8	pkg/rpc: remove read/write deadlines as a mechanism for request timeouts (#3335 ) libuplink was incorrectly setting timeouts to 10 seconds still, but should have been at least 10 minutes. the order sender was setting them to 1 hour. we don't want timeouts in uplink-side logic as it establishes a minimum rate on tcp streams. instead of all of this, just use tcp keep alive. tcp keep alive packets are sent every 15 seconds and if the peer stops responding the connection dies. this is enabled by default with go. this will kill tcp connections when they stop working. Change-Id: I3d7ad49f71950b3eb43044eedf4b17993116045b	2019-10-22 17:57:24 -06:00
littleskunk	eeb38245ff	satellite/audit: improve logging (#3285 )	2019-10-16 13:48:05 +02:00
JT Olio	a5d1776539	audits: missing continue Change-Id: Ifcac8e61ebd8c59407e01c791adc60d9f88ff1b7	2019-10-11 13:55:27 -06:00
Maximillian von Briesen	784ca1582a	satellite/audit: fix audit panic (#3217 )	2019-10-09 10:06:58 -04:00
Maximillian von Briesen	3a3d576d9b	satellite/audit: add mutex to pieceHashesVerified map (#3214 ) * add mutex * remove double send to ch * lock mutex inside defer * import sync	2019-10-08 17:01:32 -04:00
littleskunk	c009543236	satellite/audit: Add piece hash verified to log messages (#3204 )	2019-10-08 12:51:57 +02:00
Maximillian von Briesen	e1b7d01160	satellite/audit: do not fail or contain nodes for audited segments that are not piece-hash-verified (#3161 )	2019-10-07 16:06:10 -04:00
Maximillian von Briesen	edadf46009	satellite/audit: delete nodes from containment when segment has changed (#3115 )	2019-09-29 04:03:15 +02:00
Jeff Wendling	098cbc9c67	all: use pkg/rpc instead of pkg/transport all of the packages and tests work with both grpc and drpc. we'll probably need to do some jenkins pipelines to run the tests with drpc as well. most of the changes are really due to a bit of cleanup of the pkg/transport.Client api into an rpc.Dialer in the spirit of a net.Dialer. now that we don't need observers, we can pass around stateless configuration to everything rather than stateful things that issue observations. it also adds a DialAddressID for the case where we don't have a pb.Node, but we do have an address and want to assert some ID. this happened pretty frequently, and now there's no more weird contortions creating custom tls options, etc. a lot of the other changes are being consistent/using the abstractions in the rpc package to do rpc style things like finding peer information, or checking status codes. Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412	2019-09-25 15:37:06 -06:00
Jennifer Li Johnson	724bb44723	Remove Kademlia dependencies from Satellite and Storagenode (#2966 ) What: cmd/inspector/main.go: removes kad commands internal/testplanet/planet.go: Waits for contact chore to finish satellite/contact/nodesservice.go: creates an empty nodes service implementation satellite/contact/service.go: implements Local and FetchInfo methods & adds external address config value satellite/discovery/service.go: replaces kad.FetchInfo with contact.FetchInfo in Refresh() & removes Discover() satellite/peer.go: sets up contact service and endpoints storagenode/console/service.go: replaces nodeID with contact.Local() storagenode/contact/chore.go: replaces routing table with contact service storagenode/contact/nodesservice.go: creates empty implementation for ping and request info nodes service & implements RequestInfo method storagenode/contact/service.go: creates a service to return the local node and update its own capacity storagenode/monitor/monitor.go: uses contact service in place of routing table storagenode/operator.go: moves operatorconfig from kad into its own setup storagenode/peer.go: sets up contact service, chore, pingstats and endpoints satellite/overlay/config.go: changes NodeSelectionConfig.OnlineWindow default to 4hr to allow for accurate repair selection Removes kademlia setups in: cmd/storagenode/main.go cmd/storj-sim/network.go internal/testplane/planet.go internal/testplanet/satellite.go internal/testplanet/storagenode.go satellite/peer.go scripts/test-sim-backwards.sh scripts/testdata/satellite-config.yaml.lock storagenode/inspector/inspector.go storagenode/peer.go storagenode/storagenodedb/database.go Why: Replacing Kademlia Please describe the tests: • internal/testplanet/planet_test.go: TestBasic: assert that the storagenode can check in with the satellite without any errors TestContact: test that all nodes get inserted into both satellites' overlay cache during testplanet setup • satellite/contact/contact_test.go: TestFetchInfo: Tests that the FetchInfo method returns the correct info • storagenode/contact/contact_test.go: TestNodeInfoUpdated: tests that the contact chore updates the node information TestRequestInfoEndpoint: tests that the Request info endpoint returns the correct info Please describe the performance impact: Node discovery should be at least slightly more performant since each node connects directly to each satellite and no longer needs to wait for bootstrapping. It probably won't be faster in real time on start up since each node waits a random amount of time (less than 1 hr) to initialize its first connection (jitter).	2019-09-19 15:56:34 -04:00
Jess G	93788e5218	remove kademlia: create upsert query to update uptime (#2999 ) * create upsert query for check-in method * add tests * fix lint err * add benchmark test for db query * fix lint and tests * add a unit test, fix lint * add address to tests * replace print w/ b.Fatal * refactor query per CR comments * fix disqualified, only set if null * fix query * add version to updatecheckin query * fix version * fix tests * change version for tests * add version to tests * add IP, add transport, mv unit test * use node.address as arg * add last ip * fix lint	2019-09-19 11:37:31 -07:00
Maximillian von Briesen	d22987ea1d	satellite/audit: Fix flakiness in TestReverifyDifferentShare	2019-09-19 10:50:16 -04:00
Maximillian von Briesen	a4048fd529	satellite/audit: fix containment mode (#3085 ) * add test to make sure we will reverify the share in the containment db rather than in the pointer passed into reverify * use pending audit information only when running reverify	2019-09-19 01:45:15 +02:00
Jess G	7c203b4884	add satelliteSystem to testplanet and update tests (#3066 )	2019-09-17 13:14:49 -07:00
Natalie Villasana	4aaf525bd3	satellite/audit: set devDefaults for ChoreInterval and QueueInterval to 1m (#3058 )	2019-09-16 16:36:33 -04:00
Egon Elbre	7240e6cbb2	satellite: remove remote/inline file from BucketTally (#3041 )	2019-09-13 16:51:41 +03:00
Egon Elbre	8b668ab1f8	satellite/metainfo.Loop: use a parsed path for observers (#3003 )	2019-09-12 13:38:49 +03:00
Natalie Villasana	aa3567187e	satellite/audit: worker now verifies and reverifies (#2965 )	2019-09-11 18:37:01 -04:00
Egon Elbre	a801fab66a	all: add archview annotations (#2964 )	2019-09-10 16:24:16 +03:00
Natalie Villasana	6d363fb756	satellite/audit: create the audit queue, chore, and worker (#2888 )	2019-09-05 11:40:52 -04:00
Yingrong Zhao	8eda360ad3	add segment path into logs (#2898 )	2019-08-29 08:38:26 -04:00
Natalie Villasana	49303ea3ac	satellite/audit: mv ReservoirService into its own file (#2886 )	2019-08-27 13:39:51 -04:00
Ivan Fraixedes	df29699641	satellite/audit: Improve code comment in reporter (#2838 )	2019-08-22 14:13:43 +02:00
Egon Elbre	00b2e1a7d7	all: enable staticcheck (#2849 ) * by having megacheck in disable it also disabled staticcheck * fix closing body * keep interfacer disabled * hide bodies * don't use deprecated func * fix dead code * fix potential overrun * keep stylecheck disabled * don't pass nil as context * fix infinite recursion * remove extraneous return * fix data race * use correct func * ignore unused var * remove unused consts	2019-08-22 13:40:15 +02:00
Egon Elbre	2d69d47655	all: fix Error.New formatting (#2840 )	2019-08-21 19:30:29 +03:00

1 2 3 4

158 Commits