storj

Author	SHA1	Message	Date
Cameron Ayer	cadb435d25	{satellite/audit, private/testplanet}: remove ErrAlreadyExists, run 2 audit workers in testplanet Since we increased the number of concurrent audit workers to two, there are going to be instances of a single node being audited simultaneously for different segments. If the node times out for both, we will try to write them both to the pending audits table, and the second will return an error since the path is not the same as what already exists. Since with concurrent workers this is expected, we will log the occurrence rather than return an error. Since the release default audit concurrency is 2, update testplanet default to run with concurrent workers as well. Change-Id: I4e657693fa3e825713a219af3835ae287bb062cb	2020-06-30 18:00:07 +00:00
Cameron Ayer	3b4b5f45c7	satellite: replace references to Suspended with UnknownAuditSuspended Change-Id: I3d2d00c95954c0546ad077702617895f262926ef	2020-06-23 14:19:22 +00:00
Egon Elbre	410d897840	satellite: fix string(int) conversions Change-Id: I54c6ca8c2dad3c321175f72271b7536cc2a4df09	2020-06-12 06:41:34 +00:00
Cameron Ayer	26fce54b11	satellite/audit: increase MinDownloadTimeout in TestReverifySlowDownload This test failed due to a timeout on a download which is supposed to succeed. The testplanet default for the value is 5 seconds, but here it is 500 milliseconds. It looks like this is due to the fact that later in the test we need to wait for a slow node to timeout, so we cut the timeout shorter to reduce test time. This PR increases the timeout to 1 second. Still not too long to wait, but gives us twice as much time to download, decreasing the likelihood that we see the timeout error. Change-Id: I504db39ab5dc4d3c505520337b258265d6da7020	2020-06-10 17:43:50 +00:00
Jeff Wendling	943eb872d3	satellite/audit: depend less on details of some error message the error message changed when we removed spacemonkeygo/errors Change-Id: I4904a072cfd84e4c39c881b58669325bcf51df46	2020-06-05 10:39:05 -06:00
Michal Niewrzal	84892631c8	private/testplanet: remove old libuplink from testplanet Change-Id: Ib1553f84d0b3ae12a5b00382f0f53357b6a273e2	2020-05-28 13:50:23 +00:00
Jennifer Johnson	03e5f922c3	satellite/overlay: updates node with a vetted_at timestamp if they meet the vetting criteria What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count are greater than the configured thresholds), we set vetted_at to the current timestamp. Why: We may want to use this timestamp in future development to select new vs vetted nodes. It also allows flexibility in node vetting experiments and allows for better metrics around vetting times. Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and commits another variable to the db (vetted_at), but this should be negligible. Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d	2020-05-20 16:30:26 -04:00
Egon Elbre	ed627144ed	all: use DialNodeURL throughout the codebase Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c	2020-05-20 10:36:30 +00:00
Egon Elbre	bcd93ee375	private/testplanet: add StopNodeAndUpdate This was commonly used and code with it can be simplified. Change-Id: I2f2b91f7de54269aee6ef027f97f9e8a7d222e39	2020-05-08 13:02:19 +00:00
Egon Elbre	678b859172	satellite/overlay: remove MinimumRequiredNodes In non-test code we were only using RequestedCount, not need to have MinimumRequiredNodes. Change-Id: I40736f4b028b41e94abfdeb221bce5aa86a5cb82	2020-05-07 15:41:23 +00:00
Egon Elbre	4e94da3fda	satellite/overlay: add feature flag for node selection cache Also distinguish the purpose for selecting nodes to avoid potential confusion, what should allow caching and what shouldn't. Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61	2020-05-06 16:13:47 +03:00
Jennifer Johnson	18078bf7ee	satellite/audit: increases audit worker concurrency to 2 Change-Id: Ibe3e3801b79accffbcfe9e2e02c96fc963894a7f	2020-05-05 11:31:55 +00:00
Moby von Briesen	8f60cfc4fb	satellite/overlay: Add flag for enabling/disabling disqualification from suspension mode Add a flag that allows us to easily switch disqualification from suspension mode on or off. A node will only be disqualified from suspension mode if it has been suspended for longer than the grace period AND the SuspensionDQEnabled flag is true. Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792	2020-05-04 17:25:09 +00:00
Isaac Hess	5a85e8d749	satellite/audit: Fix flaky TestVerifierSlowDownload TestVerifierSlowDownload would sometimes not have enough nodes finish in the allotted deadline period. This increases the deadline and also does not assert that exactly 3 have finished. Instead, in keeping with the purpose of the test, it asserts that the slow download is never counted as a success and is always counted as a pending audit in the final report. Change-Id: I180734fcc4a499420c75164bad6253ed155d87de	2020-04-30 15:58:47 -06:00
Moby von Briesen	de366537a8	satellite/satellitedb/overlaycache: fix behavior around gracefully exited nodes Sometimes nodes who have gracefully exited will still be holding pieces according to the satellite. This has some unintended side effects currently, such as nodes getting disqualified after having successfully exited. * When the audit reporter attempts to update node stats, do not update stats (alpha, beta, suspension, disqualification) if the node has finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated) * Treat gracefully exited nodes as "not reputable" so that the repairer and checker do not count them as healthy (overlay/statdb_test.go TestKnownUnreliableOrOffline, repair/repair_test.go TestRepairGracefullyExited) Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272	2020-04-28 23:58:43 +00:00
Jess G	825226c98e	satellite/overlay: use node selection cache for uploads (#3859 ) * satellite/overlay: use node selection cache for uploads Change-Id: Ibd16cccee979d0544f2f4a01749af9f36f02a6ad * fix config lock Change-Id: Idd307e4dee8ab92749f1ec3f996419ea0af829fd * start fixing tests Change-Id: I207d373a3b2a2d9312c9e72fe9bd0b01e06ad6cf * fix test, add some more Change-Id: I82b99c2004fca2510965f9b389f87dd4474bc722 * change config name Change-Id: I0c0f7fc726b2565dc3828cb723f5459a940f2a0b * add benchmarks Change-Id: I05fa25bff8d5b65f94d918556855b95163d002e9 * revert bench to put in different PR Change-Id: I0f6942296895594768f19614bd7b2e3b9b106ade * add staleness to benchmark Change-Id: Ia80a310623d5a342afa6d835402170b531b0f870 * add cache config to testplanet Change-Id: I39abdab8cc442694da543115a9e470b2a8a25dff * have repair select old way Change-Id: I25a938457d7d1bcf89fd15130cb6b0ac19585252 * lower testplante config time Change-Id: Ib56a2ed086c06bc6061388d15a10a2526a663af7 * fix test Change-Id: I3868e9cacde2dfbf9c407afab04dc5fc2f286f69	2020-04-24 09:11:04 -07:00
Moby von Briesen	72b93f3120	satellite/satellitedb: disqualify suspended nodes when the grace period passes If a node is suspended and receives an unknown or failing audit, disqualify them if the grace period (default 1w in production) has passed. Migrate the nodes table so any node that is currently suspended gets unsuspended when the satellite starts up. Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82	2020-04-22 15:45:00 -04:00
Moby von Briesen	d7794a4851	satellite/overlay: hardcode default values for audit alpha/beta Alpha=1 and beta=0 are the expected first values for any alpha/beta reputation system we are using in the codebase. So we are removing the configurability of these values. Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3	2020-04-14 19:12:40 +00:00
Cameron Ayer	02613407ae	satellite/satellitedb: only suspend node if not already suspended Whenever the node's reputation is updated, if its unknown audit reputation is below the suspension threshold, its suspension field is set to the current time. This could overwrite the previous "suspendedAt" value resulting a node that never reaches the end of its suspension. Also log whenever a node is disqualified or its suspension status changes Change-Id: I5e8c8f1c46f66d79cb279b5b16a84fe03f533deb	2020-04-10 09:37:37 +00:00
Egon Elbre	11a44cdd88	all: don't depend on gogo/proto directly Change-Id: I8822dea0d1b7b99e0b828e0373a0308a42dde2be	2020-04-08 17:32:15 +00:00
paul cannon	0c8c11b251	satellite/audit: add not_enough_shares_for_audit counter We have been using the SQL expression `name='(*Verifier).Verify' AND error_name='not enough shares for successful audit'` thus far to detect cases of this problem and alert on them. Unfortunately, since this rarely (hopefully never) happens, influxdb has no data for most of the auditor instances, and when it has no data for a time series, it returns no columns either. This makes Redash upset when it tries to perform a query for an alert and can't find the column whose value it expects to check. This change should make it so zero values are reported when the problem has not happened, and higher values when it has. Change-Id: I79e5e000f879678b661dac88caae1e2915b39ab1	2020-04-03 17:00:50 +00:00
littleskunk	23e5a0471f	satellite/audit: clean up logging (#3832 ) Co-authored-by: Ivan Fraixedes <ivan@fraixed.es>	2020-03-30 12:09:50 -06:00
Egon Elbre	cb781d66c7	satellite/overlay: optimize FindStorageNodes Reduce the number of fields returned from the query. Benchmark results in `satellite/overlay`: benchstat before.txt after2.txt name old time/op new time/op delta SelectStorageNodes-32 7.85ms ± 1% 6.27ms ± 1% -20.18% (p=0.002 n=10+4) SelectNewStorageNodes-32 8.21ms ± 1% 6.61ms ± 0% -19.53% (p=0.002 n=10+4) SelectStorageNodesExclusion-32 17.2ms ± 1% 15.9ms ± 1% -7.55% (p=0.002 n=10+4) SelectNewStorageNodesExclusion-32 17.8ms ± 2% 16.1ms ± 0% -9.38% (p=0.002 n=10+4) FindStorageNodes-32 48.4ms ± 1% 45.1ms ± 0% -6.69% (p=0.002 n=10+4) FindStorageNodesExclusion-32 79.2ms ± 1% 76.1ms ± 1% -3.89% (p=0.002 n=10+4) Benchmark results from `satellite/overlay` after making them parallel: benchstat before-parallel.txt after2-parallel.txt name old time/op new time/op delta SelectStorageNodes-32 548µs ± 1% 353µs ± 1% -35.60% (p=0.029 n=4+4) SelectNewStorageNodes-32 562µs ± 0% 368µs ± 0% -34.51% (p=0.029 n=4+4) SelectStorageNodesExclusion-32 1.02ms ± 1% 0.84ms ± 0% -18.08% (p=0.029 n=4+4) SelectNewStorageNodesExclusion-32 1.03ms ± 1% 0.86ms ± 2% -16.22% (p=0.029 n=4+4) FindStorageNodes-32 3.11ms ± 0% 2.79ms ± 1% -10.27% (p=0.029 n=4+4) FindStorageNodesExclusion-32 4.75ms ± 0% 4.43ms ± 1% -6.56% (p=0.029 n=4+4) Change-Id: I1d85e2764eb270f4c2b1998303ccfc1179d65b26	2020-03-30 18:36:23 +03:00
Egon Elbre	e1a443b04a	private/testplanet: allow modifying created database Instead of providing the database from outside to testplanet create it inside and then allow wrapping and modifying it. This is more convenient to use. Change-Id: I9b8f69e6e0a19ff984b4e2bfe927c9100c77bc6c	2020-03-27 19:14:48 +00:00
Egon Elbre	e8f18a2cfe	private/testplanet: expose storagenode and satellite Config Change-Id: I80fe7ed8ef7356948879afcc6ecb984c5d1a6b9d	2020-03-27 17:01:25 +02:00
Moby von Briesen	2f991b6c56	satellite/{overlay, satellitedb}: account for `suspended` field in overlay cache Make sure that suspended nodes are treated appropriately by the overlay cache. This means we should expect the following behavior: * suspended nodes (vetted or not) should not be selected for uploading new segments * suspended nodes should be treated by the checker and repairer as "unhealthy", and should be removed upon successful repair This commit also removes unused overlay functionality. Fixes a bug with commit `8b72181a1f` where the audit reporter was automatically suspending nodes regardless of audit outcome (see test added). Tests: * updates repair tests to ensure that a suspended node is treated as unhealthy and will be removed from the pointer on successful repair * updates overlay tests for KnownUnreliableOrOffline and KnownReliable to expect suspended nodes to be considered "unreliable" * adds satellitedb test that ensures overlay.SelectStorageNodes and overlay.SelectNewStorageNodes do not include suspended nodes * adds audit reporter test to ensure that different audit outcomes result in the correct suspended/disqualified states Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561	2020-03-17 17:14:56 +00:00
Moby von Briesen	8b72181a1f	satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension * change overlay.UpdateStats to allow a third audit outcome. Now it can handle successful, failed, and unknown audits. * when "unknown audit reputation" (unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the DQ threshold, put node into suspension. * when unknown audit reputation goes above the DQ threshold, remove node from suspension. * record unknown audits from audit reporter. * add basic tests around unknown audits and suspension. Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1	2020-03-16 20:29:26 +00:00
Bill Thorp	94c11c5212	satellite: remove some unnecessary UTC() calls Fixes some easy cases of extraneous UTC() calls Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a	2020-03-13 13:49:44 +00:00
Jess G	39cb821196	satellite/overlay: rm combinedcache, fix IP naming to be network (#3798 ) * rn combinedcache, rm dns node lookup Change-Id: I239f07211764b097d851230d8c81900a47756e9e * excludeIPs -> excludedNetworks Change-Id: Ifa6f44ab17457cdd5aff4cd5694296867c18b179 * use lowercase var name Change-Id: I825aad2b718c71f455e747be18f8cabd02aabe55 * update Getnetwork name Change-Id: I002a1b7bc6b4ef40159c0cd2b0ef209f80a9c503 * fix comments Change-Id: Ibddf5b9ffa9d685af6c392d893db063ef18e45fa * update comments with ipv6 Change-Id: I31758b7d4979e7c27d014668f4fb532ad838cda2 Co-authored-by: Stefan Benten <mail@stefan-benten.de>	2020-03-12 11:37:57 -07:00
Jennifer Johnson	0d60c1a4b2	satellite/audit: fix checkSegmentAltered to detect segments that have changed during an audit - Previously, checkSegmentAltered only checked for segments that were replaced but we want to detect all changes to a segment that occurred while an audit was being conducted. - Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified segments were not being removed from containment mode. - Filled in gaps in reverify testing to ensure nodes are properly removed from containment. Change-Id: Icd96d369278987200fd28581395725438972b292	2020-03-05 19:05:39 +00:00
Jennifer Johnson	1c1750e6be	removes bandwidth limiting On satellite, remove all references to free_bandwidth column in nodes table. On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated. Protobuf message, NodeCapacity, is left intact for backwards compatibility. Once this is released to all satellites, we can drop the column from the DB. Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838	2020-03-04 14:04:00 +00:00
Moby von Briesen	6043d01c90	satellite/audit/verifier: add metric for number of successfully downloaded shares Change-Id: Ia4f1dc6e088db802e340aaecf80cc7ef6dc237a4	2020-02-27 14:33:59 +00:00
Egon Elbre	5342dd9fe6	go.mod: update uplink Change-Id: I867a6a1eef8aa5d60bb676e5112b98c4192ce811	2020-02-21 16:08:12 +02:00
Cameron Ayer	b22bf16b35	satellite/overlay: add config flag for node selection free disk requirement Currently SNs report their free disk space once per hour. If a node becomes full, it has to wait until the next contact cycle begins to report; all the while receiving and failing upload requests. By increasing the minimum required disk space, we can give the storage nodes more time to report their space before the completely fill up. This change goes hand-in-hand with another change we want to implement: trigger capacity report on SN immediately upon falling below threshold. Change-Id: I12f778286c6c3f582438b0e2949765ac43325e27	2020-02-11 18:08:25 +00:00
Michal Niewrzal	426c8eb31a	private/testplanet: add DeleteBucket method for uplink New method added to be able to delete easily bucket during tests. Change-Id: Iaae89618cc676ddbbbd4b0df2eeacd143ea6f3c2	2020-02-11 15:58:13 +00:00
Jeff Wendling	7999d24f81	all: use monkit v3 this commit updates our monkit dependency to the v3 version where it outputs in an influx style. this makes discovery much easier as many tools are built to look at it this way. graphite and rothko will suffer some due to no longer being a tree based on dots. hopefully time will exist to update rothko to index based on the new metric format. it adds an influx output for the statreceiver so that we can write to influxdb v1 or v2 directly. Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff	2020-02-05 23:53:17 +00:00
Egon Elbre	8dea4f52db	satellite: add control panel Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b	2020-01-30 08:06:43 -05:00
Michal Niewrzal	6502454947	satellite/metainfo: move RS configuration to satellite With this change RS configuration will be set on satellite. Uplink with get RS values with BeginObject request and will use it. For backward compatibility and to avoid super large change redundancy scheme stored with bucket is not touched. This can be done in future. Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff	2020-01-22 09:33:53 +00:00
Yingrong Zhao	76ee8a1b4c	satellite: remove UptimeReputation configs from codebase With the new storage node downtime tracking feature, we need remove current uptime reputation configs: UptimeReputationAlpha, UptimeReputationBeta, and UptimeReputationDQ. This is the first step of removing the uptime reputation columns from satellitedb Change-Id: Ie8fab13295dbf545e33aeda0c4306cda4ba54e36	2020-01-08 18:54:15 +00:00
Egon Elbre	082ec81714	uplink: move to storj.io/uplink (#3746 )	2020-01-08 15:40:19 +02:00
Egon Elbre	f41d440944	all: reduce number of log messages Remove starting up messages from peers. We expect all of them to start, if they don't, then they should return an error why they don't start. The only informative message is when a service is disabled. When doing initial database setup then each migration step isn't informative, hence print only a single line with the final version. Also use shorter log scopes. Change-Id: Ic8b61411df2eeae2a36d600a0c2fbc97a84a5b93	2020-01-06 19:03:46 +00:00
Egon Elbre	2680bae88c	private/testplanet: remove dependency to uplink Remove direct dependency on uplink.RSConfig, this simplifies moving the config file without introducing weird dependencies. Change-Id: I7fd2a145401e0205d7047631df9d2810241efeec	2020-01-02 09:40:46 +00:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
paul cannon	af24581ac0	satellite/audit: do not report offline to overlay (#3547 )	2019-12-18 04:51:24 -06:00
Moby von Briesen	ab777e823e	do not update pointer for failed audits Change-Id: If88dce8928db28d6f53c3dc771e14ea97aae9661	2019-12-16 10:50:54 -05:00
Egon Elbre	72d407559e	satellite/metainfo: don't leak error implementation detail (#3722 ) * satellite/metainfo: don't leak implementation detail * add missing wrap	2019-12-10 15:21:30 -05:00
Jeff Wendling	17b057b33e	satellite/audit: monitor worker function Change-Id: I94d1161deffe4ea9782abee1afbb5735f18aab44	2019-11-25 17:58:13 +00:00
Maximillian von Briesen	8653dda2b1	satellite/audit: do not contain nodes for unknown errors (#3592 ) * skip unknown errors (wip) * add tests to make sure nodes that time out are added to containment * add bad blobs store * call "Skipped" "Unknown" * add tests to ensure unknown errors do not trigger containment * add monkit stats to lockfile * typo * add periods to end of bad blobs comments	2019-11-19 17:30:28 +01:00
littleskunk	8b3444e088	satellite/nodeselection: don't select nodes that haven't checked in for a while (#3567 ) * satellite/nodeselection: dont select nodes that havent checked in for a while * change testplanet online window to one minute * remove satellite reconfigure online window = 0 in repair tests * pass timestamp into UpdateCheckIn * change timestamp to timestamptz * edit tests to set last_contact_success to 4 hours ago * fix syntax error * remove check for last_contact_success > last_contact_failure in IsOnline	2019-11-15 23:43:06 +01:00
Egon Elbre	ee6c1cac8a	private: rename internal to private (#3573 )	2019-11-14 21:46:15 +02:00

1 2

86 Commits