storj

Author	SHA1	Message	Date
Moby von Briesen	acf8b72cd0	satellite/repair/repairer: cut off long tail when minimum number of required uploads is met This will speed up the Put step of repair by not waiting to time out for a handful of slow nodes, at the expense of a slightly less durable pointer. It will still repair to the optimal threshold, but not every node that is selected will end up in the pointer. Change-Id: I02a0658e3fe6fc0383f26af0f50a065b8b11a651	2020-05-28 16:25:28 -04:00
Moby von Briesen	290c006a10	satellite/repair/{checker,queue}: add metric for new segments added to repair queue * add monkit stat new_remote_segments_needing_repair, which reports the number of new unhealthy segments in the repair queue since the previous checker iteration Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3	2020-05-27 06:23:47 +00:00
Egon Elbre	bef84a5f9d	storagenode: remove dependency to overlay.NodeDossier This is the last dependency from storage node to satellite. Change-Id: I12f7abb91e84f823ba5af126c6e2979519838612	2020-05-21 08:37:13 +03:00
Egon Elbre	941d10cbc3	private/testplanet: remove Peer.Local() Currently storagenode depends on overlay.NodeDossier, this is the first step in removing it. Change-Id: I034a3f1601835f8349bd41752455022e19bcc707	2020-05-20 11:05:34 +00:00
Egon Elbre	ed627144ed	all: use DialNodeURL throughout the codebase Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c	2020-05-20 10:36:30 +00:00
Egon Elbre	ec589a8289	all: fix comments about grpc Change-Id: Id830fbe2d44f083c88765561b6c07c5689afe5bd	2020-05-11 13:05:34 +03:00
Egon Elbre	bcd93ee375	private/testplanet: add StopNodeAndUpdate This was commonly used and code with it can be simplified. Change-Id: I2f2b91f7de54269aee6ef027f97f9e8a7d222e39	2020-05-08 13:02:19 +00:00
Moby von Briesen	de366537a8	satellite/satellitedb/overlaycache: fix behavior around gracefully exited nodes Sometimes nodes who have gracefully exited will still be holding pieces according to the satellite. This has some unintended side effects currently, such as nodes getting disqualified after having successfully exited. * When the audit reporter attempts to update node stats, do not update stats (alpha, beta, suspension, disqualification) if the node has finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated) * Treat gracefully exited nodes as "not reputable" so that the repairer and checker do not count them as healthy (overlay/statdb_test.go TestKnownUnreliableOrOffline, repair/repair_test.go TestRepairGracefullyExited) Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272	2020-04-28 23:58:43 +00:00
Jess G	825226c98e	satellite/overlay: use node selection cache for uploads (#3859 ) * satellite/overlay: use node selection cache for uploads Change-Id: Ibd16cccee979d0544f2f4a01749af9f36f02a6ad * fix config lock Change-Id: Idd307e4dee8ab92749f1ec3f996419ea0af829fd * start fixing tests Change-Id: I207d373a3b2a2d9312c9e72fe9bd0b01e06ad6cf * fix test, add some more Change-Id: I82b99c2004fca2510965f9b389f87dd4474bc722 * change config name Change-Id: I0c0f7fc726b2565dc3828cb723f5459a940f2a0b * add benchmarks Change-Id: I05fa25bff8d5b65f94d918556855b95163d002e9 * revert bench to put in different PR Change-Id: I0f6942296895594768f19614bd7b2e3b9b106ade * add staleness to benchmark Change-Id: Ia80a310623d5a342afa6d835402170b531b0f870 * add cache config to testplanet Change-Id: I39abdab8cc442694da543115a9e470b2a8a25dff * have repair select old way Change-Id: I25a938457d7d1bcf89fd15130cb6b0ac19585252 * lower testplanet config time Change-Id: Ib56a2ed086c06bc6061388d15a10a2526a663af7 * fix test Change-Id: I3868e9cacde2dfbf9c407afab04dc5fc2f286f69	2020-04-24 09:11:04 -07:00
Moby von Briesen	178aa8b5e0	satellite/{metainfo,repair}: Delete expired segments from metainfo * Delete expired segments in expired segments service using metainfo loop * Add test to verify expired segments service deletes expired segments * Ignore expired segments in checker observer * Modify checker tests to verify that expired segments are ignored * Ignore expired segments in segment repairer and drop from repair queue * Add repair test to verify that a segment that expires after being added to the repair queue is ignored and dropped from the repair queue Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d	2020-04-22 13:02:31 +00:00
Jess G	75b9a5971e	satellite: update log levels (#3851 ) * satellite: update log levels Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f * log version on start up for every service Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de * Use monkit for tracking failed ip resolutions Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2 * fix compile errors Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf * add request limit value to storage node rpc err Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5 * we cant track storage node ids in monkit metrics so lets use logging to track that for expired orders Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887 * fix build errs Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47	2020-04-15 12:32:22 -07:00
Egon Elbre	6492b13d81	all: remove old uuid Change-Id: I3a137f73456f010c37d3933dbe12cbbb840b809f	2020-04-02 19:30:36 +03:00
Egon Elbre	0a69da4ff1	all: switch to storj.io/common/uuid Change-Id: I178a0a8dac691e57bce317b91411292fb3c40c9f	2020-03-31 19:16:41 +03:00
littleskunk	048ca4558f	satellite/repair: clean up logging (#3833 ) Co-authored-by: Michal Niewrzal <michal@storj.io>	2020-03-30 11:59:56 +02:00
Egon Elbre	480ea1e4b5	satellite/repair/repairer: fix temporary file handling Change-Id: Ice1a467510737b3375c018ae37b16431c7dffe9e	2020-03-27 21:36:23 +02:00
Moby von Briesen	a933bcc99a	satellite/repair/repairer/ec.go: add option for downloading pieces onto disk instead of in memory during repair Add flag to satellite repairer, "InMemoryRepair" that allows the satellite to decide whether to download the entire segment being repaired into memory (this is what the satellite already does), or to download it into temporary files on disk that will be read from in the upload phase of repair. This should help with handling high repair traffic on satellites that cannot afford to spend 64mb of memory per repair worker. Updates tests to test repair for both in memory and to disk. Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194	2020-03-27 16:41:00 +00:00
Egon Elbre	e8f18a2cfe	private/testplanet: expose storagenode and satellite Config Change-Id: I80fe7ed8ef7356948879afcc6ecb984c5d1a6b9d	2020-03-27 17:01:25 +02:00
paul cannon	ba5991dc86	satellite/repair: add monitoring for remote_segments_healthy_percentage Change-Id: I6ad29fe1a947ac19d15e40ea33164a510eb33d4f	2020-03-17 17:45:59 +00:00
Moby von Briesen	2f991b6c56	satellite/{overlay, satellitedb}: account for `suspended` field in overlay cache Make sure that suspended nodes are treated appropriately by the overlay cache. This means we should expect the following behavior: * suspended nodes (vetted or not) should not be selected for uploading new segments * suspended nodes should be treated by the checker and repairer as "unhealthy", and should be removed upon successful repair This commit also removes unused overlay functionality. Fixes a bug with commit `8b72181a1f` where the audit reporter was automatically suspending nodes regardless of audit outcome (see test added). Tests: * updates repair tests to ensure that a suspended node is treated as unhealthy and will be removed from the pointer on successful repair * updates overlay tests for KnownUnreliableOrOffline and KnownReliable to expect suspended nodes to be considered "unreliable" * adds satellitedb test that ensures overlay.SelectStorageNodes and overlay.SelectNewStorageNodes do not include suspended nodes * adds audit reporter test to ensure that different audit outcomes result in the correct suspended/disqualified states Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561	2020-03-17 17:14:56 +00:00
Moby von Briesen	8b72181a1f	satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension * change overlay.UpdateStats to allow a third audit outcome. Now it can handle successful, failed, and unknown audits. * when "unknown audit reputation" (unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the DQ threshold, put node into suspension. * when unknown audit reputation goes above the DQ threshold, remove node from suspension. * record unknown audits from audit reporter. * add basic tests around unknown audits and suspension. Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1	2020-03-16 20:29:26 +00:00
Bill Thorp	94c11c5212	satellite: remove some unnecessary UTC() calls Fixes some easy cases of extraneous UTC() calls Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a	2020-03-13 13:49:44 +00:00
Jess G	39cb821196	satellite/overlay: rm combinedcache, fix IP naming to be network (#3798 ) * rn combinedcache, rm dns node lookup Change-Id: I239f07211764b097d851230d8c81900a47756e9e * excludeIPs -> excludedNetworks Change-Id: Ifa6f44ab17457cdd5aff4cd5694296867c18b179 * use lowercase var name Change-Id: I825aad2b718c71f455e747be18f8cabd02aabe55 * update Getnetwork name Change-Id: I002a1b7bc6b4ef40159c0cd2b0ef209f80a9c503 * fix comments Change-Id: Ibddf5b9ffa9d685af6c392d893db063ef18e45fa * update comments with ipv6 Change-Id: I31758b7d4979e7c27d014668f4fb532ad838cda2 Co-authored-by: Stefan Benten <mail@stefan-benten.de>	2020-03-12 11:37:57 -07:00
Jessica Grebenschikov	803e2930f4	satellite: use IP for all uplink operations, use hostname for audit and repairs My understanding is that the nodes table has the following fields: - `address` field which can be a hostname or an IP - `last_net` field that is the /24 subnet of the IP resolved from the address This PR does the following: 1) add back the `last_ip` field to the nodes table 2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time. 3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date 4) try to reduce confusion about hostname, ip, subnet, and address in the code base Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac	2020-03-11 09:11:40 -07:00
paul cannon	79553059cb	satellite/repair: put irreparable segments in irreparableDB Previously, we were simply discarding rows from the repair queue when they couldn't be repaired (either because the overlay said too many nodes were down, or because we failed to download enough pieces). Now, such segments will be put into the irreparableDB for further and (hopefully) more focused attention. This change also better differentiates some error cases from Repair() for monitoring purposes. Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c	2020-03-09 21:45:16 +00:00
Moby von Briesen	e4da7bd9cd	satellite/repair/checker: use repair override if available in checker and irreparable In production, the satellite is overriding the default repair threshold (35) to a higher value (52). In some places in the checker and irreparable processes, the repair threshold on the redundancy scheme is used in place of the override value. This fixes those cases. Change-Id: Ie7387217d9fb3886f050b5e5b67be51f276196de	2020-03-06 15:39:53 -05:00
Bill Thorp	e99e675fb1	satellite/satellitedb: use time zones with all timestamps The migration was broken into one migration per table to reduce table locking and reduce the chances of failure due to SQL timeouts. Of the 14 fields that lacked time zones, only the 3 named `interval_start` seemed to have non-UTC data in them. These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE'). Fields with good data are migrated by adding AT TIME ZONE 'UTC'. Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used. https://storjlabs.atlassian.net/browse/SM-104 Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce	2020-03-05 21:11:25 +00:00
Jennifer Johnson	1c1750e6be	removes bandwidth limiting On satellite, remove all references to free_bandwidth column in nodes table. On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated. Protobuf message, NodeCapacity, is left intact for backwards compatibility. Once this is released to all satellites, we can drop the column from the DB. Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838	2020-03-04 14:04:00 +00:00
Moby von Briesen	d5540c89a1	satellite/repair/checker: add monkit metrics for segments immediately above repair threshold Record counts for segments at health=rt+1 through health=rt+5 for every checker iteration. Change-Id: I2a00c0bc34d17beb21cacdeab4dac77f755faefe	2020-02-26 20:27:15 +00:00
Moby von Briesen	4e5a7f13c7	satellite/repair/queue: Prioritize selection of items off repair queue by segment health Add a column to the repair queue table in the satellite db for healthy piece count. When an item is selected from the repair queue, the least durable segment that has not been attempted in the past hour should be selected first. This prevents our repairer from getting stuck doing work on segments that are close to the repair threshold while allowing segments that are more unhealthy to degrade further. The migration also clears the repair queue so that the migration runs quickly and we can properly account for segment health in future repair work. We do not select items off the repair queue that have been attempted in the past six hours. This was changed from one hour to allow us time to try a wider variety of segments when the repair queue is very large. Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b	2020-02-26 09:54:16 -05:00
paul cannon	92d86fa044	satellite/repair: fix repair concurrency This new repair timeout (configured as TotalTimeout) will include both the time to download pieces and the time to upload pieces, as well as the time to pop the segment from the repair queue. This is a move from Github PR #3645. Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8	2020-02-24 19:57:09 +00:00
Egon Elbre	5342dd9fe6	go.mod: update uplink Change-Id: I867a6a1eef8aa5d60bb676e5112b98c4192ce811	2020-02-21 16:08:12 +02:00
Cameron Ayer	b22bf16b35	satellite/overlay: add config flag for node selection free disk requirement Currently SNs report their free disk space once per hour. If a node becomes full, it has to wait until the next contact cycle begins to report; all the while receiving and failing upload requests. By increasing the minimum required disk space, we can give the storage nodes more time to report their space before they completely fill up. This change goes hand-in-hand with another change we want to implement: trigger capacity report on SN immediately upon falling below threshold. Change-Id: I12f778286c6c3f582438b0e2949765ac43325e27	2020-02-11 18:08:25 +00:00
Michal Niewrzal	426c8eb31a	private/testplanet: add DeleteBucket method for uplink New method added to make it easy to delete a bucket during tests. Change-Id: Iaae89618cc676ddbbbd4b0df2eeacd143ea6f3c2	2020-02-11 15:58:13 +00:00
Jeff Wendling	7999d24f81	all: use monkit v3 this commit updates our monkit dependency to the v3 version where it outputs in an influx style. this makes discovery much easier as many tools are built to look at it this way. graphite and rothko will suffer some due to no longer being a tree based on dots. hopefully time will exist to update rothko to index based on the new metric format. it adds an influx output for the statreceiver so that we can write to influxdb v1 or v2 directly. Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff	2020-02-05 23:53:17 +00:00
Moby von Briesen	006a2824ba	satellite/repair: lock monkit stats in checker and repairer Change-Id: Ia10fc8da0177389a500359ce51d21a5806f3f7b1	2020-01-30 14:09:56 +00:00
Egon Elbre	8dea4f52db	satellite: add control panel Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b	2020-01-30 08:06:43 -05:00
Michal Niewrzal	6502454947	satellite/metainfo: move RS configuration to satellite With this change RS configuration will be set on satellite. Uplink will get RS values with the BeginObject request and will use them. For backward compatibility, and to avoid a very large change, the redundancy scheme stored with the bucket is not touched. This can be done in the future. Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff	2020-01-22 09:33:53 +00:00
Egon Elbre	f3b4bf2b7c	satellite/satellitedb/satellitedbtest: pass ctx as an argument ctx is created in most tests, instead pass in as argument to reduce code duplication. Change-Id: I466c51c008392001129c8b007c9d6b3619935ac4	2020-01-20 16:35:42 +02:00
Jeff Wendling	9da16b1d9e	satellite/satellitedb/dbx: name the package dbx everyone was importing it as dbx anyway. why should it be named satellitedb? so yeah just pass the "-p dbx" flag. Change-Id: I5efa669f4f00f196b38a9acd0d402009475a936f	2020-01-15 15:16:39 -07:00
Yingrong Zhao	76ee8a1b4c	satellite: remove UptimeReputation configs from codebase With the new storage node downtime tracking feature, we need to remove the current uptime reputation configs: UptimeReputationAlpha, UptimeReputationBeta, and UptimeReputationDQ. This is the first step of removing the uptime reputation columns from satellitedb. Change-Id: Ie8fab13295dbf545e33aeda0c4306cda4ba54e36	2020-01-08 18:54:15 +00:00
Egon Elbre	082ec81714	uplink: move to storj.io/uplink (#3746 )	2020-01-08 15:40:19 +02:00
Egon Elbre	2680bae88c	private/testplanet: remove dependency to uplink Remove direct dependency on uplink.RSConfig, this simplifies moving the config file without introducing weird dependencies. Change-Id: I7fd2a145401e0205d7047631df9d2810241efeec	2020-01-02 09:40:46 +00:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
Cameron Ayer	a4f9865b47	satellite: adds and enables cockroachdb compatibility for tests Change-Id: I85a3ad8c3b9d7e15ea8675b6c55af0002933db57	2019-12-16 22:29:25 +00:00
Egon Elbre	72d407559e	satellite/metainfo: don't leak error implementation detail (#3722 ) * satellite/metainfo: don't leak implementation detail * add missing wrap	2019-12-10 15:21:30 -05:00
littleskunk	71b58edb2c	satellite/repair: decrease repair interval Change-Id: Id9efdbfaa82521c35dc41e7a52b700522c197e77	2019-12-10 00:36:00 +00:00
Yingrong Zhao	7af42e3c10	satellite/metainfo, satellite/repair, uplink/eestream: add metric for download failed due to not enough pieces available (#3665 )	2019-12-04 16:24:36 -05:00
Jennifer Johnson	ecb960f506	private/dbutil: distinguishes between db drivers and implementations to allow for different implementations of SQL queries. Change-Id: I2dc8d1d371139aa8bc805e92a2b80b71f580fd64	2019-12-04 18:31:26 +00:00
littleskunk	c52c7275ad	satellite/repair: reduce upload timeout (#3597 )	2019-11-18 18:52:56 +01:00
littleskunk	8b3444e088	satellite/nodeselection: don't select nodes that haven't checked in for a while (#3567 ) * satellite/nodeselection: dont select nodes that havent checked in for a while * change testplanet online window to one minute * remove satellite reconfigure online window = 0 in repair tests * pass timestamp into UpdateCheckIn * change timestamp to timestamptz * edit tests to set last_contact_success to 4 hours ago * fix syntax error * remove check for last_contact_success > last_contact_failure in IsOnline	2019-11-15 23:43:06 +01:00

96 Commits