storj

Author	SHA1	Message	Date
JT Olio	1966340b2a	storagenode: report fastopen support Change-Id: Ic707bce32729540391d5139e7f3a82c693bcb834	2023-06-05 15:20:13 +00:00
JT Olio	0b4b04900a	private/server: debounce noise and tls connections to support TCP_FAST_OPEN, we're considering just using two TCP connections in parallel per request, one with and one without. this allows us to safely fire both concurrently without stressing out the node too much. see https://review.dev.storj.io/c/storj/storj/+/9933 Change-Id: I9aa8a0252350db5ace04ee125bfe469203e980ec	2023-03-21 16:51:31 +00:00
JT Olio	382af95499	storagenode/contact: send noise key and settings as contact info Change-Id: I1e7a83de36d5cf16eed8874091b15af1e0b73df7	2023-01-31 21:49:20 +00:00
JT Olio	e40191afd6	storj: upgrade to use latest storj/common NodeAddress Change-Id: I5987391bcfe5f6dfd7b525698c337a4cbda9b76e	2023-01-25 01:37:26 +00:00
Clement Sam	59b37db670	storagenode: overhaul QUIC check implementation The current implementation blocks the startup until one or none of the trusted satellites is able to reach the node via QUIC. This can cause delayed startup. Also, the QUIC check is done once during startup, and if there is a misconfiguration later, SNOs would have to restart the node. In this change, we reuse the contact service which pings the satellite periodically for node checkin. During checkin the satellite tries pinging the node back via both TCP and QUIC and reports both statuses. With this, we are able to get a periodic update of the QUIC status without restarting the node. Also adds the time the node was last pinged via QUIC to the tooltip on the QUIC status tab. Resolves https://github.com/storj/storj/issues/4398 Change-Id: I18aa2a8e8d44e8187f8f2eb51f398fa6073882a4	2022-11-09 03:15:57 +00:00
Clement Sam	8b40a071a0	storagenode: check if QUIC is properly configured This change adds a check to confirm that the UDP port is properly configured for QUIC. Resolves https://github.com/storj/storj/issues/4332 Partly resolves https://github.com/storj/storj/issues/4358 Change-Id: I9a66f26a115e48b4fcd168f50a7d0b4d81712f4e	2022-01-20 12:04:04 +00:00
Jeremy Wharton	8a070e7c25	satellite/overlay: Ignore unnecessary check-ins This prevents the database from being contacted unnecessarily, reducing load. Change-Id: Ib2420f68a20636ec35eb3dd3df8e02bd5341b419	2021-06-22 09:00:41 +00:00
Yingrong Zhao	59f443e71a	storagenode/contact: add authentication for PingNode endpoint Currently, if a node has untrusted a satellite, the satellite can still successfully ping the node. If a node decides to untrust a satellite, the satellite should also mark it as contact failed. Change-Id: Idf80fa00d9849205533dd3e5b3b775b5b9686705	2021-05-12 18:15:32 +00:00
Egon Elbre	961e841bd7	all: fix error naming errs.Class should not contain "error" in the name, since that causes a lot of stutter in the error logs. As an example a log line could end up looking like: ERROR node stats service error: satellitedbs error: node stats database error: no rows Whereas something like: ERROR nodestats service: satellitedbs: nodestatsdb: no rows Would contain all the necessary information without the stutter. Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0	2021-04-29 15:38:21 +03:00
Yingrong Zhao	a3c437a7bf	satellite/contact,storagenode/contact: try ping back to nodes through QUIC We want to encourage storagenodes to open their UDP port. This PR changes contact service in satellite to try to connect to nodes through QUIC. If satellite can't reach nodes through QUIC, it will send an error message back to nodes. On the nodes side, it will always log the error message from check-in if the error message is not empty. Whether satellite can reach nodes through QUIC has no effect on nodes' uptime check. Change-Id: I5ebf80f921c4a6504997d83c8bd45226da9d3703	2021-04-20 19:25:37 +00:00
Egon Elbre	86e698f572	pb: use *UnimplementedServer to avoid breaking API changes Change-Id: I99a34eeb37ac4453411f273511710562a519f57a	2021-03-29 12:26:10 +03:00
Stefan Benten	494bd5db81	all: golangci-lint v1.33.0 fixes (#3985)	2020-12-05 17:01:42 +01:00
Cameron Ayer	48d8114b3f	satellite/contact: treat pingback failure as error If the satellite fails to pingback the storage node during CheckIn an error message is returned to the node in the response, but the actual error value returned is nil. We are only checking the error. This means the node has no feedback about the failure, and the node also does not attempt to retry the connection. Change-Id: Iaed00e422ba91af573e72255cc6671ea97928eae	2020-11-16 18:26:37 +00:00
Egon Elbre	2268cc1df3	all: fix linter complaints Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95	2020-10-13 15:59:01 +03:00
Egon Elbre	0bdb952269	all: use keyed special comment Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b	2020-10-13 15:13:41 +03:00
Egon Elbre	94a09ce20b	all: add missing dots Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a	2020-08-11 17:50:01 +03:00
Egon Elbre	080ba47a06	all: fix dots Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2	2020-07-16 14:58:28 +00:00
Egon Elbre	bef84a5f9d	storagenode: remove dependency to overlay.NodeDossier This is the last dependency from storage node to satellite. Change-Id: I12f7abb91e84f823ba5af126c6e2979519838612	2020-05-21 08:37:13 +03:00
Egon Elbre	94b2b315f7	storagenode/trust: refactor GetAddress to GetNodeURL Most places now need the NodeURL rather than the ID and Address separately. This simplifies code in multiple places. Change-Id: I52621d8ca52296a8b5bf7afbc1001cf8bfb44239	2020-05-20 11:05:15 +00:00
Egon Elbre	ed627144ed	all: use DialNodeURL throughout the codebase Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c	2020-05-20 10:36:30 +00:00
Yingrong Zhao	f663906357	storagenode/contact: call return value from mon.Task() on function finish Change-Id: I5e6462acb99ac1d28b5d2518d5db8a4afe593d11	2020-04-01 23:26:14 +00:00
Jennifer Johnson	b75cbc8e24	satellite,storagenode: remove references to free bandwidth Change-Id: I42a6597544804fa9235e89ec656ebc365eb522e5	2020-03-25 22:28:34 +00:00
Yingrong Zhao	b7b19289d1	bump storj.io/common to latest Change-Id: I16e337660ce8e1ef332cc842dbf4cfa067b9b98b	2020-03-25 09:08:40 -04:00
Jennifer Johnson	1c1750e6be	removes bandwidth limiting On satellite, remove all references to free_bandwidth column in nodes table. On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated. Protobuf message, NodeCapacity, is left intact for backwards compatibility. Once this is released to all satellites, we can drop the column from the DB. Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838	2020-03-04 14:04:00 +00:00
Cameron Ayer	7244a6a84e	storagenode/{contact, piecestore}: implement low disk notification with cooldown When a storagenode begins to run low on capacity, we want to notify the satellite before completely running out of space. To achieve this, at the end of an upload request, the SN checks if its available space has fallen below a certain threshold. If so, trigger a notification to the satellites. The new NotifyLowDisk method on the monitor chore is implemented using the common/sync2.Cooldown type, which allows us to execute contact only once within a given timeframe, avoiding hammering the satellites with requests. This PR contains changes to the storagenode/contact package, namely moving methods involving the actual satellite communication out of Chore and into Service. This allows us to ping satellites from the monitor chore. Change-Id: I668455748cdc6741291b61130d8ef9feece86458	2020-03-03 10:45:37 -05:00
Cameron Ayer	f22bddf122	{storagenode/contact, private/testplanet}: remove ErrFailureToStart and panic in testplanet.Start Change-Id: I252e8c9407400af7bda95a7657c8154660c3c801	2020-02-24 18:24:23 +00:00
Cameron Ayer	3e70a893dd	storagenode/{piecestore, contact}: report capacity to satellites if below specific threshold Currently, storage nodes only report their capacity to satellites once per hour. If a node fills up, it will fail all uploads until the next contact cycle begins. With these changes, at the end of an upload we check whether the MinimumDiskSpace threshold has been passed. If so, trigger the monitor chore to update the node's capacity, then trigger the contact chore to report the new capacity to the satellites. Change-Id: Ie6aadaade1e2c12c87e03f8ff9059a50121380a0	2020-02-18 15:42:48 -05:00
Jeff Wendling	7999d24f81	all: use monkit v3 this commit updates our monkit dependency to the v3 version where it outputs in an influx style. this makes discovery much easier as many tools are built to look at it this way. graphite and rothko will suffer some due to no longer being a tree based on dots. hopefully time will exist to update rothko to index based on the new metric format. it adds an influx output for the statreceiver so that we can write to influxdb v1 or v2 directly. Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff	2020-02-05 23:53:17 +00:00
Egon Elbre	f41d440944	all: reduce number of log messages Remove starting up messages from peers. We expect all of them to start, if they don't, then they should return an error why they don't start. The only informative message is when a service is disabled. When doing initial database setup then each migration step isn't informative, hence print only a single line with the final version. Also use shorter log scopes. Change-Id: Ic8b61411df2eeae2a36d600a0c2fbc97a84a5b93	2020-01-06 19:03:46 +00:00
Egon Elbre	6615ecc9b6	common: separate repository Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a	2019-12-27 14:11:15 +02:00
Egon Elbre	d55288cf68	pkg/rpc: replace methods with direct calls to pb Change-Id: I8bd015d8d316a2c12c1daceca1d9fd257f6f57bc	2019-12-22 17:12:43 +02:00
Andrew Harding	cb89496569	storagenode/trust: wire up list into pool - also updated ping chore to pick up trust changes - fixed small typo in blueprint - fixed flags for storj-sim - wired up changes to testplanet Change-Id: I02982f3a63a1b4150b82a009ee126b25ed51917d	2019-12-13 20:32:50 +00:00
Rafael Antonio Ribeiro Gomes	da39c71d35	storagenode: add new metric satellite.request (#3610) * storagenode: add new metric satellite.request * storagenode: metrics fixed * switch from Counter to Meter	2019-11-19 18:11:31 -03:00
Egon Elbre	ee6c1cac8a	private: rename internal to private (#3573)	2019-11-14 21:46:15 +02:00
Jeff Wendling	ebcd37c572	storagenode/contact: fix connection leak with contact checkin Change-Id: If86002557144d5d8dbff939d2b6a2dfec6537577	2019-11-06 18:00:09 +00:00
littleskunk	7eb6724c92	logging: unify logging around satellite ID, node ID and piece ID (#3491) * logging: unify logging around satellite ID, node ID and piece ID * unify segment index	2019-11-05 22:04:07 +01:00
Jennifer Li Johnson	11f0ea3258	5s (#3477)	2019-11-04 16:20:31 -05:00
Jennifer Li Johnson	aa7d15a365	storagenode/contact: exponential backoff retries for pinging Satellites (#3372)	2019-11-04 16:03:21 -05:00
Egon Elbre	87687938d1	storagenode/contact: fix panic in ping satellites (#3447)	2019-11-01 16:20:53 +01:00
Jess G	4d85b11574	satellite/contact: improve errors in contact endpoints (#3356) * improve errors in satellite contact endpoints * add changes per CR comments * update pingback method so it still updates node table * fix err and returns * fix zap logging to be better	2019-10-30 11:57:21 -07:00
Egon Elbre	93353df4d6	internal/sync2: make Fence accept context (#3393)	2019-10-28 16:04:31 +02:00
paul cannon	1469f7f41f	storagenode/contact: wait for UpdateSelf before start (#3332) When the contact chore starts running before the monitor service has provided any useful capacity data, the first outgoing contact has not-very-helpful data for the satellite. This change causes the contact chore to wait until capacity data is available. The wait should be quite short in all reasonable cases: even when a node starts with a lot of stored pieces and no cached spaceUsedDB data, new data will have been calculated and cached by the call to `peer.Storage2.CacheService.Init(ctx)` in `storagenode.cmdRun()` before `peer.Run(ctx)`. Change-Id: Ibc26d5c1fc10a23006c00bc3f13ff6cf71f8bf1d	2019-10-26 12:16:25 -05:00
Jennifer Li Johnson	b185dbbee2	satellite/discovery: remove discovery related code (#3175)	2019-10-14 10:57:01 -04:00
Cameron	d17be58237	remove random sleep in storagenode contact (#3243)	2019-10-11 16:44:18 -04:00
littleskunk	b2e328f118	storagenode/dashboard: update online status (#3168)	2019-10-03 20:31:39 +02:00
Jennifer Li Johnson	29b96a666b	internal/testplanet: fix conn leak (#3132)	2019-09-27 09:47:57 -06:00
Jeff Wendling	098cbc9c67	all: use pkg/rpc instead of pkg/transport all of the packages and tests work with both grpc and drpc. we'll probably need to do some jenkins pipelines to run the tests with drpc as well. most of the changes are really due to a bit of cleanup of the pkg/transport.Client api into an rpc.Dialer in the spirit of a net.Dialer. now that we don't need observers, we can pass around stateless configuration to everything rather than stateful things that issue observations. it also adds a DialAddressID for the case where we don't have a pb.Node, but we do have an address and want to assert some ID. this happened pretty frequently, and now there's no more weird contortions creating custom tls options, etc. a lot of the other changes are being consistent/using the abstractions in the rpc package to do rpc style things like finding peer information, or checking status codes. Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412	2019-09-25 15:37:06 -06:00
Jennifer Li Johnson	d2502bb51b	Adds tests for kad replacement and restores kad operator configs (#3094) * test that all nodes can check in with all satellites * keep kademlia config * add untrusted satellite test * use getversion * remove kademlia config changes in test-sim-backwards.sh * add kademlia flags back to storj-sim storagenode * reset kademlia flags in storagenode entrypoint	2019-09-20 16:02:23 -04:00
Jennifer Li Johnson	724bb44723	Remove Kademlia dependencies from Satellite and Storagenode (#2966) What: cmd/inspector/main.go: removes kad commands internal/testplanet/planet.go: Waits for contact chore to finish satellite/contact/nodesservice.go: creates an empty nodes service implementation satellite/contact/service.go: implements Local and FetchInfo methods & adds external address config value satellite/discovery/service.go: replaces kad.FetchInfo with contact.FetchInfo in Refresh() & removes Discover() satellite/peer.go: sets up contact service and endpoints storagenode/console/service.go: replaces nodeID with contact.Local() storagenode/contact/chore.go: replaces routing table with contact service storagenode/contact/nodesservice.go: creates empty implementation for ping and request info nodes service & implements RequestInfo method storagenode/contact/service.go: creates a service to return the local node and update its own capacity storagenode/monitor/monitor.go: uses contact service in place of routing table storagenode/operator.go: moves operatorconfig from kad into its own setup storagenode/peer.go: sets up contact service, chore, pingstats and endpoints satellite/overlay/config.go: changes NodeSelectionConfig.OnlineWindow default to 4hr to allow for accurate repair selection Removes kademlia setups in: cmd/storagenode/main.go cmd/storj-sim/network.go internal/testplanet/planet.go internal/testplanet/satellite.go internal/testplanet/storagenode.go satellite/peer.go scripts/test-sim-backwards.sh scripts/testdata/satellite-config.yaml.lock storagenode/inspector/inspector.go storagenode/peer.go storagenode/storagenodedb/database.go Why: Replacing Kademlia Please describe the tests: • internal/testplanet/planet_test.go: TestBasic: assert that the storagenode can check in with the satellite without any errors TestContact: test that all nodes get inserted into both satellites' overlay cache during testplanet setup • satellite/contact/contact_test.go: TestFetchInfo: Tests that the FetchInfo method returns the correct info • storagenode/contact/contact_test.go: TestNodeInfoUpdated: tests that the contact chore updates the node information TestRequestInfoEndpoint: tests that the Request info endpoint returns the correct info Please describe the performance impact: Node discovery should be at least slightly more performant since each node connects directly to each satellite and no longer needs to wait for bootstrapping. It probably won't be faster in real time on start up since each node waits a random amount of time (less than 1 hr) to initialize its first connection (jitter).	2019-09-19 15:56:34 -04:00
Jess G	93788e5218	remove kademlia: create upsert query to update uptime (#2999) * create upsert query for check-in method * add tests * fix lint err * add benchmark test for db query * fix lint and tests * add a unit test, fix lint * add address to tests * replace print w/ b.Fatal * refactor query per CR comments * fix disqualified, only set if null * fix query * add version to updatecheckin query * fix version * fix tests * change version for tests * add version to tests * add IP, add transport, mv unit test * use node.address as arg * add last ip * fix lint	2019-09-19 11:37:31 -07:00

58 Commits