Commit Graph

1275 Commits

Author SHA1 Message Date
Ivan Fraixedes
e4a220347a
uplink: Suppress one metainfo call on delete (#3511)
Change signature of metainfo DeleteObject to get rid of an extra call to
kvmetainfo GetBucket method and eliminate one round trip to the
satellite when deleting objects.
2019-11-07 10:39:40 +01:00
Jeff Wendling
f62107d3e9
pkg/rpc: fix grpc dial timeouts (#3517)
grpc doesn't exit dials right away if the context dialer
returns an error. since that's the only spot where we were
enforcing dial timeouts, dials could just leak for an
unknown amount of time.

add a timeout above the grpc dial because that's the documented
way that grpc expects to be canceled.

Change-Id: Ic47ac61ce8a5f721510cc2c4584f63d43fe4f2d5
2019-11-06 16:42:20 -07:00
Michal Niewrzal
ab5c623ac7
cli: should return non-zero code for error (#3469) 2019-11-05 06:01:26 -08:00
Yaroslav Vorobiov
35edc2bcc3 satellite/payments: invoice creation (#3468) 2019-11-05 15:16:02 +02:00
Jeff Wendling
17e9044c0f pkg/rpc/rpcpeer: check both drpc and grpc for peers on a context
we don't know if an incoming connection is from drpc or grpc during
the migration time, so check both.

Change-Id: I2418dde8b651dcc4a23726057178465224a48103
2019-11-01 17:04:53 -06:00
JT Olio
41c0093e5b drpc: enable by default (#3452) 2019-11-01 22:43:24 +01:00
Jennifer Li Johnson
76b64b79ba
cmd/identity: allow using redis for RevocationDB (#3259) 2019-11-01 13:27:47 -04:00
Michal Niewrzal
8786a37f89
uplink/storage: use Batch to optimize upload requests (#3408) 2019-10-29 08:49:16 -07:00
Ethan Adams
e54d290d2e satellite/gracefulexit: Add signatures for success/failed exit finished messages. (#3368)
* add signatures, fix process loop bug, move delete to on success

* added tests for signatures

* PR comment updates

* fixed setting reason by default.

* updates for PR comments

* added signed failure when verification fails

* moved to sign_test

* fix panic

* removed testplanet from test
2019-10-25 16:36:26 -04:00
Natalie Villasana
696c567e89
satellite/gracefulexit: add piece hash validation for successful transfer (#3313) 2019-10-24 15:38:40 -04:00
Yingrong Zhao
fa1ac24e19
satellite/gracefulexit: add failure threshold check (#3329)
* add overall failure percentage check and inactive time frame check before sending a response to sno

* update comment

* delete node from transfer queue if it has been inactive for too long

* fix linting error

* add test config value

* fix nil pointer

* add config value into testplanet

* add unit test for overall failure threshold

* move timeframe threshold to chore

* update protolock

* add chore test

* add per piece failure count logic

* change config name from EndpointMaxFailures to MaxFailuresPerPiece

* address comments

* fix linting error

* add error handling for no row returned from progress table

* fix test for graceful exit chore on storagenode

* fix typo InActive -> Inactive

* improve readability for failure threshold calculation

* update config lock

* change error handling for GetProgress in graceful exit endpoint on the satellite side

* return proper rpc error in endpoint

* add check in chore test for checking finish timestamp and queue
2019-10-24 12:24:42 -04:00
Jeff Wendling
51d5d8656a pkg/rpc: drpc connection pooling
keep a pool of connections open when dialing for drpc. this
makes it so that long lived clients (like lib/uplink's Project)
don't continue to use a bad connection forever. it also allows
for concurrent rpcs.

Change-Id: If649b286050e4f09c413fadc3e1ce88f5fc6e600
2019-10-22 18:15:24 -06:00
JT Olio
2c6fa3c5f8
pkg/rpc: remove read/write deadlines as a mechanism for request timeouts (#3335)
libuplink was incorrectly setting timeouts to 10 seconds still, but
should have been at least 10 minutes. the order sender was setting them
to 1 hour. we don't want timeouts in uplink-side logic as it establishes
a minimum rate on tcp streams.

instead of all of this, just use tcp keep alive. tcp keep alive packets are
sent every 15 seconds and if the peer stops responding the connection
dies. this is enabled by default with go. this will kill tcp connections
when they stop working.

Change-Id: I3d7ad49f71950b3eb43044eedf4b17993116045b
2019-10-22 17:57:24 -06:00
Ethan Adams
3e0d12354a
storagenode/gracefulexit: Implement storage node graceful exit worker - part 1 (#3322) 2019-10-22 16:42:21 -04:00
Michal Niewrzal
04c2454c71
satellite/metainfo: pass streamID/segmentID between Batch request/response (#3311) 2019-10-22 03:23:22 -07:00
Bryan White
f468816f13
{internal/version,versioncontrol,cmd/storagenode-updater}: add rollout to storagenode updater (#3276) 2019-10-21 12:50:59 +02:00
Bryan White
243ba1cb17
{versioncontrol,internal/version,cmd/*}: refactor version control (#3253) 2019-10-20 09:56:23 +02:00
Egon Elbre
f929310add pkg/rpc/rpcstatus: fix drpc grpc compatibility (#3306)
When code is compiled without -tags=drpc the statuses for drpc server
weren't handled, which meant an uplink using -tags=drpc didn't get the
correct status code.
2019-10-17 15:21:20 -04:00
Yingrong Zhao
87e3764390
storagenode/cmd: add exit-status command for graceful exit (#3264)
* add exit-status command

* remove todo and fix format

* fix status display

* change startExit to exit progress

* fix linting error

* add successful column in exit progress

* fix test

* remove extra new line

* fix TYPOS

* format the percentage better
2019-10-15 18:07:32 -04:00
Ethan Adams
37ab84355f
satellite/gracefulexit: protobuf field name updates (#3284)
rename piece_id to original_piece_id
2019-10-15 15:59:12 -04:00
Ethan Adams
1ad2ba7e3e
storagenode/gracefulexit: Add graceful exit chore and worker. (#3262)
Adds graceful exit chore and worker for V3-2614
2019-10-15 11:29:47 -04:00
Marc Schubert
93d5eeda31 Update dial.go (#3261)
What:
Bring back partial nodeID to debug.trace-out

Why:
The information is useful for interpreting the trace file and was there up until drpc. I am just bringing it back.
https://github.com/storj/storj/blob/v0.21.3/pkg/transport/transport.go#L76

Please describe the tests:

Test 1:
Test 2:
Please describe the performance impact:
No impact.
2019-10-14 15:44:15 -06:00
JT Olio
694177e217 pkg/pb: regen gracefulexit.pb.go (#3270) 2019-10-14 17:06:04 -04:00
Jennifer Li Johnson
b185dbbee2
satellite/discovery: remove discovery related code (#3175) 2019-10-14 10:57:01 -04:00
JT Olio
6ede140df1
pkg/rpc: defeat MITM attacks in most cases (#3215)
This change adds a trusted registry (via the source code) of node address to node id mappings (currently only for well known Satellites) to defeat MITM attacks to Satellites. It also extends the uplink UI such that when entering a satellite address by hand, a node id prefix can also be added to defeat MITM attacks with unknown satellites.

When running uplink setup, satellite addresses can now be of the form 12EayRS2V1k@us-central-1.tardigrade.io (not even using a full node id) to ensure that the peer contacted is the peer that was expected. When using a known satellite address, the known node ids are used if no override is provided.
2019-10-12 14:34:41 -06:00
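The address form introduced in 6ede140df1 above (a node id prefix, an "@", then the satellite address) can be illustrated with a small parser. This is only a sketch: splitSatelliteAddress and peerMatches are hypothetical helpers, the peer ids in main are made up, and the real check in pkg/rpc may compare ids differently than a plain string prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSatelliteAddress splits "idprefix@host" into its two parts. An input
// without "@" yields an empty prefix and the input as the address.
func splitSatelliteAddress(s string) (idPrefix, address string) {
	if i := strings.Index(s, "@"); i >= 0 {
		return s[:i], s[i+1:]
	}
	return "", s
}

// peerMatches reports whether the id presented by the peer starts with the
// expected prefix. An empty prefix means there is nothing to check here.
func peerMatches(peerID, idPrefix string) bool {
	return idPrefix == "" || strings.HasPrefix(peerID, idPrefix)
}

func main() {
	prefix, address := splitSatelliteAddress("12EayRS2V1k@us-central-1.tardigrade.io")
	fmt.Println(prefix, address)

	// Both peer ids below are invented for the example.
	fmt.Println(peerMatches("12EayRS2V1kEXAMPLE", prefix)) // prints "true"
	fmt.Println(peerMatches("1OtherNodeEXAMPLE", prefix))  // prints "false"
}
```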
Ethan Adams
a1275746b4
satellite/gracefulexit: Implement the 'process' endpoint on the satellite (#3223) 2019-10-11 17:18:05 -04:00
Isaac Hess
9256399872
CI: test drpc and grpc (#3163)
* wip: test drpc

* Add parallel integration test

* Add jenkinsfile.drpc

* Remove unnecessary jenkinsfile items

* testing: GOFLAGS=-drpc (#3236)

* Use GOFLAGS

* add debug

* revert tags

* revert changes

* move goflags to the correct place

* add sanity check
2019-10-11 08:30:06 -06:00
Yingrong Zhao
743a0fc38b storagenode/cmd: create start graceful exit CLI (#3202) 2019-10-11 09:58:12 -04:00
Ethan Adams
447c219d92
satellite/gracefulexit: Add protobuf definitions for communication between storage node and satellite (#3201) 2019-10-08 13:42:56 -04:00
Jennifer Li Johnson
7ceaabb18e
Delete Bootstrap and Kademlia (#2974) 2019-10-04 16:48:41 -04:00
Jeff Wendling
4fab22d691 pkg/rpc: don't leak goroutines during a drpc dial
we spawned a goroutine to wait on the context's done
channel sending the error afterward, but we forgot
to ensure the context was eventually done, so the
goroutine would be leaked until then.

instead, we can just do a select on two channels to
get the error rather than spawn a goroutine which
makes it impossible to leak a goroutine.

Change-Id: I2fdba206ae6ff7a3441b00708b86b36dfeece2b5
2019-10-04 20:09:36 +00:00
Jeff Wendling
64e43e555e pkg/rpc: return context error if ready after DialContext fails
the net package does not make it easy to know if DialContext
failed because the context was done. it's important for some
of our tests that canceled contexts are detected as such, so
we accept the small race that's arguably correct (the context
must be canceled asynchronously) to ensure we always return
the context error if available.

Change-Id: I058064d5c666e5353b74fb5bd300bf7abe537ff5
2019-10-04 20:09:00 +00:00
Jeff Wendling
c9e0aa7c70 pkg/kademlia: make tests run/work with drpc
Change-Id: I69372fd8f0d52913e1ad2cf7d01115460ba8eeda
2019-10-03 15:33:25 -06:00
littleskunk
b2e328f118 storagenode/dashboard: update online status (#3168) 2019-10-03 20:31:39 +02:00
Isaac Hess
94c7df0d6e
pkg/rpc/rpcstatus: Fix return type (#3162) 2019-10-02 14:46:18 -06:00
Jennifer Li Johnson
29b96a666b
internal/testplanet: fix conn leak (#3132) 2019-09-27 09:47:57 -06:00
Jeff Wendling
93349f247e pkg/rpc: add WithInsecure when doing non-tls dials
Change-Id: I993f223f4ac78824b75a7725342ebf2ae0f74254
2019-09-27 09:07:14 -06:00
Bryan White
c8aa821ccb
pkg/certificates: move certificate package to root (#3107) 2019-09-26 09:11:05 -07:00
Jeff Wendling
098cbc9c67 all: use pkg/rpc instead of pkg/transport
all of the packages and tests work with both grpc and
drpc. we'll probably need to do some jenkins pipelines
to run the tests with drpc as well.

most of the changes are really due to a bit of cleanup
of the pkg/transport.Client api into an rpc.Dialer in
the spirit of a net.Dialer. now that we don't need
observers, we can pass around stateless configuration
to everything rather than stateful things that issue
observations. it also adds a DialAddressID for the
case where we don't have a pb.Node, but we do have an
address and want to assert some ID. this happened
pretty frequently, and now there's no more weird
contortions creating custom tls options, etc.

a lot of the other changes are being consistent/using
the abstractions in the rpc package to do rpc style
things like finding peer information, or checking
status codes.

Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412
2019-09-25 15:37:06 -06:00
Bryan White
a7040647a4
run certificate authorization endpoint (#3108) 2019-09-23 15:19:13 -07:00
Jeff Wendling
d32d85a717 pkg/listenmux: resolve deadlock in test
it was possible, because we spawned Run before we did any calls
to Route, that the listenmux would send multiple connections to
the default listener. Fix that by ensuring we call Route before
we call Run.

Change-Id: Ie8fd754997975969a99fd2a3f8d3010c24cdc73d
2019-09-20 21:16:59 +00:00
Jeff Wendling
a20a7db793 pkg/rpc: build tag based selection of rpc details
It provides an abstraction around the rpc details so that one
can use drpc or grpc with the same code. It subsumes using the
protobuf package directly for client interfaces as well as
the pkg/transport package to perform dials.

Change-Id: I8f5688bd71be8b0c766f13029128a77e5d46320b
2019-09-20 21:07:33 +00:00
Jennifer Li Johnson
724bb44723
Remove Kademlia dependencies from Satellite and Storagenode (#2966)
What:

cmd/inspector/main.go: removes kad commands
internal/testplanet/planet.go: Waits for contact chore to finish
satellite/contact/nodesservice.go: creates an empty nodes service implementation
satellite/contact/service.go: implements Local and FetchInfo methods & adds external address config value
satellite/discovery/service.go: replaces kad.FetchInfo with contact.FetchInfo in Refresh() & removes Discover()
satellite/peer.go: sets up contact service and endpoints
storagenode/console/service.go: replaces nodeID with contact.Local()
storagenode/contact/chore.go: replaces routing table with contact service
storagenode/contact/nodesservice.go: creates empty implementation for ping and request info nodes service & implements RequestInfo method
storagenode/contact/service.go: creates a service to return the local node and update its own capacity
storagenode/monitor/monitor.go: uses contact service in place of routing table
storagenode/operator.go: moves operatorconfig from kad into its own setup
storagenode/peer.go: sets up contact service, chore, pingstats and endpoints
satellite/overlay/config.go: changes NodeSelectionConfig.OnlineWindow default to 4hr to allow for accurate repair selection
Removes kademlia setups in:

cmd/storagenode/main.go
cmd/storj-sim/network.go
internal/testplanet/planet.go
internal/testplanet/satellite.go
internal/testplanet/storagenode.go
satellite/peer.go
scripts/test-sim-backwards.sh
scripts/testdata/satellite-config.yaml.lock
storagenode/inspector/inspector.go
storagenode/peer.go
storagenode/storagenodedb/database.go
Why: Replacing Kademlia

Please describe the tests:
• internal/testplanet/planet_test.go:

TestBasic: assert that the storagenode can check in with the satellite without any errors
TestContact: test that all nodes get inserted into both satellites' overlay cache during testplanet setup
• satellite/contact/contact_test.go:

TestFetchInfo: Tests that the FetchInfo method returns the correct info
• storagenode/contact/contact_test.go:

TestNodeInfoUpdated: tests that the contact chore updates the node information
TestRequestInfoEndpoint: tests that the Request info endpoint returns the correct info
Please describe the performance impact: Node discovery should be at least slightly more performant since each node connects directly to each satellite and no longer needs to wait for bootstrapping. It probably won't be faster in real time on start up since each node waits a random amount of time (less than 1 hr) to initialize its first connection (jitter).
2019-09-19 15:56:34 -04:00
Jess G
93788e5218
remove kademlia: create upsert query to update uptime (#2999)
* create upsert query for check-in method

* add tests

* fix lint err

* add benchmark test for db query

* fix lint and tests

* add a unit test, fix lint

* add address to tests

* replace print w/ b.Fatal

* refactor query per CR comments

* fix disqualified, only set if null

* fix query

* add version to updatecheckin query

* fix version

* fix tests

* change version for tests

* add version to tests

* add IP, add transport, mv unit test

* use node.address as arg

* add last ip

* fix lint
2019-09-19 11:37:31 -07:00
Kaloyan Raev
45df0c5340
storagenode/process: respond to Windows Service events (#3025) 2019-09-19 19:37:40 +03:00
JT Olio
946ec201e2
metainfo: move api keys to part of the request (#3069)
What: we move api keys out of the grpc connection-level metadata on the client side and into the request protobufs directly. the server side still supports both mechanisms for backwards compatibility.

Why: dRPC won't support connection-level metadata. the only thing we currently use connection-level metadata for is api keys. we need to move all information needed by a request into the request protobuf itself for drpc support. check out the .proto changes for the main details.

One fun side-fact: Did you know that protobuf fields 1-15 are special and only use one byte for both the field number and type? Additionally did you know we don't use field 15 anywhere yet? So the new request header will use field 15, and should use field 15 on all protobufs going forward.

Please describe the tests: all existing tests should pass

Please describe the performance impact: none
2019-09-19 10:19:29 -06:00
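The side-fact in 946ec201e2 above about protobuf fields 1-15 follows from the wire format: the field key is the varint encoding of (field number << 3) | wire type, and a varint fits in one byte only while its value is below 128. A small sketch to check this, where tagSize is a helper written for this example and not part of any protobuf library:

```go
package main

import "fmt"

// tagSize returns how many bytes the protobuf field key occupies on the
// wire. The key is the varint encoding of (fieldNumber << 3) | wireType.
func tagSize(fieldNumber, wireType uint64) int {
	key := fieldNumber<<3 | wireType
	size := 1
	// Each varint byte carries 7 bits of payload.
	for key >= 0x80 {
		key >>= 7
		size++
	}
	return size
}

func main() {
	const lengthDelimited = 2 // wire type used for embedded messages
	for _, field := range []uint64{1, 15, 16} {
		fmt.Printf("field %d: %d byte(s)\n", field, tagSize(field, lengthDelimited))
	}
	// Output:
	// field 1: 1 byte(s)
	// field 15: 1 byte(s)
	// field 16: 2 byte(s)
}
```

Field 15 is the largest field number whose key stays under 128 for every wire type, which is why using it for a header present on every request costs only one byte of key overhead.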
Jess G
695de9dcd7
rm noisy debug logs that we don't need (#3083) 2019-09-18 12:43:57 -07:00
Egon Elbre
186e67e056 pkg/transport: set default timeout to 10 minutes (#3075) 2019-09-18 11:56:23 -04:00
Maximillian von Briesen
574c96c350
satellite/metainfo: Verify storagenode signature on satellite upload (#2985) 2019-09-18 09:50:33 -04:00
Jess G
7c203b4884
add satelliteSystem to testplanet and update tests (#3066) 2019-09-17 13:14:49 -07:00