* add overall failure percentage check and inactive time frame check before sending a response to sno
* update comment
* delete node from transfer queue if it has been inactive for too long
* fix linting error
* add test config value
* fix nil pointer
* add config value into testplanet
* add unit test for overall failure threshold
* move timeframe threshold to chore
* update protolock
* add chore test
* add per peiece failure count logic
* change config name from EndpointMaxFailures to MaxFailuresPerPiece
* address comments
* fix linting error
* add error handling for no row returned from progress table
* fix test for graceful exit chore on storagenode
* fix typo InActive -> Inactive
* improve readability for failure threshold calculation
* update config lock
* change error handling for GetProgress in graceful exit endpoint on the satellite side
* return proper rpc error in endpoint
* add check in chore test for checking finish timestamp and queue
libuplink was incorrectly setting timeouts to 10 seconds still, but
should have been at least 10 minutes. the order sender was setting them
to 1 hour. we don't want timeouts in uplink-side logic as it establishes
a minimum rate on tcp streams.
instead of all of this, just use tcp keep alive. tcp keep alive packets are
sent every 15 seconds and if the peer stops responding the connection
dies. this is enabled by default with go. this will kill tcp connections
when they stop working.
Change-Id: I3d7ad49f71950b3eb43044eedf4b17993116045b
The upload code currently updates the usage in a deferred call to saveOrder().
The consequence is that in the success case, the RPC is completed before
the usage has been updated.
This change repurposes the deferred call to update usage in the
failure case, while explicitly updating the usage before completing the
RPC.
This fixes some test flakiness when using dRPC. gRPC waits until the final status is written before a Recv call completes, and the final status is written by the server after the handler function has exited. In practice this means that the client is blocked until the defer call is also finished. So this change will not change performance at all.
It has two advantages:
(1) It fixes test flakiness
and, more importantly:
(2) reduces the chances that someone will accidentally write a flaky test in the future
* add exit-status command
* remove todo and fix format
* fix status display
* change startExit to exit progress
* fix linting error
* add successful column in exit progress
* fix test
* remove extra new line
* fix TYPOS
* format the percentage better
This change adds a trusted registry (via the source code) of node address to node id mappings (currently only for well known Satellites) to defeat MITM attacks to Satellites. It also extends the uplink UI such that when entering a satellite address by hand, a node id prefix can also be added to defeat MITM attacks with unknown satellites.
When running uplink setup, satellite addresses can now be of the form 12EayRS2V1k@us-central-1.tardigrade.io (not even using a full node id) to ensure that the peer contacted is the peer that was expected. When using a known satellite address, the known node ids are used if no override is provided.
the net package does not make it easy to know if DialContext
failed because the context was done. it's important for some
of our tests that canceled contexts are detected as such, so
we accept the small race that's arguably correct (the context
must be canceled asynchronously) to ensure we always return
the context error if available.
Change-Id: I058064d5c666e5353b74fb5bd300bf7abe537ff5
all of the packages and tests work with both grpc and
drpc. we'll probably need to do some jenkins pipelines
to run the tests with drpc as well.
most of the changes are really due to a bit of cleanup
of the pkg/transport.Client api into an rpc.Dialer in
the spirit of a net.Dialer. now that we don't need
observers, we can pass around stateless configuration
to everything rather than stateful things that issue
observations. it also adds a DialAddressID for the
case where we don't have a pb.Node, but we do have an
address and want to assert some ID. this happened
pretty frequently, and now there's no more weird
contortions creating custom tls options, etc.
a lot of the other changes are being consistent/using
the abstractions in the rpc package to do rpc style
things like finding peer information, or checking
status codes.
Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412
* storagenode/storagenodedb: Migrate to separate dbs
* storagenode/storagenodedb: Add migration to drop versions tables
* Put drop table statements into a transaction.
* Fix CI errors.
* Fix CI errors.
* Changes requested from PR feedback.
* storagenode/storagenodedb: fix tx commit
* test that all nodes can check in with all satellites
* keep kademlia config
* add untrusted satellite test
* use getversion
* remove kademlia config changes in test-sim-backwards.sh
* add kademlia flags back to storj-sim storagenode
* reset kademlia flags in storagenode entrypoint
What:
cmd/inspector/main.go: removes kad commands
internal/testplanet/planet.go: Waits for contact chore to finish
satellite/contact/nodesservice.go: creates an empty nodes service implementation
satellite/contact/service.go: implements Local and FetchInfo methods & adds external address config value
satellite/discovery/service.go: replaces kad.FetchInfo with contact.FetchInfo in Refresh() & removes Discover()
satellite/peer.go: sets up contact service and endpoints
storagenode/console/service.go: replaces nodeID with contact.Local()
storagenode/contact/chore.go: replaces routing table with contact service
storagenode/contact/nodesservice.go: creates empty implementation for ping and request info nodes service & implements RequestInfo method
storagenode/contact/service.go: creates a service to return the local node and update its own capacity
storagenode/monitor/monitor.go: uses contact service in place of routing table
storagenode/operator.go: moves operatorconfig from kad into its own setup
storagenode/peer.go: sets up contact service, chore, pingstats and endpoints
satellite/overlay/config.go: changes NodeSelectionConfig.OnlineWindow default to 4hr to allow for accurate repair selection
Removes kademlia setups in:
cmd/storagenode/main.go
cmd/storj-sim/network.go
internal/testplane/planet.go
internal/testplanet/satellite.go
internal/testplanet/storagenode.go
satellite/peer.go
scripts/test-sim-backwards.sh
scripts/testdata/satellite-config.yaml.lock
storagenode/inspector/inspector.go
storagenode/peer.go
storagenode/storagenodedb/database.go
Why: Replacing Kademlia
Please describe the tests:
• internal/testplanet/planet_test.go:
TestBasic: assert that the storagenode can check in with the satellite without any errors
TestContact: test that all nodes get inserted into both satellites' overlay cache during testplanet setup
• satellite/contact/contact_test.go:
TestFetchInfo: Tests that the FetchInfo method returns the correct info
• storagenode/contact/contact_test.go:
TestNodeInfoUpdated: tests that the contact chore updates the node information
TestRequestInfoEndpoint: tests that the Request info endpoint returns the correct info
Please describe the performance impact: Node discovery should be at least slightly more performant since each node connects directly to each satellite and no longer needs to wait for bootstrapping. It probably won't be faster in real time on start up since each node waits a random amount of time (less than 1 hr) to initialize its first connection (jitter).
* create upsert query for check-in method
* add tests
* fix lint err
* add benchmark test for db query
* fix lint and tests
* add a unit test, fix lint
* add address to tests
* replace print w/ b.Fatal
* refactor query per CR comments
* fix disqualified, only set if null
* fix query
* add version to updatecheckin query
* fix version
* fix tests
* change version for tests
* add version to tests
* add IP, add transport, mv unit test
* use node.address as arg
* add last ip
* fix lint
* Split the info.db database into multiple DBs using Backup API.
* Remove location. Prev refactor assumed we would need this but don't.
* Added VACUUM to reclaim space after splitting storage node databases.
* Added unique names to SQLite3 connection hooks to fix testplanet.
* Moving DB closing to the migration step.
* Removing the closing of the versions DB. It's already getting closed.
* Swapping the database connection references on reconnect.
* Moved sqlite closing logic away from the boltdb closing logic.
* Moved sqlite closing logic away from the boltdb closing logic.
* Remove certificate and vouchers from DB split migration.
* Removed vouchers and bumped up the migration version.
* Use same constructor in tests for storage node databases.
* Use same constructor in tests for storage node databases.
* Adding method to access underlining SQL database connections and cleanup
* Adding logging for migration diagnostics.
* Moved migration closing database logic to minimize disk usage.
* Cleaning up error handling.
* Fix missing copyright.
* Fix linting error.
* Add test for migration 21 (#3012)
* Refactoring migration code into a nicer to use object.
* Refactoring migration code into a nicer to use object.
* Fixing broken migration test.
* Removed unnecessary code that is no longer needed now that we close DBs.
* Removed unnecessary code that is no longer needed now that we close DBs.
* Fixed bug where an invalid database path was being opened.
* Fixed linting errors.
* Renamed VersionsDB to LegacyInfoDB and refactored DB lookup keys.
* Renamed VersionsDB to LegacyInfoDB and refactored DB lookup keys.
* Fix migration test. NOTE: This change does not address new tables satellites and satellite_exit_progress
* Removing v22 migration to move into it's own PR.
* Removing v22 migration to move into it's own PR.
* Refactored schema, rebind and configure functions to be re-useable.
* Renamed LegacyInfoDB to DeprecatedInfoDB.
* Cleaned up closeDatabase function.
* Renamed storageNodeSQLDB to migratableDB.
* Switched from using errs.Combine() to errs.Group in closeDatabases func.
* Removed constructors from storage node data access objects.
* Reformatted usage of const.
* Fixed broken test snapshots.
* Fixed linting error.
this is a trivial operation for storagenode/console, as it doesn't
really need or use kademlia in the first place.
What:
Removes kademlia from storagenode/console
Why:
We are in the process of getting rid of kademlia, and this is one place where it's particularly easy.
Please describe the tests:
Existing tests exercise storagenode/console behavior; if they continue to work, everything here should be tested satisfactorily.
Please describe the performance impact:
None