the net package does not make it easy to know if DialContext
failed because the context was done. it's important for some
of our tests that canceled contexts are detected as such, so
we accept the small race that's arguably correct (the context
must be canceled asynchronously) to ensure we always return
the context error if available.
Change-Id: I058064d5c666e5353b74fb5bd300bf7abe537ff5
all of the packages and tests work with both grpc and
drpc. we'll probably need to do some jenkins pipelines
to run the tests with drpc as well.
most of the changes are really due to a bit of cleanup
of the pkg/transport.Client api into an rpc.Dialer in
the spirit of a net.Dialer. now that we don't need
observers, we can pass around stateless configuration
to everything rather than stateful things that issue
observations. it also adds a DialAddressID for the
case where we don't have a pb.Node, but we do have an
address and want to assert some ID. this happened
pretty frequently, and now there's no more weird
contortions creating custom tls options, etc.
a lot of the other changes are being consistent/using
the abstractions in the rpc package to do rpc style
things like finding peer information, or checking
status codes.
Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412
* storagenode/storagenodedb: Migrate to separate dbs
* storagenode/storagenodedb: Add migration to drop versions tables
* Put drop table statements into a transaction.
* Fix CI errors.
* Fix CI errors.
* Changes requested from PR feedback.
* storagenode/storagenodedb: fix tx commit
* test that all nodes can check in with all satellites
* keep kademlia config
* add untrusted satellite test
* use getversion
* remove kademlia config changes in test-sim-backwards.sh
* add kademlia flags back to storj-sim storagenode
* reset kademlia flags in storagenode entrypoint
What:
cmd/inspector/main.go: removes kad commands
internal/testplanet/planet.go: Waits for contact chore to finish
satellite/contact/nodesservice.go: creates an empty nodes service implementation
satellite/contact/service.go: implements Local and FetchInfo methods & adds external address config value
satellite/discovery/service.go: replaces kad.FetchInfo with contact.FetchInfo in Refresh() & removes Discover()
satellite/peer.go: sets up contact service and endpoints
storagenode/console/service.go: replaces nodeID with contact.Local()
storagenode/contact/chore.go: replaces routing table with contact service
storagenode/contact/nodesservice.go: creates empty implementation for ping and request info nodes service & implements RequestInfo method
storagenode/contact/service.go: creates a service to return the local node and update its own capacity
storagenode/monitor/monitor.go: uses contact service in place of routing table
storagenode/operator.go: moves operatorconfig from kad into its own setup
storagenode/peer.go: sets up contact service, chore, pingstats and endpoints
satellite/overlay/config.go: changes NodeSelectionConfig.OnlineWindow default to 4hr to allow for accurate repair selection
Removes kademlia setups in:
cmd/storagenode/main.go
cmd/storj-sim/network.go
internal/testplane/planet.go
internal/testplanet/satellite.go
internal/testplanet/storagenode.go
satellite/peer.go
scripts/test-sim-backwards.sh
scripts/testdata/satellite-config.yaml.lock
storagenode/inspector/inspector.go
storagenode/peer.go
storagenode/storagenodedb/database.go
Why: Replacing Kademlia
Please describe the tests:
• internal/testplanet/planet_test.go:
TestBasic: assert that the storagenode can check in with the satellite without any errors
TestContact: test that all nodes get inserted into both satellites' overlay cache during testplanet setup
• satellite/contact/contact_test.go:
TestFetchInfo: Tests that the FetchInfo method returns the correct info
• storagenode/contact/contact_test.go:
TestNodeInfoUpdated: tests that the contact chore updates the node information
TestRequestInfoEndpoint: tests that the Request info endpoint returns the correct info
Please describe the performance impact: Node discovery should be at least slightly more performant since each node connects directly to each satellite and no longer needs to wait for bootstrapping. It probably won't be faster in real time on start up since each node waits a random amount of time (less than 1 hr) to initialize its first connection (jitter).
* create upsert query for check-in method
* add tests
* fix lint err
* add benchmark test for db query
* fix lint and tests
* add a unit test, fix lint
* add address to tests
* replace print w/ b.Fatal
* refactor query per CR comments
* fix disqualified, only set if null
* fix query
* add version to updatecheckin query
* fix version
* fix tests
* change version for tests
* add version to tests
* add IP, add transport, mv unit test
* use node.address as arg
* add last ip
* fix lint
* Split the info.db database into multiple DBs using Backup API.
* Remove location. Prev refactor assumed we would need this but don't.
* Added VACUUM to reclaim space after splitting storage node databases.
* Added unique names to SQLite3 connection hooks to fix testplanet.
* Moving DB closing to the migration step.
* Removing the closing of the versions DB. It's already getting closed.
* Swapping the database connection references on reconnect.
* Moved sqlite closing logic away from the boltdb closing logic.
* Moved sqlite closing logic away from the boltdb closing logic.
* Remove certificate and vouchers from DB split migration.
* Removed vouchers and bumped up the migration version.
* Use same constructor in tests for storage node databases.
* Use same constructor in tests for storage node databases.
* Adding method to access underlining SQL database connections and cleanup
* Adding logging for migration diagnostics.
* Moved migration closing database logic to minimize disk usage.
* Cleaning up error handling.
* Fix missing copyright.
* Fix linting error.
* Add test for migration 21 (#3012)
* Refactoring migration code into a nicer to use object.
* Refactoring migration code into a nicer to use object.
* Fixing broken migration test.
* Removed unnecessary code that is no longer needed now that we close DBs.
* Removed unnecessary code that is no longer needed now that we close DBs.
* Fixed bug where an invalid database path was being opened.
* Fixed linting errors.
* Renamed VersionsDB to LegacyInfoDB and refactored DB lookup keys.
* Renamed VersionsDB to LegacyInfoDB and refactored DB lookup keys.
* Fix migration test. NOTE: This change does not address new tables satellites and satellite_exit_progress
* Removing v22 migration to move into it's own PR.
* Removing v22 migration to move into it's own PR.
* Refactored schema, rebind and configure functions to be re-useable.
* Renamed LegacyInfoDB to DeprecatedInfoDB.
* Cleaned up closeDatabase function.
* Renamed storageNodeSQLDB to migratableDB.
* Switched from using errs.Combine() to errs.Group in closeDatabases func.
* Removed constructors from storage node data access objects.
* Reformatted usage of const.
* Fixed broken test snapshots.
* Fixed linting error.
this is a trivial operation for storagenode/console, as it doesn't
really need or use kademlia in the first place.
What:
Removes kademlia from storagenode/console
Why:
We are in the process of getting rid of kademlia, and this is one place where it's particularly easy.
Please describe the tests:
Existing tests exercise storagenode/console behavior; if they continue to work, everything here should be tested satisfactorily.
Please describe the performance impact:
None
* implement contact.checkin method
* add batching to update uptime checks
* rm batching
* rm other unneeded things
* fix lint
* fix unit test
* changes per CR comments
* couple more CR changes
* add identity check into grpcOpt
* fix lint
* why do you fix the test
* revert test change
* stop contact chore for repair test
* put node in cache
* comment out contact chore. See what happens
* Revert "comment out contact chore. See what happens"
This reverts commit 2e45008e36a50e0a842ae455ac83de77093d4daa.
* try stopping contact earlier
* stop contact chore in uplink_test
* replace self on chore with *RoutingTable for access to latest node info
* Revert "stop contact chore in uplink_test"
This reverts commit 302db70f4071112d1b9f7ee0279225ea12757723.
* Revert "try stopping contact earlier"
This reverts commit 806cc3b82f9d598899dafd83da9315a1cb0cb43c.
* Revert "stop contact chore for repair test"
This reverts commit dd34de1cfdfc09b972186c9ab9a4f1e822446b79.
Don't return error when archiving errors which aren't found in the DB
because it causes Storage Node send orders cycle to stop.
This was applied in the commit e47b8ed131
but the last call to orders.Archive function was missed so the errors
weren't returned when not found orders in the first call but they were
returned in the second call.
This commit address the second call for making handleBatches function
never returns error on not found orders.
* nicer flags
* fix concurrency
* add concurrent workers
* initialize things
* fix tests
* close retain service
* ensure we don't have workers working on the same satellite
* ensure things compile
* fix other compilation issues:
* concurrency changes
ran this with `go test -count=1000` and it passed all of them.
- we add a closed channel so that we can select on it with
context cancellation.
- we put a once in so we only close the channel once.
- every time the queue/running state changes, we have to broadcast
because we may want to wake up N pending Wait calls or other
concurrent workers.
- because we broadcast, we don't need to do the polling in Wait
anymore.
- ensure Run doesn't start multiple times so that we don't have
to worry about concurrent Close with multiple Runs.
- hold the lock while we start workers so that a concurrent Close
with Run can't decide that there's nothing started and exit
and then have Run start things.
- make sure to poll the closed/context channels through loops
or at the start of Run calls in case Close happens first.
- these polls should be under a mutex because they have a default
case which makes it possible to schedule such that Close hasn't
executed the channel close so it starts more work.
- cancel a local Run context when it's going to exit to make sure
that any retainPieces calls have a canceled context.
- hopefully enough comments to both check my work and help readers
digest what's going on.
Change-Id: Ida0e226a7e01e8ae64fa2c59dd5a84b04bccfbd7
* use the retain error class
Change-Id: I1511eaef135f98afd57b878e997e4c8a0d11cafc
* concurrency fixes again
- forgot to update the gc test to use the old Wait api.
- we need to drop the lock while we wait for the workers
to exit, because they may be blocked on the condition
variable
- additionally, we need to broadcast when we close the
signal channel because the state changed: they want
to wake up and exit.
Change-Id: I4204699792275260cd912f29aa73720f7d9b14b5
* undo my misguided rename
Change-Id: I6baffe1eb0434e260212c485bbcc01bed3250881
* remove pollInterval
* format paragraph more nicely
* move skew calculation into retain pieces
This PR introduces functionality for routine deletion of archived orders.
The user may specify an interval at which to run archive cleanup and a TTL for archived items. During each cleanup, all items that have reached the TTL are deleted
This archive cleanup job is combined with the order sender into a new combined orders service
* storagenode/nodestats: fix issue on 32 bit platforms
time.Duration is an int64, so casting it down to an int
can cause it to become negative, causing a panic.
Change-Id: I33da7c29ddd59be60d8deec944a25f4a025902c7
* storagenode/nodestats: fix lint issue in test
Change-Id: Ie68598d724d2cae0dc959d4877098a08f4eb9af7