errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.
Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
In production we are seeing ~115 storage nodes (out of ~6,500) are not using the new SettlementWithWindow endpoint (but they are upgraded to > v1.12).
We analyzed data being reported by monkit for the nodes who were above version 1.11 but were not successfully submitting orders to the new endpoint.
The nodes fell into a few categories:
1. Always fail to list orders from the db; never get to try sending orders from the filestore
2. Successfully list/send orders from the db; never get to calling satellite endpoint for submitting filestore orders
3. Successfully list/send orders from the db; successfully list filestore orders, but satellite endpoint fails (with "unauthenticated" drpc error)
The code change here add the following to address these issues:
- modify the query for ordersDB.listUnsentBySatellite so that we no longer select expired orders from the unsent_orders table
- always process any orders that are in the ordersDB and also any orders stored in the filestore
- add monkit monitoring to filestore.ListUnsentBySatellite so that we can see the failures/successes
Change-Id: I0b473e5d75252e7ab5fa6b5c204ed260ab5094ec
Abstract details of writing and reading data to/from orders files so
that adding V1 and future maintenance are easier.
Change-Id: I85f4a91761293de1a782e197bc9e09db228933c9
This change accomplishes multiple things:
1. Instead of having a max in flight time, which means
we effectively have a minimum bandwidth for uploads
and downloads, we keep track of what windows have
active requests happening in them.
2. We don't double check when we save the order to see if it
is too old: by then, it's too late. A malicious uplink
could just submit orders outside of the grace window and
receive all the data, but the node would just not commit
it, so the uplink gets free traffic. Because the endpoints
also check for the order being too old, this would be a
very tight race that depends on knowledge of the node system
clock, but best to not have the race exist. Instead, we piggy
back off of the in flight tracking and do the check when
we start to handle the order, and commit at the end.
3. Change the functions that send orders and list unsent
orders to accept a time at which that operation is
happening. This way, in tests, we can pretend we're
listing or sending far into the future after the windows
are available to send, rather than exposing test functions
to modify internal state about the grace period to get
the desired effect. This brings tests closer to actual
usage in production.
4. Change the calculation for if an order is allowed to be
enqueued due to the grace period to just look at the
order creation time, rather than some computation involving
the window it will be in. In this way, you can easily
answer the question of "will this order be accepted?" by
asking "is it older than X?" where X is the grace period.
5. Increases the frequency we check to send up orders to once
every 5 minutes instead of once every hour because we already
have hour-long buffering due to the windows. This decreases
the maximum latency that an order will be reported back to
the satellite by 55 minutes.
Change-Id: Ie08b90d139d45ee89b82347e191a2f8db1b88036
This reverts commit 8e242cd012.
Revert because lib/pq has known issues with context cancellation.
These issues need to be resolved before these changes can be merged.
Change-Id: I160af51dbc2d67c5449aafa406a403e5367bb555
this will allow for some nice runtime analysis down the road.
also, this allows for wrapping database handles in a way that
can interact with these contexts
requires https://review.dev.storj.io/c/storj/dbx/+/514
Change-Id: Ib087b7cd73296dd2c1e0331314da34d861f61d2b
* Split the info.db database into multiple DBs using Backup API.
* Remove location. Prev refactor assumed we would need this but don't.
* Added VACUUM to reclaim space after splitting storage node databases.
* Added unique names to SQLite3 connection hooks to fix testplanet.
* Moving DB closing to the migration step.
* Removing the closing of the versions DB. It's already getting closed.
* Swapping the database connection references on reconnect.
* Moved sqlite closing logic away from the boltdb closing logic.
* Moved sqlite closing logic away from the boltdb closing logic.
* Remove certificate and vouchers from DB split migration.
* Removed vouchers and bumped up the migration version.
* Use same constructor in tests for storage node databases.
* Use same constructor in tests for storage node databases.
* Adding method to access underlining SQL database connections and cleanup
* Adding logging for migration diagnostics.
* Moved migration closing database logic to minimize disk usage.
* Cleaning up error handling.
* Fix missing copyright.
* Fix linting error.
* Add test for migration 21 (#3012)
* Refactoring migration code into a nicer to use object.
* Refactoring migration code into a nicer to use object.
* Fixing broken migration test.
* Removed unnecessary code that is no longer needed now that we close DBs.
* Removed unnecessary code that is no longer needed now that we close DBs.
* Fixed bug where an invalid database path was being opened.
* Fixed linting errors.
* Renamed VersionsDB to LegacyInfoDB and refactored DB lookup keys.
* Renamed VersionsDB to LegacyInfoDB and refactored DB lookup keys.
* Fix migration test. NOTE: This change does not address new tables satellites and satellite_exit_progress
* Removing v22 migration to move into it's own PR.
* Removing v22 migration to move into it's own PR.
* Refactored schema, rebind and configure functions to be re-useable.
* Renamed LegacyInfoDB to DeprecatedInfoDB.
* Cleaned up closeDatabase function.
* Renamed storageNodeSQLDB to migratableDB.
* Switched from using errs.Combine() to errs.Group in closeDatabases func.
* Removed constructors from storage node data access objects.
* Reformatted usage of const.
* Fixed broken test snapshots.
* Fixed linting error.
This PR introduces functionality for routine deletion of archived orders.
The user may specify an interval at which to run archive cleanup and a TTL for archived items. During each cleanup, all items that have reached the TTL are deleted
This archive cleanup job is combined with the order sender into a new combined orders service
* Rebasing changes against master.
* Added back withTx().
* Fix using new error type.
* Moving back database initialization back into the struct.
* Fix failing migration tests.
* Fix linting errors.
* Renamed database object names to be consistent.
* Fixing linting error in imports.
* Rebasing changes against master.
* Added back withTx().
* Fix using new error type.
* Moving back database initialization back into the struct.
* Fix failing migration tests.
* Fix linting errors.
* Renamed database object names to be consistent.
* Fixing linting error in imports.
* Adding missing change from merge.
* Fix error name.
When an unsent order stored in the DB cannot be unmarshalled due to an
unmarshal error the rest unsent orders must be processed as usual.
This changes will avoid that a Storage Node with unsent orders with
invalid protobuf serialized values get blocked without sending orders
until those invalid ones get removed from the DB.
* pkg/process: Fatal show complete error information
Change the general process execution function to not using the sugared
logger for outputting the full error information.
Delete some unreachable code because Zap logger Fatal method calls exit
1 internally.
* storagenode/storagenodedb: Add info to error
Add more information to an error returned due to some data
inconsistency.
* storagenode/orders: Don't use sugared logger
Don't use sugar logger and provide better contextualized error messages
in settle method.
* storagenode/orders: Add some log fields to error msgs
Add some relevant log fields to some logged errors of the sender settle
method.
* satellite/orders: Remove always nil error from debug
Remove an error which as logged in debug level which was always nil and
makes the logic that used this variable clear.
* storagenode/orders: Don't return error Archiving unsent
Don't stop the process which archive unsent orders if some of them
aren't found the DB because it cause the Storage Node to stop with a
fatal error.
* storagenode: remove datetime calls in favor of UTC
datetime only has second level granularity whereas string
comparisons don't. Since we're wiping everything anyway, it's
easier to just use UTC everywhere rather than migrate to
datetime calls.
* add utcdb to check that arguments are utc
* storagenodedb: add trivial tests to ensure calls work
This at least tests that all of the timestamps passed in are
in the UTC timezone.
* fix truncated comment and change migrations to be UTC
* remove infodb locks and give a unique name for each in memory created.
* changed max idle and open to 1 for memory DBs. fixes table locking errors
* fixed race condition
* added file based infodb test
* added busy timeout parameter to the file based infodb for testing
* fixed imports
* removed db.locked() after merge from master