We changed the primary key for the bucket_bandwidth_rollups table. Now
we need to do some cleanup in places like structs, sorting methods, and
SQL queries.
Change-Id: Ida4f874f161356df193379a53507602e04db1668
Earlier we made a change to not cancel flushing orders when the flush
was triggered by the orders endpoint method, but we missed a case where
it can also be triggered (and canceled) by the metainfo endpoint
methods. This change moves the ignoring of context cancellation deeper.
Change-Id: Id43176f552efc3167345783f73aab885411ac247
Orders from storage nodes are received by the SettlementWithWindowFinal
method. A stream receives all orders, and once all of them have arrived
we insert storagenode and bucket bandwidth into the DB. The problem is
with bucket bandwidth, which is stored through a cache that often uses
the context from the SettlementWithWindowFinal stream to perform DB
inserts, and it does so in a separate goroutine. Because of that, it is
possible for SettlementWithWindowFinal to finish before flushing has,
in which case the context is canceled in the middle of the DB insert.
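A minimal sketch of the pattern, with illustrative names rather than
the actual satellite code: wrap the stream context in a type that drops
cancellation before handing it to the flush goroutine (newer Go
versions offer context.WithoutCancel for exactly this):

    package orders

    import (
        "context"
        "time"
    )

    // withoutCancellation wraps a context so Deadline, Done, and Err
    // report "never canceled" while values (e.g. tracing) pass through.
    type withoutCancellation struct{ context.Context }

    func (withoutCancellation) Deadline() (time.Time, bool) { return time.Time{}, false }
    func (withoutCancellation) Done() <-chan struct{}       { return nil }
    func (withoutCancellation) Err() error                  { return nil }

    // flushAsync runs the DB insert in its own goroutine. Using
    // streamCtx directly would cancel the insert as soon as the RPC
    // returns; the wrapper lets the flush complete independently.
    func flushAsync(streamCtx context.Context, insert func(context.Context) error) {
        go func() {
            _ = insert(withoutCancellation{streamCtx})
        }()
    }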
Change-Id: I3a72c86390e9aedc060f6b082bb059f1406231ee
For bucket_bandwidth_rollups we are trying to insert lots of entries
with an empty bucket name and project ID. Those are inserts from orders
created by repair, audit, and GE. High load on the same primary key
(the same range) causes many retries, and that affects all inserts
because we are putting 1000 entries into the DB at once.
This change solves the problem by not storing actions other than GET
and PUT in bucket_bandwidth_rollups. Only those actions matter from a
bucket bandwidth usage perspective, because they are the actions
performed by users. The other actions (repair, audit, and GE) are also
stored in storagenode_bandwidth_rollups, so we will still have access
to them, e.g. for statistics purposes.
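A sketch of the filter, assuming the pb.PieceAction enum from
storj.io/common/pb; the hook point in the rollups cache is illustrative:

    import "storj.io/common/pb"

    // intoBucketRollups reports whether an action is user traffic and
    // thus belongs in bucket_bandwidth_rollups. Repair, audit, and GE
    // traffic is skipped here but still lands in
    // storagenode_bandwidth_rollups.
    func intoBucketRollups(action pb.PieceAction) bool {
        switch action {
        case pb.PieceAction_GET, pb.PieceAction_PUT:
            return true
        default:
            return false
        }
    }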
https://github.com/storj/storj/issues/5332
Change-Id: Ibb5bf0a4c869b0439dc65da1c9342a38ca2890ba
Sorting by primary key before inserting data into the DB is fixed.
Earlier we sorted the input slice of BucketBandwidthRollup, but then we
put all entries into a map to roll up the input data. Iterating over a
map with a range loop doesn't guarantee any specific order, so we lost
the sorted order when building the slices for the DB insert from this
map.
The new code still uses a map, but when the map is full it sorts the
map keys separately and iterates over them to read the data out of the
map.
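In outline, the fix is the standard sorted-keys iteration; a simplified
sketch with fewer fields than the real key type, sorting in the primary
key's order (bucket name first, per the entry further down):

    import "sort"

    type rollupKey struct {
        BucketName string
        ProjectID  string
    }

    // flushOrder returns the map's keys sorted to match the table's
    // primary key, so the batched insert touches rows in index order.
    func flushOrder(m map[rollupKey]int64) []rollupKey {
        keys := make([]rollupKey, 0, len(m))
        for k := range m {
            keys = append(keys, k)
        }
        sort.Slice(keys, func(i, j int) bool {
            if keys[i].BucketName != keys[j].BucketName {
                return keys[i].BucketName < keys[j].BucketName
            }
            return keys[i].ProjectID < keys[j].ProjectID
        })
        return keys
    }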
https://github.com/storj/storj/issues/5332
Change-Id: I5bf09489b0eecb6858bf854ab387b660124bf53f
* use the same DB application name for satellite and metabase
* use a no-op orders DB implementation to avoid storing allocated
  bandwidth in the DB
Change-Id: I20e88c694d38240fe1a20c45719e210cfb76402c
Currently the underlying rollup table's primary key is the bucket
name, but we used to sort by projectID. This caused deadlocks due to
contention during updates/inserts. We should reevaluate whether bucket
name being the primary key is right for this table, but this should
stop the long-running and failing attempts for now.
Change-Id: Ie7d0f86944da48ad9cbd92eb162226882a2fb954
Batching of the order submissions can lead to combining the allocated
traffic totals for two completely different time windows, resulting
in incorrect customer accounting. This change will group the batched
order submissions by projectID as well as time window, leading to
distinct updates of a bucket's bandwidth rollup based on the hour
window in which the order was created.
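Conceptually the batch key becomes (projectID, hour window) instead of
projectID alone; a minimal sketch with illustrative types:

    import "time"

    type order struct {
        projectID string // illustrative; the real code uses a UUID type
        createdAt time.Time
    }

    type batchKey struct {
        projectID string
        window    time.Time // start of the hour the order was created in
    }

    // groupSubmissions buckets orders so each rollup update covers
    // exactly one project and one hour window.
    func groupSubmissions(orders []order) map[batchKey][]order {
        groups := make(map[batchKey][]order)
        for _, o := range orders {
            k := batchKey{o.projectID, o.createdAt.Truncate(time.Hour)}
            groups[k] = append(groups[k], o)
        }
        return groups
    }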
Change-Id: Ifb4d67923eec8a533b9758379914f17ff7abea32
Currently most of the satellite log is made up of this specific log
message, so we reduce its log level. We already have a monkit event
tracking this case. If we need to look into it further, we can simply
increase the satellite log verbosity.
Change-Id: I27ebed3e6745701c66c83e8c52ddc836ad9d5f4e
When we switched to the segments loop we stopped adding the bucket
name and project ID to the order limit metadata. This is causing lots
of logging that we don't need.
Change-Id: Id3f878a170b7a6b801e8a838ee69165715985d60
Populate the egress_dead column to account for allocated bandwidth that can be freed because the storage nodes have already sent in their orders. The bandwidth not used in those orders can be allocated again.
Change-Id: I78c333a03945cd7330aec052edd3562ec671118e
the very common case is that the node api version is
indeed at least the requested version, so query for
that first to avoid write traffic.
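a hedged sketch of the read-before-write with database/sql; the table
and column names are assumptions:

    import (
        "context"
        "database/sql"
    )

    // updateAPIVersion issues the UPDATE only when the cheap read shows
    // the stored version is below the requested one, avoiding write
    // traffic in the common case.
    func updateAPIVersion(ctx context.Context, db *sql.DB, nodeID string, version int) error {
        var current int
        err := db.QueryRowContext(ctx,
            `SELECT api_version FROM nodes WHERE id = $1`, nodeID).Scan(&current)
        if err != nil && err != sql.ErrNoRows {
            return err
        }
        if current >= version {
            return nil // common case: already at least the requested version
        }
        _, err = db.ExecContext(ctx,
            `UPDATE nodes SET api_version = $2 WHERE id = $1 AND api_version < $2`,
            nodeID, version)
        return err
    }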
Change-Id: Ib047d93078205bc07fee75d1f635503b792307f0
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example, a log line could end
up looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
Remove the orders Settlement endpoint because it isn't used and it was
already always returning an error.
Change-Id: I81486fbe7044a1444182173bc0693698ee7cfe7e
Delete satellite order methods and DB tables which aren't used anymore
after we refactored the orders to store bucket
information in the orders' encrypted metadata.
There are also configuration parameters and a satellite chore that
aren't needed anymore after the orders refactoring.
Change-Id: Ida3682b95921df70792284b42c96d2508bf8ca9c
Avoid using the project UUID's string representation, because
it uses more bandwidth.
This reduces the encrypted metadata size from 118 to 97 bytes.
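For scale: a UUID is 16 raw bytes but 36 bytes in its canonical string
form, so dropping the string representation saves 20 bytes per UUID
before encoding overhead, roughly matching the reduction above. A
stdlib-only illustration:

    package main

    import "fmt"

    func main() {
        var raw [16]byte // a project id's raw bytes
        str := fmt.Sprintf("%x-%x-%x-%x-%x",
            raw[0:4], raw[4:6], raw[6:8], raw[8:10], raw[10:16])
        fmt.Println(len(raw), len(str)) // 16 vs 36
    }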
Change-Id: Ic53a81b83acc065f24f28cd404f9c0b1fe592594
This PR contains the minimum changes needed to stop inserting into the serial_numbers table. This is the first step in completely deprecating that table.
The next step is to create another PR to remove the expiredSerial chore, fix more tests, and remove any other methods on the serial_number table.
Change-Id: I5f12a56ebf3fa4d1a1976141d2911f25a98d2cc3
We want to stop using the serial_numbers table in satelliteDB. One of the last places still using it is when storage nodes settle orders: we look up the bucket name and project ID for the serial number in the serial_numbers table.
Now that we have support for adding encrypted metadata to the OrderLimit, this PR makes use of it and attempts to read the project ID and bucket name from the encrypted orderLimit metadata instead of from the serial_numbers table. For backwards compatibility and to ensure no errors, we still fall back to the old way of getting that info from the serial_numbers table, but the fallback will be removed in the next release as long as there are no errors.
All processes that create orderLimits must have orders.encryption-keys set. The services that create orderLimits (and thus need to encrypt the order metadata) are the satellite API process, the repair process, the audit service (core process), and graceful exit (core process). Only the satellite API process decrypts the order metadata when storage nodes settle orders. This means the same encryption key needs to be provided in the config for the satellite API process, the repair process, and the core process, like so:
orders.include-encrypted-metadata=true
orders.encryption-keys="<encryptionKeyID>=<encryptionKey>"
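At settle time the lookup then follows this shape; a sketch with stub
types, not the actual satellite code:

    import (
        "context"
        "errors"
    )

    type OrderLimit struct {
        SerialNumber      string
        EncryptedMetadata []byte
    }

    // resolveBucket prefers the new encrypted-metadata path and falls
    // back to the legacy serial_numbers lookup, which is slated for
    // removal once the new path proves error-free.
    func resolveBucket(ctx context.Context, limit OrderLimit) (projectID, bucket string, err error) {
        if projectID, bucket, err = decryptMetadata(limit.EncryptedMetadata); err == nil {
            return projectID, bucket, nil
        }
        return lookupSerialNumber(ctx, limit.SerialNumber)
    }

    // Stubs standing in for real decryption and the legacy DB lookup.
    func decryptMetadata([]byte) (string, string, error) { return "", "", errors.New("no metadata") }
    func lookupSerialNumber(context.Context, string) (string, string, error) { return "", "", nil }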
Change-Id: Ie2c037971713d6fbf69d697bfad7f8b672eedd66
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.
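A sketch of the batched delete loop (CockroachDB accepts LIMIT on
DELETE; the schema details are assumptions and 1000 stands in for the
configurable batch size):

    import (
        "context"
        "database/sql"
        "time"
    )

    // deleteExpiredSerials removes expired rows in batches so each
    // statement stays short and doesn't block other queries.
    func deleteExpiredSerials(ctx context.Context, db *sql.DB, now time.Time, batchSize int) error {
        for {
            res, err := db.ExecContext(ctx,
                `DELETE FROM serial_numbers WHERE expires_at < $1 LIMIT $2`,
                now, batchSize)
            if err != nil {
                return err
            }
            n, err := res.RowsAffected()
            if err != nil {
                return err
            }
            if n < int64(batchSize) {
                return nil // last (partial) batch deleted
            }
        }
    }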
Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
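After the migration, the rollup write can be an upsert along these
lines; the column list and the summing conflict action are illustrative
assumptions:

    // Identical (node_id, start_time) pairs now update in place instead
    // of violating the primary key.
    const upsertRollup = `
        INSERT INTO accounting_rollups (node_id, start_time, put_total, get_total)
        VALUES ($1, $2, $3, $4)
        ON CONFLICT (node_id, start_time) DO UPDATE SET
            put_total = accounting_rollups.put_total + EXCLUDED.put_total,
            get_total = accounting_rollups.get_total + EXCLUDED.get_total
    `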
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
We are moving an error into rejectErr since it is preventing storage nodes from being able to settle other orders.
Change-Id: I3ac97c340e491b127f5e0024c5e8bd9f4df8d5c3
The same way that our Admin API currently handles project and account
deletions, we would like to have the same checks on the user-facing
API. This PR adds those checks to the console service.
More generally applicable checks have been moved directly into the
payments service.
In addition, it adds the BucketsDB to the console DB, to allow easier
access and to avoid import cycles with the metainfo package.
A small cleanup of our unnecessary monkit imports made it in as well.
Change-Id: I8769b01c2271c1687fbd2269a738a41764216e51
holding it during node i/o means slow nodes can hold up order
processing for everyone else. this dramatically increases
the amount of time spent handling orders.
Change-Id: Iec999b7ed0817c921a0fd039097a75bdd3c70ea2
Doing it at the ProcessOrders level was insufficient: the endpoints
make multiple database calls. It was a misguided attempt to only
have one spot enter the semaphore. By putting it in the endpoint
we can not only be sure that the concurrency is correctly limited,
but it can also be configured easily.
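A sketch of an endpoint-level limit using a buffered channel as a
semaphore, with the limit standing in for the config value:

    import "context"

    // concurrencyLimiter caps how many settlement requests run at once.
    type concurrencyLimiter chan struct{}

    func newConcurrencyLimiter(limit int) concurrencyLimiter {
        return make(concurrencyLimiter, limit)
    }

    // Do runs fn while holding a slot, so the limit covers every
    // database call the endpoint makes, not just ProcessOrders.
    func (s concurrencyLimiter) Do(ctx context.Context, fn func() error) error {
        select {
        case s <- struct{}{}:
            defer func() { <-s }()
            return fn()
        case <-ctx.Done():
            return ctx.Err()
        }
    }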
Change-Id: I937149dd077adf9eb87fce52a1a17dc0afe96f64
nodes are submitting using both the legacy and windowed endpoints
and thus having their legacy submissions rejected.
it is legal to use both the legacy and windowed endpoints
in phase1 since they use the same backend. the legacy endpoint
is disabled in phase2 and phase3.
therefore, if we wait an order expiration period (2 days) after
we determine enough nodes have started using the windowed
endpoint, we can be sure that any orders they did have to
submit with the legacy endpoint will have expired.
Change-Id: I4418a881bf8bb9377efaef4c651e6103a5dc6ed0
This adds a config flag orders.window-endpoint-rollout-phase
that can take on the values phase1, phase2 or phase3.
In phase1, the current orders endpoint continues to work as
usual, and the windowed orders endpoint uses the same backend
as the current one (but also does a bit extra).
In phase2, the current orders endpoint is disabled and the
windowed orders endpoint continues to use the same backend.
In phase3, the current orders endpoint is still disabled and
the windowed orders endpoint uses the new backend that requires
much less database traffic and state.
The intention is to deploy in phase1, roll out code to nodes
to have them use the windowed endpoint, switch to phase2, wait
a couple days for all existing orders to expire, then switch
to phase3.
Additionally, it fixes a bug where a node could submit a bunch
of orders and rack up charges for a bucket.
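A sketch of how such a phase value can gate the two endpoints; names
are illustrative:

    type WindowEndpointRolloutPhase string

    const (
        Phase1 WindowEndpointRolloutPhase = "phase1" // legacy on; windowed uses old backend
        Phase2 WindowEndpointRolloutPhase = "phase2" // legacy off; windowed uses old backend
        Phase3 WindowEndpointRolloutPhase = "phase3" // legacy off; windowed uses new backend
    )

    // LegacyEnabled reports whether the old orders endpoint still
    // accepts submissions in this phase.
    func (p WindowEndpointRolloutPhase) LegacyEnabled() bool { return p == Phase1 }

    // WindowedUsesNewBackend reports whether the windowed endpoint has
    // switched to the low-traffic backend.
    func (p WindowEndpointRolloutPhase) WindowedUsesNewBackend() bool { return p == Phase3 }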
Change-Id: Ifdc10e09ae1645159cbec7ace687dcb2d594c76d
Why: We need a way to cut down on database traffic due to bandwidth
measurement and tracking.
What: This changeset is the Satellite side of settling orders in 1 hr windows.
See design doc for more details: https://review.dev.storj.io/c/storj/storj/+/1732
Change-Id: I2e1c151e2e65516ebe1b7f47b7c5f83a3a220b31
there is a subset of storagenodes hammering the satellite with
expired orders. if we check for expiration first, we don't have
to do a bunch of pointless signature verification. since a && b
is equal to b && a, we can order these checks in any way we want
and have it still be correct.
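in code, the reordering is just a cheap check placed before the
expensive one (a self-contained sketch with stub types):

    import "time"

    type Order struct {
        Expiration time.Time
        Signature  []byte
    }

    // cheap check first: reject expired orders before paying for
    // signature verification. both checks are side-effect free, so the
    // swap cannot change the result.
    func validOrder(o Order, now time.Time) bool {
        if now.After(o.Expiration) {
            return false
        }
        return verifySignature(o)
    }

    // verifySignature stands in for the real cryptographic check.
    func verifySignature(Order) bool { return true }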
Change-Id: I6ffc8025c8b0d54949a1daf5f5ea1fed9e213372
we still need to come up with a better plan to get storage nodes
to stop doing this, but in the meantime, we know this is happening,
just stop logging it and keep some stats instead.
Change-Id: Icb6bcba275e0e955c54b1a90da2b37219fff2349
by doing an indexed anti-join we're able to reduce the time to
select the pending orders by over 10x on postgres. this should
help us process pending orders much more quickly.
it probably won't do as good a job on cockroach because it does
not do an indexed anti-join and instead does a hash join after
scanning the entire consumed serials table. we should either
remove orders entirely or try to make that more efficient
when necessary.
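the shape of the anti-join, sketched against assumed table and column
names:

    // select pending orders that have no matching consumed serial. with
    // a suitable index, postgres satisfies the NOT EXISTS probe per row
    // instead of scanning and hash-joining the whole consumed table.
    const selectPending = `
        SELECT p.*
        FROM pending_serial_queue p
        WHERE NOT EXISTS (
            SELECT 1
            FROM consumed_serials c
            WHERE c.storage_node_id = p.storage_node_id
              AND c.serial_number = p.serial_number
        )
    `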
Change-Id: I8ca0535acd21c51e74955b24c9b86d20e4f2ff9c
common/pb moved grpc to a separate package common/pb/pbgrpc.
This updates this repository to use it.
Change-Id: I2de2a190688871cf9cb61f7ea511f8a01e264e4e
This change adds two new tables to process orders as fast as we used
to but in an asynchronous manner and with hopefully less storage
usage. This should help us scale on cockroach, but it limits us to one
worker. It lays the groundwork for the order processing pipeline to
be queue-driven rather than database-driven.
For more details, see the added fast billing changes blueprint.
It also fixes the orders db so that all the timestamps that are
passed to columns that do not contain a time zone are converted to
UTC at the last possible opportunity, making it less likely to use
the APIs incorrectly. We really should migrate to include timezones
on all of our timestamp columns.
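The UTC rule in miniature: convert at the last moment, right where the
value is bound to the query (a sketch):

    import "time"

    // toUTC runs immediately before a value is bound to a TIMESTAMP
    // (without time zone) column, so the stored value never depends on
    // the server or session time zone.
    func toUTC(t time.Time) time.Time { return t.UTC() }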
Change-Id: Ibfda8e7a3d5972b7798fb61b31ff56419c64ea35