The previous default FlushBatchSize of 10000 was causing major
slow down in select and insert statements on bucket_bandwidth_rollups.
We saw on the saltlake satellite that a FlushBatchSize of 1000 helped
reduce contention and query latency.
Change-Id: Ib95e73482219bc5aedc11925b1849fa5999774ba
Remove the orders Settlement endpoint because it isn't used and it was
already always returning an error.
Change-Id: I81486fbe7044a1444182173bc0693698ee7cfe7e
Delete satellite order methods and DB tables which aren't used anymore
after we have done a refactoring on the orders to stuck bucket
information in the orders' encrypted metadata.
There are also configuration parameters and a satellite chore that
aren't needed anymore after the orders refactoring.
Change-Id: Ida3682b95921df70792284b42c96d2508bf8ca9c
CreateGetOrderLimits is not used anymore because we have CreateGetOrderLimits2. We need to remove old method and fix name of second.
Change-Id: I59148b8d28fc9dbab7d452c884319125a02745d1
SatelliteAddress in OrderLimit is not being used anymore and some
satellite addresses may consume too much bytes.
Change-Id: Ic7a0efe5b6211c2f3b91af67b293cde98b29d074
Avoid using project uuid string representation, because
it uses more bandwidth.
This reduces the encrypted metadata size from 118 -> 97 bytes.
Change-Id: Ic53a81b83acc065f24f28cd404f9c0b1fe592594
Since the Satellite now requires the order encryption functionality (since serial_number table is deprecated) to properly function, we can remove the config flag to turn on/off the feature.
Change-Id: Ie973f72a9a05a81cef9e53dc9c99d22c940c2488
This PR contains the minimum changes needed to stop inserting into the serial_numbers table. This is the first step in completely deprecating that table.
The next step is to create another PR to remove the expiredSerial chore, fix more tests, and remove any other methods on the serial_number table.
Change-Id: I5f12a56ebf3fa4d1a1976141d2911f25a98d2cc3
We want to stop using the serial_numbers table in satelliteDB. One of the last places using the serial_numbers table is when storagenodes settle orders, we look up the bucket name and project ID from the serial number from the serial_numbers table.
Now that we have support to add encrypted metadata into the OrderLimit, this PR makes use of that and now attempts to read the project ID and bucket name from the encrypted orderLimit metadata instead of from the serial_numbers table. For backwards compatibility and to ensure no errors, we will still fallback to the old way of getting that info from the serial_numbers table, but this will be removed in the next release as long as there are no errors.
All processes that create orderLimits must have an orders.encryption-keys set. The services that create orderLimits (and thus need to encrypt the order metadata) are the satellite apiProcess, the repair process, audit service (core process), and graceful exit (core process). Only the satellite api process decrypts the order metadata when storagenodes settle orders. This means that the same encryption key needs to be provided in the config for the satellite api process, repair process, and the core process like so:
orders.include-encrypted-metadata=true
orders.encryption-keys="<"encryptionKeyID>=<encryptionKey>"
Change-Id: Ie2c037971713d6fbf69d697bfad7f8b672eedd66
Currently flag parsing seems to call Set twice, which causes problems
with encryption keys. We can clear for every set for now.
Change-Id: Id5c695b4020194ac1c50a2da9c7d2a896cb9216f
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.
Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
Before manipulating order information on storagenodes we need to wait
for the orders to propagate to the database. Some of that happens
async with uplink.
Change-Id: Iaacfd7db0909ab5d2831d06388e5fb27b6d4778f
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
This change is adjusting metainfo endpoint to use metabase for uploading
and downloading remote objects. Inline segments will be added later.
Change-Id: I109d45bf644cd48096c47361043ebd8dfeaea0f3
Storage nodes undergoing Graceful Exit have up to now been receiving
hostnames for all other storage nodes they need to contact when
transferring pieces. This adds up to a lot of DNS lookups, which
apparently overwhelm some home routers. There does not seem to be any
need for us to send hostnames for graceful exit as opposed to IP
addresses; we already use IP addresses (as given by the last_ip_port
column in the nodes table) for all the GET and PUT orders we send out.
This change causes IP addresses to be used instead.
I started trying to construct a test to ensure that the behavior
changed, but it was rabbit-holing, so I've begun to feel that maybe this
change doesn't require one; it is a very simple change, and very much of
the same nature as what we already do for IPs in CreateGetOrderLimits
and CreatePutOrderLimits (and others).
Change-Id: Ib2b5ffe7a9310e9cdbe7464450cc7c934fa229a1
In production we are seeing ~115 storage nodes (out of ~6,500) are not using the new SettlementWithWindow endpoint (but they are upgraded to > v1.12).
We analyzed data being reported by monkit for the nodes who were above version 1.11 but were not successfully submitting orders to the new endpoint.
The nodes fell into a few categories:
1. Always fail to list orders from the db; never get to try sending orders from the filestore
2. Successfully list/send orders from the db; never get to calling satellite endpoint for submitting filestore orders
3. Successfully list/send orders from the db; successfully list filestore orders, but satellite endpoint fails (with "unauthenticated" drpc error)
The code change here add the following to address these issues:
- modify the query for ordersDB.listUnsentBySatellite so that we no longer select expired orders from the unsent_orders table
- always process any orders that are in the ordersDB and also any orders stored in the filestore
- add monkit monitoring to filestore.ListUnsentBySatellite so that we can see the failures/successes
Change-Id: I0b473e5d75252e7ab5fa6b5c204ed260ab5094ec
This preserves the last_ip_and_port field from node lookups through
CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later
calls to (*Verifier).GetShare() can try to use that IP and port. If a
connection to the given IP and port cannot be made, or the connection
cannot be verified and secured with the target node identity, an
attempt is made to connect to the original node address instead.
A similar change is not necessary to the other Create*OrderLimits
functions, because they already replace node addresses with the cached
IP and port as appropriate. We might want to consider making a similar
change to CreateGetRepairOrderLimits(), though.
The audit situation is unique because the ramifications are especially
powerful when we get the address wrong. Failing a single audit can have
a heavy cost to a storage node. We need to make extra effort in order
to avoid imposing that cost unfairly.
Situation 1: If an audit fails because the repair worker failed to make
a DNS query (which might well be the fault on the satellite side), and
we have last_ip_and_port information available for the target node, it
would be unfair not to try connecting to that last_ip_and_port address.
Situation 2: If a node has changed addresses recently and the operator
correctly changed its DNS entry, but we don't bother querying DNS, it
would be unfair to penalize the node for our failure to connect to it.
So the audit worker must try both last_ip_and_port _and_ the node
address as supplied by the SNO.
We elect here to try last_ip_and_port first, on the grounds that (a) it
is expected to work in the large majority of cases, and (b) there
should not be any security concerns with connecting to an out-or-date
address, and (c) avoiding DNS queries on the satellite side helps
alleviate satellite operational load.
Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200
We are moving an error into rejectErr since its preventing storage nodes from being able to settle other orders.
Change-Id: I3ac97c340e491b127f5e0024c5e8bd9f4df8d5c3
The same was that our Admin API handles project and account deletions currently, we would like
to have the same checks on the user-facing API. This PR adds the same checks to the console service.
General more applicable checks have been moved directly into the payments service.
In addition it adds the BucketsDB to the console DB, to have easier access and avoiding import cycles with
the metainfo package.
A small cleanup around our unnecessary monkit imports made it in as well.
Change-Id: I8769b01c2271c1687fbd2269a738a41764216e51
holding it during node i/o means slow nodes can hold up order
processing for everyone else. this dramatically increases
the amount of tiem spent handling orders.
Change-Id: Iec999b7ed0817c921a0fd039097a75bdd3c70ea2
Doing it at the ProcessOrders level was insufficient: the endpoints
make multiple database calls. It was a misguided attempt to only
have one spot enter the semaphore. By putting it in the endpoint
we can not only be sure that the concurrency is correctly limited
but it can be configurable easily.
Change-Id: I937149dd077adf9eb87fce52a1a17dc0afe96f64
nodes are submitting using both the legacy and windowed endpoints
and thus having their legacy submissions rejected.
it is legal to use both the legacy and windowed endpoints
in phase1 since they use the same backend. the legacy endpoint
is disabled in phase2 and phase3.
therefore, if we wait an order expiration period (2 days) after
we determine enough nodes have started using the windowed
endpoint, we can be sure that any orders they did have to
submit with the legacy endpoint will have expired.
Change-Id: I4418a881bf8bb9377efaef4c651e6103a5dc6ed0
satellite.DB.Console().Projects().GetAll database query
can be replaced with planet.Uplinks[0].Projects[0].ID
Change-Id: I73b82b91afb2dde7b690917345b798f9d81f6831