The rollup archiver chore moves bucket bandwidth rollups and
storagenode rollups that are older than a given duration
to two new archive tables.
Change-Id: I1626a3742ad4271bc744fbcefa6355a29d49c6a5
This index is obsolete and duplicates a similiar (project_id, name)
index on the same table.
Moreover, it might confuse CockroachDB which of the two index to use,
which may might affect DB performance.
Change-Id: If8d1df8347714942cea9dca82864ba5f4973bed3
This is the first step in the removal of uptime columns on the
nodes table. These columns are no longer used:
uptime_success_count
total_uptime_count
uptime_reputation_alpha
uptime_reputation_beta
In order to avoid breaking backwards compatibility, we need to
remove all references to these columns before removing the columns
themselves from the database. However, since uptime_success_count
and total_uptime_count are NOT NULLABLE, we can't remove them from
the insert statements in the overlay. So we can't remove the columns
because of the references, and we can't remove the references because
the columns can't be null. What a pickle. To remedy this, we will set a
default on the columns. Then we should be able to remove them from the
insert statements
Change-Id: I75f6c56fb7897835bbf29869f86f39de1d9dd345
Now that the deprecated downtime tracking service is removed
(3fc76f4ffe), we can safely remove
the nodes_offline_times table.
Change-Id: Ia7c6efe32ba104dff5a830af5f2beee3337eefe5
We need to be able to list all buckets in DB without knowing project ID.
This method will be used to list buckets for metainfo loop
implementation based on metabase.
Change-Id: Iac75af0eee4f31e80a15577575a8249cbca787b2
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.
Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
This fixes a slow query that was taking up to 4 seconds in production
SELECT node_id, path, piece_num, root_piece_id, durability_ratio, queued_at, requested_at, last_failed_at, last_failed_code, failed_count, finished_at, order_limit_send_count
FROM graceful_exit_transfer_queue
WHERE node_id = '[redacted]'
AND finished_at is NULL
AND last_failed_at is NULL
ORDER BY durability_ratio asc, queued_at asc LIMIT 300 OFFSET 0;
Change-Id: Ib89743ca35f1d8d0a1456b20fa08c683ebdc1549
This change completes the column migration of
5f6fccc6e8 and
2f648fd981.
It resets every users project limits who are below or equal to our
current production defaults.
Change-Id: Ie041d08bb67b62844f6023190fc00bc2dad5b1cb
This PR adds the following items:
1) an in-memory read-only cache thats stores project limit info for projectIDs
This cache is stored in-memory since this is expected to be a small amount of data. In this implementation we are only storing in the cache projects that have been accessed. Currently for the largest Satellite (eu-west) there is about 4500 total projects. So storing the storage limit (int64) and the bandwidth limit (int64), this would end up being about 200kb (including the 32 byte project ID) if all 4500 projectIDs were in the cache. So this all fits in memory for the time being. At some point it may not as usage grows, but that seems years out.
The cache is a read only cache. When requests come in to upload/download a file, we will read from the cache what the current limits are for that project. If the cache does not contain the projectID, it will get the info from the database (satellitedb project table), then add it to the cache.
The only time the values in the cache are modified is when either a) the project ID is not in the cache, or b) the item in the cache has expired (default 10mins), then the data gets refreshed out of the database. This occurs by default every 10 mins. This means that if we update the usage limits in the database, that change might not show up in the cache for 10 mins which mean it will not be reflected to limit end users uploading/downloading files for that time period..
Change-Id: I3fd7056cf963676009834fcbcf9c4a0922ca4a8f
Our current endpoints bail on us, if the column data is null. Thus we need
to take the intermediate step and set the default to a fixed value and
reset those with the following release.
It sets the default column value to our current config values of 50GB
for storage and bandwidth and 100 buckets, while still enabling the field to be nullable.
All 0 values are migrated to be the default as well to ensure they can
keep using their projects, as with the original change, 0 actually means 0.
Change-Id: I797be80ce2d2105091599dc1b3fc76f74336b66b
Currently we have no way to actually set one
of the following limits to 0 (meaning not usable):
- maxBuckets
- usageLimit
- bandwidthLimit
With having the field nullable,
NULL corresponds to the global default,
0 now actually 0 and
a set value determines a custom limit.
Change-Id: I92bb77529dcbd0881ae8368921be9d246eb0919e
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.
Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
It's an obsolete table from earlier state of Stripe invoices
implementation. No code is currently using it. It is confirmed that this
table is currently empty across all satellites.
Change-Id: I12d2756578faf8418ea8f3b09088e885694b8925
Jira: https://storjlabs.atlassian.net/browse/USR-822
This the last step of dropping these 2 db tables. It also deletes all
code associate with them.
Change-Id: I8be840dc2a7be255cf6308c9434b729fe4d9391e
Why: We need a way to cut down on database traffic due to bandwidth
measurement and tracking.
What: This changeset is the Satellite side of settling orders in 1 hr windows.
See design doc for more details: https://review.dev.storj.io/c/storj/storj/+/1732
Change-Id: I2e1c151e2e65516ebe1b7f47b7c5f83a3a220b31
This system tracks an abstract "api version" from nodes based on
their usage, allowing us to have latching behavior where if a node
ever uses a new api, it can be blocked from using the old api.
This is better than using self-reported semver version information
because the node cannot lie, there's no confusion about what semver
version implies which features, no questions about dev and ci
environments, and no dependencies between reporting the version
and using the new api.
Change-Id: Ifeced5c9ae8e0a16102d79635e176a7d3bdd8ed4
When a request comes in on the satellite api and we validate the
macaroon, we now also check if any of the macaroon's tails have been
revoked.
Change-Id: I80ce4312602baf431cfa1b1285f79bed88bb4497
add new columns `offline_suspended` and `under_review` to nodes table.
`unknown_audit_suspended` is a new column which will replace `suspended`
Change-Id: I22ddeb338ea0ff63f14332a7ebd0f3e9e4c06cdc
To avoid including multiple months in a single invoice, we need all
inspector's invoice commands to run in for specific period.
See https://storjlabs.atlassian.net/browse/USR-725
Change-Id: I3637dc189234f02350daca8d897c21765762ea55
This reverts commit 105dc7acc6.
Reason for revert: Recent changes to the Postgres query plan seems to want to use this index now. Reverting until we have time to analyze what's happening.
Change-Id: I74b4b5a8f15c3850d8a958a29f51dbc80e7c282c
The goal of this change is to improve the storagenode_storage_tallies table by removing the unneeded id column that is not being used but only taking up space, and also to add an index on a different column that needs it. Removing and adding a column seems simple, but ended up being more complicated because of some cockroachdb limitations.
The cockroachdb limitation when trying to remove a column from a table and create a new primary key are:
1. only allows primary key creation at table creation time (docs: https://www.cockroachlabs.com/docs/stable/primary-key.html)
2. table drop or rename is performed async and cannot be done in a transaction (issue: https://github.com/cockroachdb/cockroach/issues/12123, https://github.com/cockroachdb/cockroach/issues/22868)
To address these differences between cockroachdb and Postgres, this PR performs different migrations for the two database. The Postgres migration is straight forward and what you would expect, but the cockroach migration has two main changes:
1. To change a primary key, use the recommended process from the cockroachdb docs to create a new table with the new primary key you want and then migrate the data.
2. In order to do 1, we needed to do the new table renaming in a separate transaction from the data migration.
Ref: SM-65
Change-Id: Idc9aee3ab57aa4d5570e3d2980afea853cd966bf
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair
This commit also removes unused overlay functionality.
Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).
Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states
Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address
This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base
Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.
Of the 14 fields that lacked time zones, only the 3 named 'interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE')
Field with good data are migrated by adding AT TIME ZONE 'UTC'
Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.
https://storjlabs.atlassian.net/browse/SM-104
Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce