Jenkins has been failing a lot lately due to test timeouts with CockroachDB.
TestMigrateCockroach previously took around 5 minutes, now it takes 2.
Why 103? I couldn't get 100 to work due to an error w/ NOT NULL and PKs.
Change-Id: Iec95d4e25f9d6cd36920e7f43272c486a17fa879
It's an obsolete table from earlier state of Stripe invoices
implementation. No code is currently using it. It is confirmed that this
table is currently empty across all satellites.
Change-Id: I12d2756578faf8418ea8f3b09088e885694b8925
Jira: https://storjlabs.atlassian.net/browse/USR-822
This the last step of dropping these 2 db tables. It also deletes all
code associate with them.
Change-Id: I8be840dc2a7be255cf6308c9434b729fe4d9391e
This change switches the backend logic to use the new DB column on the users table to restrict project creation.
Furthermore it back fills the existing limits from registration tokens to the new column to ensure no users are reset to the new default.
UI is updated to reflect ability to create several projects
Change-Id: Ie29157430ae6b065411ca4c4557c9f1be69cdc4f
What:
Use the github.com/jackc/pgx postgresql driver in place of
github.com/lib/pq.
Why:
github.com/lib/pq has some problems with error handling and context
cancellations (i.e. it might even issue queries or DML statements more
than once! see https://github.com/lib/pq/issues/939). The
github.com/jackx/pgx library appears not to have these problems, and
also appears to be better engineered and implemented (in particular, it
doesn't use "exceptions by panic"). It should also give us some
performance improvements in some cases, and even more so if we can use
it directly instead of going through the database/sql layer.
Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044
This system tracks an abstract "api version" from nodes based on
their usage, allowing us to have latching behavior where if a node
ever uses a new api, it can be blocked from using the old api.
This is better than using self-reported semver version information
because the node cannot lie, there's no confusion about what semver
version implies which features, no questions about dev and ci
environments, and no dependencies between reporting the version
and using the new api.
Change-Id: Ifeced5c9ae8e0a16102d79635e176a7d3bdd8ed4
Use a field to distinguish migration steps that need to use a
different transaction from previous steps. This is clearer than
using a func.
Change-Id: I2147369d05413f3e8ddb50c71a46ab1ba3ab5114
When a request comes in on the satellite api and we validate the
macaroon, we now also check if any of the macaroon's tails have been
revoked.
Change-Id: I80ce4312602baf431cfa1b1285f79bed88bb4497
add new columns `offline_suspended` and `under_review` to nodes table.
`unknown_audit_suspended` is a new column which will replace `suspended`
Change-Id: I22ddeb338ea0ff63f14332a7ebd0f3e9e4c06cdc
the initial calculations for the historical values of comp_at_rest
were wrong. because our historical data only included total amounts
as well as compensation for bandwidth, the at rest value was
calculated as
at_rest = total - bandwidth
unfortunately, that calculation did not take surge pricing into
account correctly. the at rest and bandwidth values do not
include surge pricing, but the total that was used did. so what
we actually calculated was
no_surge_at_rest = surge_total - no_surge_bandwidth
which will create a value that is too large. this migration
fixes the calculation for imports that are old enough and
of a non-negligable difference.
Change-Id: I61eb0b670510f6d7fb8fc3de39ba79150fac10eb
This attempts to add a README.md to help create consistent migrations
that maximize our test coverage and do not include unnecessary
statements.
It also adds a feature to have an `-- OLD DATA --` section as well
as a `-- NEW DATA --` section so that we can fix mistakes made in
previous snapshots (like a row that was forgotten to be added when a
table was created) without editing them going forward.
Change-Id: I28a786f8ef163cae1de1bb08f61af1e1104b0a88
To avoid including multiple months in a single invoice, we need all
inspector's invoice commands to run in for specific period.
See https://storjlabs.atlassian.net/browse/USR-725
Change-Id: I3637dc189234f02350daca8d897c21765762ea55
CreateTables hasn't been quite true for a while now, rename to
MigrateToLatest to be clearer in it's behavior.
Change-Id: Ida48e95122a5d9b7a814e922d3698e00024a2ba7
If a node is suspended and receives an unknown or failing audit,
disqualify them if the grace period (default 1w in production) has
passed.
Migrate the nodes table so any node that is currently suspended gets
unsuspended when the satellite starts up.
Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82
This reverts commit 105dc7acc6.
Reason for revert: Recent changes to the Postgres query plan seems to want to use this index now. Reverting until we have time to analyze what's happening.
Change-Id: I74b4b5a8f15c3850d8a958a29f51dbc80e7c282c
The goal of this change is to improve the storagenode_storage_tallies table by removing the unneeded id column that is not being used but only taking up space, and also to add an index on a different column that needs it. Removing and adding a column seems simple, but ended up being more complicated because of some cockroachdb limitations.
The cockroachdb limitation when trying to remove a column from a table and create a new primary key are:
1. only allows primary key creation at table creation time (docs: https://www.cockroachlabs.com/docs/stable/primary-key.html)
2. table drop or rename is performed async and cannot be done in a transaction (issue: https://github.com/cockroachdb/cockroach/issues/12123, https://github.com/cockroachdb/cockroach/issues/22868)
To address these differences between cockroachdb and Postgres, this PR performs different migrations for the two database. The Postgres migration is straight forward and what you would expect, but the cockroach migration has two main changes:
1. To change a primary key, use the recommended process from the cockroachdb docs to create a new table with the new primary key you want and then migrate the data.
2. In order to do 1, we needed to do the new table renaming in a separate transaction from the data migration.
Ref: SM-65
Change-Id: Idc9aee3ab57aa4d5570e3d2980afea853cd966bf
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address
This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base
Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
We missed this in the migration that added the num_healthy_pieces
column. It exists in dbx, but not on the actual satellite table.
Change-Id: If16b5ec2325d56406250298531b3285215188bf3
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.
Of the 14 fields that lacked time zones, only the 3 named 'interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE')
Field with good data are migrated by adding AT TIME ZONE 'UTC'
Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.
https://storjlabs.atlassian.net/browse/SM-104
Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce
these tables are used in future commits with respect to the new
storagenode payments code. if we create them now, it will make
backfilling them with historical data easier.
Change-Id: I3c08c9770ec5b2baa38b4f2fd18c2f07746a61c2
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.
The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.
We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from on hour to allow us time to
try a wider variety of segments when the repair queue is very large.
Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
This change adds two new tables to process orders as fast as we used
to but in an asynchronous manner and with hopefully less storage
usage. This should help scale on cockroach, but limits us to one
worker. It lays the groundwork for the order processing pipeline to
be queue rather than database driven.
For more details, see the added fast billing changes blueprint.
It also fixes the orders db so that all the timestamps that are
passed to columns that do not contain a time zone are converted to
UTC at the last possible opportunity, making it less likely to use
the APIs incorrectly. We really should migrate to include timezones
on all of our timestamp columns.
Change-Id: Ibfda8e7a3d5972b7798fb61b31ff56419c64ea35
it was noticed that if you had a long lived transaction A that
was blocking some other transaction B and A was being aborted
due to retriable errors, then transaction B was never given
priority. this was due to using savepoints to do lightweight
retries.
this behavior was problematic becaue we had some queries blocked
for over 16 hours, so this commit addresses the issue with two
prongs:
1. bound the amount of time we will retry a transaction
2. create new transactions when a retry is needed
the first ensures that we never wait for 16 hours, and the value
chosen is 10 minutes. that should be long enough for an ample
amount of retries for small queries, and huge queries probably
shouldn't be retried, even if possible: it's more preferrable to
find a way to make them smaller.
the second ensures that even in the case of retries, queries that
are blocked on the aborted transaction gain priority to run.
between those two changes, the maximum stall time due to retries
should be bounded to around 10 minutes.
Change-Id: Icf898501ef505a89738820a3fae2580988f9f5f4
The old migration was not working. It was updateding pending (status 0)
and failed (status -1) to completed (status 100).
Change-Id: I808ff3cc692fe6c698ce26a8b411b134e67b752b
Currently Cockroach DB setup takes a significant amount of time.
This flattens the database setup into a single query,
which improves the test time significantly.
The migration tests still test each migration separately.
Change-Id: Iaca16f34a6af3926fa2b5ebf618f939fd59460b3
Limits how many times metainfo APIs can be called per second by project ID. If limit is exceeded, the API will return Unauthorized/Too Many requests.
Limit per second and the size of the limiter cache per project are configurable, as well as whether the limiter is enabled.
Tests added/updated for the new rate_limit field in projects table.
Tests added for exceeding limits and disableing limiter.
Change-Id: Ic8ad102de3b690a475809d4f684156d5715f20fa
Also added temporary types withRebind and withTagTx,
which will be later removed. Currently they help to avoid
changing the whole codebase at the same time.
Change-Id: I7f07ba8f4709a23a463bfa67464628665a05808f