This reverts commit 2c0a360a14.
Avoid big transactions. We'll do it outside of the migration pipeline.
Change-Id: Iade810d81bb2453c9e351149cb84662b207ee527
Value attribution codes were converted into UUIDs and stored in the users, projects, api_keys, bucket_metainfos, and value_attributions tables in the partner_id column. This migration will look up the appropriate partner name associated with each of these UUIDs and store the partner name directly in the user_agent column within each table. If an error occurs during the partner ID to partner name conversion, the partner ID value will be migrated to user_agent instead.
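As a rough sketch of the per-table conversion (partnerNames is a hypothetical UUID-to-name lookup map; handling of unknown IDs is elided):
```
package main

import "database/sql"

// migratePartnerInfo writes the partner name into user_agent for every
// row whose partner_id matches a known partner; rows whose ID cannot be
// converted keep the raw partner ID instead (handled separately).
func migratePartnerInfo(db *sql.DB, partnerNames map[string]string) error {
	tables := []string{"users", "projects", "api_keys", "bucket_metainfos", "value_attributions"}
	for _, table := range tables {
		for id, name := range partnerNames {
			if _, err := db.Exec(
				`UPDATE `+table+` SET user_agent = $1 WHERE partner_id = $2`,
				[]byte(name), id,
			); err != nil {
				return err
			}
		}
	}
	return nil
}
```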
A note on the migration test data, postgres.v182.sql:
With one exception, all preexisting rows in the relevant tables had a NULL partner_id. Therefore, we needed to insert new rows with partner_id set under the OLD DATA section in order to test that the migration works. For each affected table, we insert one row with a valid partner ID which has a corresponding partner name, and one row with a partner ID which would return an error during the conversion to the partner name.
Change-Id: Iad977d72df0ce95a0c5ca80a065c4276ec1f2354
The main motivation is to wrap the bucket DB and metainfo DB, so we
could check if a bucket is empty before applying geofencing config.
Change-Id: I8bac21555e01d51a663fb557bc1acfc8106bc2e1
To allow for changing limits for new users, while leaving existing users limits as they are, we must store the project limits for each user. We currently store the limit for the number of projects a user can create in the user DB table. This change would also store the project bandwidth and storage limits in the same table.
Change-Id: If8d79b39de020b969f3445ef2fcc370e51d706c6
Alter the satellite DB nodes table to add a `disqualification_reason` int column,
which contains the disqualification type enum indicating why the node was disqualified.
Change-Id: Ia514557018ca27e1984216dc5004346d59869d16
We had a bug in the stray nodes chore where nodes that had not been seen
in several months were not being DQd. We figured out that this was
happening because we were using two queries: the first to grab
nodes where last_contact_success < some cutoff, the second to DQ them
unless last_contact_success == '0001-01-01 00:00:00+00'. The problem
is that if all of the nodes returned from the first query had a
last_contact_success of '0001-01-01 00:00:00+00', we would pass them to
the second query, which would not DQ them. This would result in the stray
nodes DQ loop ending early, since the number of nodes found to DQ was less
than the limit.
The fix: add the "WHERE last_contact_success != '0001-01-01
00:00:00+00'::timestamptz" to the selection query.
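For illustration, the selection query with the fix applied might look like this (the disqualified filter and the limit parameter are assumptions):
```
package main

import (
	"context"
	"database/sql"
	"time"
)

// findStrayNodes excludes the never-contacted sentinel timestamp up
// front, so the follow-up DQ query can no longer be handed a batch it
// refuses to disqualify.
func findStrayNodes(ctx context.Context, db *sql.DB, cutoff time.Time, limit int) (*sql.Rows, error) {
	return db.QueryContext(ctx, `
		SELECT id FROM nodes
		WHERE last_contact_success < $1
		  AND last_contact_success != '0001-01-01 00:00:00+00'::timestamptz
		  AND disqualified IS NULL
		LIMIT $2
	`, cutoff, limit)
}
```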
Change-Id: I4e60de90b68d8745d641b4467c2b23e0e56f7dff
Even though we want to start charging a segment fee instead of an object
fee, it's hard for users to understand what a segment is. This PR adds
the object count back to the UI alongside the segment count to help
address the issue.
Change-Id: I92eb42c769d350eba68a72443deffec5c278359c
table and drop not null constraint on objects column
Since we want to move from charging our customers by object count to
segment count, this PR prepares the database to be able to record segment
count instead of object count for the satellite's billing system.
Change-Id: Ie91ef354e78d24a268bc1cdc4327c182f733321e
listPendingTransitionShim is a temporary transition shim intended to
make existing API processes keep working when a future DB schema change
is executed. For more explanation, see the message on commit c053bdbd70.
However, the shim has a small bug: it is missing the ORDER BY clause
that appears in the original ListPending method. This transition shim
code won't ever run until we make the DB schema change, so this bug
hasn't hurt anything yet; it's just important that we fix it before the
DB schema change happens.
Change-Id: I5953651583ee236500c2c07141dfc9d690a95118
This update is to set up users being able to register with a promo code added to their account in place of the free tier coupon.
Change-Id: I7badf87937b12664f145520b6dcc4b26fe750407
We don't need the column uses_segment_transfer_queue in graceful_exit_progress
anymore, as all exiting nodes now use graceful_exit_segment_transfer_queue and
the table graceful_exit_transfer_queue has been dropped.
Change-Id: I4b7c087433f04138cf09bcf8ad3d8de2c185502a
Multipart upload limits added. Last part has no size limit.
Max number of parts: 10000, min part size: 5 MiB
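For illustration, a tiny sketch of enforcing those limits (the function and its parameters are assumptions):
```
package main

import "errors"

const (
	maxParts    = 10000
	minPartSize = 5 << 20 // 5 MiB
)

// validatePart rejects part numbers above the cap and undersized parts,
// except that the last part may be smaller than 5 MiB.
func validatePart(partNumber int, size int64, isLast bool) error {
	if partNumber > maxParts {
		return errors.New("too many parts")
	}
	if !isLast && size < minPartSize {
		return errors.New("part below minimum size")
	}
	return nil
}
```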
Change-Id: Ic2262ce25f989b34d92f662bde720d4c4d0dc93d
The union query for both reputable and new nodes didn't work properly. The
top-level query is required to have an `AS OF SYSTEM TIME` clause as
well.
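A sketch of the fixed query shape (columns and filters are illustrative assumptions):
```
package main

// nodeSelectionSQL: both branches of the union and the wrapping query
// carry the same AS OF SYSTEM TIME clause.
const nodeSelectionSQL = `
	SELECT * FROM (
		SELECT id, address FROM nodes AS OF SYSTEM TIME '-10s'
			WHERE vetted_at IS NOT NULL -- reputable
		UNION
		SELECT id, address FROM nodes AS OF SYSTEM TIME '-10s'
			WHERE vetted_at IS NULL -- new
	) selection
	AS OF SYSTEM TIME '-10s'
`
```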
Change-Id: I8ee6dd5b700c2b1ed2aa562962bfa72be7eec30a
Uplink needs only part of the columns we are reading from the DB.
To improve performance we should read only those that are
really needed.
Change-Id: Ib39259318169c46afe5fa4c6ce2184da82e960c8
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.
Change-Id: I5671722a5fc6e1749d3a49e187a56556000ff941
We don't use this column for anything. If you want to know if a node is
contained, you can check the pending_audits table.
Change-Id: I8da1d8e01a2dcaff63c5067a7927b5451424ad04
Removes database tables and functionality related to our custom
coupon implementation because it has been superseded by the Stripe
coupon and promo code system. Requires implementations of the
payments Invoices interface to return coupon usages along with
invoices.
Change-Id: Iac52d2ff64afca8cc4dbb2d1f20e6ad4b39ddfde
Cockroach doesn't like concurrent database creation, add a limiter to
avoid overloading the database.
Also limit the cockroach testing to the last 10 migrations.
Change-Id: Ie73c516ac2d362b251f605049e51eb25888432bd
Populate the egress_dead column to take into account allocated bandwidth that can be removed because orders have been sent by the storage nodes. The bandwidth not used in these orders can be allocated again.
Change-Id: I78c333a03945cd7330aec052edd3562ec671118e
Drop table graceful_exit_transfer_queue which is not used anymore (replaced by graceful_exit_segment_transfer_queue).
Change-Id: Ie254fe9a54fb0784e350a439ce7a9bc99a3a58b5
Migration tests are very heavy on database schema changes, which may
cause delays and retries. Separate out the migration tests and ensure
that they do not run concurrently on the same database.
Change-Id: I35b17525f18fd923546ce1fcc12d805c95073b6b
Why: big.Float is not an ideal type for dealing with monetary amounts,
because no matter how high the precision, some non-integer decimal
values can not be represented exactly in base-2 floating point. Also,
storing gob-encoded big.Float values in the database makes it very hard
to use those values in meaningful queries, making it difficult to do
any sort of analysis on billing.
Now that we have amounts represented using monetary.Amount, we can
simply store them in the database using integers (as given by the
.BaseUnits() method on monetary.Amount).
We should move toward storing the currency along with any monetary
amount, wherever we are storing amounts, because satellites might want
to deal with currencies other than STORJ and USD. Even better, it
becomes much clearer what currency each monetary value is _supposed_ to
be in (I had to dig through code to find that out for our current
monetary columns).
Deployment
----------
Getting rid of the big.Float columns will take multiple deployment
steps. There does not seem to be any way to make the change in a way
that lets existing queries continue to work on CockroachDB (it could be
done with rules and triggers and a stored procedure that knows how to
gob-decode big.Float objects, but CockroachDB doesn't have rules _or_
triggers _or_ stored procedures). Instead, in this first step, we make
no changes to the database schema, but add code that knows how to deal
with the planned changes to the schema when they are made in a future
"step 2" deployment. All functions that deal with the
coinbase_transactions table have been taught to recognize the "undefined
column" error, and when it is seen, to call a separate "transition shim"
function to accomplish the task. Once all the services are running this
code, and the step 2 deployment makes breaking changes to the schema,
any services that are still running and connected to the database will
keep working correctly because of the fallback code included here. The
step 2 deployment can be made without these transition shims included,
because it will apply the database schema changes before any of its code
runs.
Step 1:
No schema changes; just include code that recognizes the
"undefined column" error when dealing with the
coinbase_transactions or stripecoinpayments_tx_conversion_rates
tables, and if found, assumes that the column changes from Step
2 have already been made.
Step 2:
In coinbase_transactions:
* change the names of the 'amount' and 'received' columns to
'amount_gob' and 'received_gob' respectively
* add new 'amount_numeric' and 'received_numeric' columns with
INT8 type.
In stripecoinpayments_tx_conversion_rates:
* change the name of the 'rate' column to 'rate_gob'
* add new 'rate_numeric' column with NUMERIC(8, 8) type
Code reading from either of these tables must query both the X_gob
and X_numeric columns. If X_numeric is not null, its value should
be used; otherwise, the gob-encoded big.Float in X_gob should be
used. A chore might be included in this step that transitions values
from X_gob to X_numeric a few rows at a time.
Step 3:
Once all prod satellites have no values left in the _gob columns, we
can drop those columns and add NOT NULL constraints to the _numeric
columns.
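As a sketch of the transition-shim dispatch (42703 is the standard Postgres/CRDB "undefined_column" code; the accessor names are illustrative):
```
package main

import (
	"errors"

	"github.com/jackc/pgconn"
)

// isUndefinedColumn reports whether a query failed because the step-2
// schema changes have already been applied.
func isUndefinedColumn(err error) bool {
	var pgErr *pgconn.PgError
	return errors.As(err, &pgErr) && pgErr.Code == "42703"
}

// Hypothetical usage inside a coinbase_transactions accessor:
//
//	amount, err := getAmountGob(ctx, txID)
//	if isUndefinedColumn(err) {
//		amount, err = getAmountTransitionShim(ctx, txID)
//	}
```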
Change-Id: Id6db304b404e6fde44f5a8c23cdaeeaaa2324f20
Why: big.Float is not an ideal type for dealing with monetary amounts,
because no matter how high the precision, some non-integer decimal
values can not be represented exactly in base-2 floating point. Also,
storing gob-encoded big.Float values in the database makes it very hard
to use those values in meaningful queries, making it difficult to do
any sort of analysis on billing.
For better accuracy, then, we can just represent monetary values as
integers (in whatever base units are appropriate for the currency). For
example, STORJ tokens or Bitcoins can not be split into pieces smaller
than 10^-8, so we can store amounts of STORJ or BTC with precision
simply by moving the decimal point 8 digits to the right. For USD values
(assuming we don't want to deal with fractional cents), we can move the
decimal point 2 digits to the right.
To make it easier and less error-prone to deal with the math involved, I
introduce here a new type, monetary.Amount, instances of which have an
associated value _and_ a currency.
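To make the idea concrete, a simplified stand-in for the new type (the real monetary.Amount API may differ):
```
package main

// Currency carries the number of decimal places used for base units:
// 8 for STORJ/BTC, 2 for USD (no fractional cents).
type Currency struct {
	Name          string
	DecimalPlaces int32
}

// Amount is an integer count of base units plus its currency, so no
// base-2 floating point is ever involved.
type Amount struct {
	baseUnits int64
	currency  Currency
}

// BaseUnits returns the integer stored in the database.
func (a Amount) BaseUnits() int64 { return a.baseUnits }

// Example: $12.34 is 1234 base units of USD.
var example = Amount{baseUnits: 1234, currency: Currency{"USD", 2}}
```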
Change-Id: I03395d52f0e2473cf301361f6033722b54640265
This PR utilizes the new burst limit column from the projects table to allow
control over the requests-per-second limit and the token bucket size.
When no burst limit is explicitly set, the rate limit is applied to both, so
we don't additionally limit how quickly requests can be made within a second.
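For illustration, a minimal sketch of that behavior using golang.org/x/time/rate (the function name and the nullable-burst representation are assumptions):
```
package main

import "golang.org/x/time/rate"

// newProjectLimiter applies the rate limit as the bucket size when no
// burst limit is set, so a full second's worth of requests may arrive
// at once; an explicit burst value overrides the bucket size.
func newProjectLimiter(requestsPerSecond float64, burst *int) *rate.Limiter {
	b := int(requestsPerSecond) // default: burst equals the per-second rate
	if burst != nil {
		b = *burst // explicit burst limit from the projects table
	}
	return rate.NewLimiter(rate.Limit(requestsPerSecond), b)
}
```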
Change-Id: I883235c60c5d6416aeadd1c80ed2ebd193aa4d9f
In order to limit the overall number of requests a user can issue in a
time span, we need the ability to define such a limit separately
from the per-second request rate.
This PR adds a new column on the projects table to store the burst limit
per project.
Change-Id: I7efc2ccdda4579252347cc6878cf846b85146dc7
Remove the logic associated with the old transfer queue.
A new transfer queue (gracefulexit_segment_transfer_queue) was created for the migration to segmentLoop.
Transfers from the old queue were not moved to the new queue.
Instead, the old queue was still used for nodes which had initiated graceful exit before the migration.
There is no such node left, so we can remove all this logic.
In a later step, we will drop the table.
Change-Id: I3aa9bc29b76065d34b57a73f6e9c9d0297587f54
nodes and audit_history tables
This PR removes all code references to the audit_histories table and the
```
audit_reputation_alpha, audit_reputation_beta,
unknown_audit_reputation_alpha, unknown_audit_reputation_beta,
```
columns from the nodes table.
It also drops the audit_histories table from the db, since the code
referencing it is not currently being used.
Change-Id: Ifcda8db36afb3a333d487ff831f2fdefc8b02a4c
gracefulexit
With the effort to move audit-related data into the reputation store, this
PR updates the gracefulexit endpoint to use the reputation service to get a
node's audit score.
Change-Id: Iad93ea689ad67ff9c57c7be16687e21e715fab7a
Currently, the reputation table is only populated when a node has been
audited. This is ok in production; however, a lot of our tests don't
upload any data or trigger audits.
This PR adds an initialization step in testplanet to populate the reputation
table with zero values for node reputation.
Change-Id: I11b381236669db346dc68a48a6d4a27334a0a8b8
package in audit
This PR implements a reputation store and replaces overlay in the audit service
with that store for storing nodes' audit stats.
In order to keep the changeset smaller, most of the changes in this PR copy the audit logic in overlay into the
reputation package. In a following PR, the duplicated code will be
removed from overlay.
Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
Added MFA passcode and recovery code field for token requests.
Added endpoints for MFA-related activity: enabling MFA,
disabling MFA, generating a new MFA secret key, and
generating new MFA recovery codes.
Change-Id: Ia1443f05d3a2fecaa7f170f56d73c7a4e9b69ad5
To be able to use the segment metainfo loop, graceful exit transfers have to include the segment stream_id/position instead of the path. For this, we created a new table, graceful_exit_segment_transfer_queue, that will replace graceful_exit_transfer_queue. The table was created in a previous migration and made accessible through the graceful exit db in another one.
This change makes graceful exit enqueue transfer items for newly exiting nodes in the new table.
Change-Id: I7bd00de13e749be521d63ef3b80c168df66b9433
To be able to use the segment metainfo loop, graceful exit transfers have to include the segment stream_id/position instead of the path. For this, we created a new table, graceful_exit_segment_transfer_queue, that will replace graceful_exit_transfer_queue. The table was created in a previous migration.
This change gives access to this table.
Graceful Exit doesn't use the table yet; this will be done in a later change.
Change-Id: I6c09cff4cc45f0529813a8898ddb2d14aadb2cb8
Bucket tally calculation will be removed from metaloop and will
use the metabase objects iterator directly.
At the moment only bucket tally needs objects, so it makes no sense
to implement a separate objects loop.
Change-Id: Iee60059fc8b9a1bf64d01cafe9659b69b0e27eb1
Columns for MFA status, secret key, and JSON-encoded array of
recovery codes are added to the users table.
Change-Id: Ifed7e50ec9767c1670d9682df1575678984daa60
When a user adds a credit card, switch them to the paid tier and update
their projects with new bandwidth/storage limits. New projects for the
paid tier user will also have the updated limits.
The new limits are:
* storage per project - 50 GB free/25 TB paid
* bandwidth per project - 50 GB free/100 TB paid
Change-Id: I7d6467d077e8bb2bbe4bcf88ab8d75490f83165e
The previous query was making a full table scan. This modifies the code to
do the queries separately on each action. It will probably be slower on
a small table; however, there should be a boost of several orders of
magnitude on large tables.
Change-Id: Ib8885024d8a5a0102bbab4ce09bd6af9047930c9
We want to calculate bucket tally only from iterating objects.
Objects currently have info about totals for bytes and segments.
We need to adjust tallies to keep those totals. Older entries will
be untouched and the code will use totals only if available. This change
adds columns for totals to the bucket_storage_tally table and
adds general handling for them.
The next step is to start using the total columns instead of inline/remote.
This will be done in the next change.
Change-Id: I37fed1b327789efcf1d0570318aee3045db17fad
We want to use StreamID/Position to identify an injured
segment. As it is hard to alter the existing injuredsegments
table, we are adding a new table that will replace the existing
one. The old table will be dropped later.
Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
So that we can easily see whether a user is in the paid tier without
querying for payment methods.
Change-Id: I122566ddd0953203f852741fa12c71795bc1ec5c
Period end was calculated incorrectly: it was still in the current month,
but it should be the first day of the next month.
Change-Id: I37451d29a9b901b69e6c3c401b333c58b3376d61
Adding AS OF SYSTEM TIME to the query that is calculating project bandwidth.
In addition, a method for setting the interval is added, as the test doesn't
work well with the default interval.
Change-Id: Id1e15be4f6afff13b9dc2b7f595e2edb6de28db9
Currently, pending audit finds a segment by segment location
(path). Because we want to move audit to the segmentloop, where we will
have only StreamID and Position, we need to add columns for those
fields. Altering the existing table can cause issues during
migration and deployment, so the cleaner choice is to make a new table.
This change contains a migration with the new segment_pending_audit
table that will replace the pending_audits table, and adjustments
to use the new table in the code.
The pending_audits table will be dropped in the next release.
Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f
The table graceful_exit_segment_transfer_queue will be used to replace graceful_exit_transfer_queue. The latter currently uses the path of a segment to keep track of pieces to be transferred. As we want to use the segment metainfo loop, we will need to record the stream_id and position of the segment instead of relying on the object path.
This change also adds a uses_segment_transfer_queue column to the graceful_exit_progress table, to be able to know whether a transfer was initiated while using the old table.
Change-Id: Iafb1e8e65ba124e20de4a9ff76da181c3222de7e
Define service and DB interface for storing node reputation data
and updating the overlay cache.
Add overlay service and DB method UpdateReputation.
See https://github.com/storj/storj/pull/4144
Change-Id: Iedd8bd3274457d26c595919303d55327c1464b8c
The reputation table duplicates the reputation information in the
nodes table. It will be used for implementing the reputation
service.
Change-Id: I36c0318e8fa5f535e9d527df95b22a4f9eb365d4
This is part of the metaloop refactoring. We plan to remove
irreparable at some point, but there was no time for it.
Instead of refactoring it for segmentloop, it's just easier
to drop it.
Later we still need to drop the table with a migration step.
Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
The redis key associated with bandwidth usage depended on the current month.
This change makes it depend also on the day, so that we update bandwidth usage
daily to take into account changes associated with expired but not used allocated bandwidth.
It also adds a test to check that allocated but not settled bandwidth is not counted after 2 days.
Change-Id: Iee9dbe3517cc3b9825438360b276a07a43dfbc64
Migration step for adding an 'egress_dead' column to the project_bandwidth_daily_rollups table.
It will be used to track bandwidth allocation that won't be consumed,
because the corresponding order has already been processed and has a settled
bandwidth amount lower than the order limit (allocated bandwidth).
Change-Id: Ic07592e69292ae2076e69f6038bb0e0fae79b271
Full prefix: web/satellite, satellite/{console, analytics, satellitedb}
- checkbox added to register view - business tab
- user being saved with new column
- add sales contact choice to Segment calls
- ui fix added to employee count dropdown
Change-Id: Ib976872463b88874ea9714db635d58c79cdbe3a1
We are not using this table, so it makes no sense to put data there.
This change removes only the code that is using this table. Before the next
release we need to drop the table with a migration step.
Change-Id: I80f400aa778c717e70324bd00da502b7032c9d9f
Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.
Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
Because of recent changes to how coupons for the free tier are handled
(see commit 4c0817bcfb), we no longer want all these $10
non-expiring coupons. After coupons are applied during invoice
generation, if a customer does not have any valid (non expired, non
consumed) coupons, a new promotional coupon is applied.
We could just wait for users to consume all $10 of the non-expiring
coupons, and the new promotional coupon would be applied for the
following billing cycle, but this gets tricky, because if in the final
month, the user is billed for $1 of usage, but only $0.5 of the $10
non-expiring coupon is remaining, the user will be charged for the
remaining $0.5. With the new promotional coupon of $1.65, expiring every
month, this would not be an issue.
So long story short, this commit migrates all non-expiring coupons to
expire within 2 billing periods (all existing non-expiring coupons in
prod were created in early April or later). That way, there is still
enough value in the $10 coupon that we don't have to worry about
customers exceeding it, and the coupons will expire. Then we'll
immediately apply the $1.65 coupon for the next month! And then
hopefully this unfortunate situation will come to a pleasant end.
Change-Id: I8a593948d8876c41a71d886b9a95d4e2c802b4f3
Replace GetProjectAllocatedBandwidth with GetProjectBandwidth, which calculates
used bandwidth from allocated and settled bandwidth recorded in the
project_bandwidth_daily_rollups table.
For each day in the month, if the allocated bandwidth is expired, it uses the
settled bandwidth for computing used bandwidth.
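A rough sketch of the described computation (field names and the order-expiration parameter are assumptions):
```
package main

import "time"

type dailyRollup struct {
	Day       time.Time
	Allocated int64
	Settled   int64
}

// usedBandwidth sums each day's usage: for days old enough that any
// unsettled orders have expired, only settled bandwidth counts; for
// recent days the full allocation is still reserved.
func usedBandwidth(rollups []dailyRollup, now time.Time, orderExpiration time.Duration) int64 {
	var used int64
	for _, r := range rollups {
		if now.Sub(r.Day) > orderExpiration {
			used += r.Settled
		} else {
			used += r.Allocated
		}
	}
	return used
}
```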
Change-Id: Ife723c6d5275338f470619631acb25930d39ac3c
project_bandwidth_daily_rollups table
We want to calculate used bandwidth better, so we need to calculate it
from allocated and settled bandwidth. To do this, we first need to populate
this new table.
https://storjlabs.atlassian.net/browse/PG-56
Change-Id: I308b737bf08ee48ce4e46a3605697ab2095f7257
Improve UpdateCheckIn on a contended row:
name old time/op new time/op delta
UpdateCheckInContended-100x-32 2.29s ±55% 0.17s ±61% -92.45% (p=0.008 n=5+5)
Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.
Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.
Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
Initially metabase was developed separately, and it was useful to have a
separate environment flag for tests; however, it's more convenient to
use the same one as the rest of the testsuite.
Change-Id: Ia4d79be27ce5911cbae68d57cdf0b30f63459444
We were duplicating code for AS OF SYSTEM TIME in several
places. This replaces that code with a method on
dbutil.Implementation.
As a consequence it's more useful to use a shorter name for
implementation - 'impl' should be sufficiently clear in the context.
Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish
between the two modes is useful and slightly shorter without causing
confusion.
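A simplified stand-in showing the shape of such a method (dbutil's actual signatures may differ):
```
package main

import (
	"fmt"
	"time"
)

// Implementation is a simplified stand-in for dbutil.Implementation.
type Implementation int

const (
	Postgres Implementation = iota
	Cockroach
)

// AsOfSystemInterval returns the clause for CockroachDB and an empty
// string otherwise, so callers can append it to queries unconditionally.
func (impl Implementation) AsOfSystemInterval(interval time.Duration) string {
	if impl != Cockroach || interval >= 0 {
		return ""
	}
	return fmt.Sprintf(" AS OF SYSTEM TIME '%s' ", interval)
}
```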
Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d
Use the 'AS OF SYSTEM TIME' CockroachDB clause for the Graceful Exit
(a.k.a. GE) queries that count and delete the GE queue items of nodes
which have already exited the network.
Split the subquery used for deleting all the transfer queue items of
nodes which have exited when CRDB is used, and batch the queries, because
CRDB struggles when executing them in a single query, unlike Postgres.
The new test added in this commit to verify the CRDB batch logic for
deleting all the transfer queue items of exited nodes revealed that the
Enqueue method also has to run in batches when CRDB is used; otherwise
CRDB returns the error "driver: bad connection" when a big amount of
items is passed to be enqueued. This error didn't happen with the
current test implementation; it happened with an initial one that
created a big amount of exited nodes and transfer queue items for those nodes.
Change-Id: I6a099cdbc515a240596bc93141fea3182c2e50a9
multiregion satellites have complex database connection strings
largely due to using a different backend for the repair queue than
cockroach.
billing stuff didn't work right with this.
Change-Id: Ie8759a8c47e71347c3a190abfc9d53945d7b8855
When audits are being recorded, we automatically add some SQL to remove
the node from the pending audits table in case it exists. They are
removed from pending audits even if the node was offline for the audit.
This is not the correct behavior.
Add statement to record audit results in reverify tests to ensure no
more false positives.
Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17
The DBs of our production satellites have some indexes that we didn't
have in the migrations, because at that time our migration test was not
able to deal with Cockroach indexes with the STORING clause.
We have recently modified the storj.io/private/dbutil/pgutil package to
support the CRDB STORING clause, so we are adding the missing indexes
to our migrations in order to have them if we have to recover a DB
from scratch or deploy a new satellite DB.
Change-Id: I686ff84e5b4c02d9615f50fa531261363affefb8
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example, a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
The previously configured never-expiring coupon does not refill every
month. Eventually, even though it never expires, it will run out. This
commit makes several small changes to address this issue for the free
tier:
* Change the config for the promotional coupon to be $1.65 for 1 month
(the change from $10 to $1.65 is due to our recent pricing changes)
* Update PopulatePromotionalCoupons (PPC for brevity) to add promotional
coupons to users with expired and consumed coupons (all users with a
project and no active coupons should get a new coupon when PPC is called)
* Call PPC at the end of the `create-invoice-coupons` stage of invoice
generation - after current coupons are processed and expired/exhausted.
* Remove legacy admin functionality for PPC from satellite/console - we
do not currently use it, but if we did, it should be in satellite/admin
instead.
Change-Id: I77727b97bef972df32ebb23cdc05055827076e2a
For business accounts we need to track the sales contact.
It will be a question to business accounts during onboarding.
Change-Id: I8d101ce1b52091478dfb0ddd875e1cc717d765d3
Initially there were pkg and private packages, however for all practical
purposes there's no significant difference between them. It's clearer to
have a single private package - and when we do get a specific
abstraction that needs to be reused, we can move it to storj.io/common
or storj.io/private.
Change-Id: Ibc2036e67f312f5d63cb4a97f5a92e38ae413aa5
cache is a really common variable and type name, and we have already used
the package name alias in multiple places.
Change-Id: I6435785b7549b541d533de59ec94557b9bd11e04
Initially we duplicated the code to avoid large-scale changes to
the packages. Now that we are past the metainfo refactor, we can remove
the duplication.
Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
When nodes check in for the very first time, if the satellite can't ping
them back, they are inserted into the nodes table with
last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes
chore runs before the node can fix their problem, they are DQd.
Solution: when DQing stray nodes, don't DQ where last_contact_success =
'0001-01-01 00:00:00+00'::timestamptz
Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
The new default promotional coupon is $10/month, and doesn't expire.
This change also migrates the coupon.duration column over to the new
coupon.billing_periods, and switches to rely completely on
billing_periods.
Change-Id: Ic3341e9fa4040449bab5e66ca4ee2640b095cf3d
Currently we can get an error about a duplicated entry while inserting into the value_attributions table. This change turns the simple insert into an insert that does nothing on conflict.
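For illustration, a sketch of the conflict-tolerant insert (the exact column set is an assumption):
```
package main

import (
	"context"
	"database/sql"
)

// insertAttribution is a no-op when the (project_id, bucket_name) row
// already exists, instead of returning a duplicate-entry error.
func insertAttribution(ctx context.Context, db *sql.DB, projectID, bucketName, userAgent []byte) error {
	_, err := db.ExecContext(ctx, `
		INSERT INTO value_attributions (project_id, bucket_name, user_agent, last_updated)
		VALUES ($1, $2, $3, now())
		ON CONFLICT (project_id, bucket_name) DO NOTHING
	`, projectID, bucketName, userAgent)
	return err
}
```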
Change-Id: I3efd8dc0b63115e8e2ed8f4196ccf969ee942295
* Add a nullable billing_periods column in the coupons table
* Add nullable billing_periods column to the currently unused
coupon_codes table
* Drop the duration column from the coupon_codes table
* Replace duration config type so that the default promotional coupon
can be configured to never expire
Zero downtime migration plan:
* Add billing_periods column to coupons and coupon_codes tables (this change)
* After one release, remove all references to the old duration column,
replacing them with references to billing_periods. At this point, we can also
change the default promotional coupon to never expire and migrate over
values from the old duration column.
* After another release, drop the duration column.
Change-Id: I374e8dc9fab9f81b4a5bc681771955662d4c007a
instead of only generating invoices for nodes that had some
activity, we generate one for every node so that we can find
and pay terminal nodes that did not meet thresholds before
we recognized them as terminal.
Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.
If the flag is set to true, nodes will be suspended if they meet the
requirements.
If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.
Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
Previously we would select a limited number of nodes for DQ in a
CTE and run the update on that set in a single transaction. This
could lead to locking on the table, so instead we select and update
in separate transactions.
Change-Id: I1e802c0845e829eeadcee4fa382f58462515fdb1
This is one step for implementing the free tier:
* Change the default project limit from 10 to 3
* Move storage and bandwidth project usage limits from the metainfo
package to the console package (otherwise there is a cyclical
dependency, and metainfo doesn't use these values anyway)
* Change the default storage usage limit per project from 500gb to 50gb
* Change the default bandwidth usage limit per project from 500gb to 50gb
* Migrate the database so that old users and projects continue to have
the old defaults (10 projects/500gb usage)
Change-Id: Ice9ee6a738bc6410da18c336c672d3fcd0cab1b9
We would like to log Node IDs and last contact successes of nodes DQd
in this manner. We would also like to avoid returning an unbounded list
of items from the db. Therefore we change the query to select a limited
number of nodes that meet the DQ conditions and iterate until 0 rows are
returned. Each column of the query is already indexed.
Change-Id: Iaec2d9b56e7202b7c2028ba21750d40c8dd506ee
The coupon_codes table will allow for administrators to create new promo
codes associated with coupon information (amount, duration, etc...).
A user will be able to enter a promo code (aka coupon code) in order to
apply a new coupon to their account. The coupon in the coupons table is
linked to the template defined in the coupon_codes table.
Change-Id: I50e49fa92afbc6aa9d01d8a895c069efb59e472b
Migration step 148 will cause errors because we missed some
references to the columns being dropped. Removing the step
altogether causes problems with backwards compatibility tests
because the change already exists in the latest release tag.
To circumvent, we change v148 to an empty migration.
Add methods FindTable and RemoveColumn in private/dbutil/dbschema
Change-Id: Ia527e95b88a88c5dc82800928ce6f8cfb879e334
cockroach is having problems with huge transactions and
having them complete before timeouts or whatever, so
do smaller transactions.
because we can have partial recording of payments which
are not unique, we have to do a thing where we read and
check if it already exists before writing. this is not
concurrency safe.
Change-Id: Ia7d59499a43ce6d70cb2a23754edbdd1b643ef1a
The new 'consistency ge-cleanup-orphaned-data' cli command deleted
orphaned transfer queue items, but not entries in the
graceful_exit_progress table. This will delete orphaned entries
from the exit progress table too.
Change-Id: I5f927aac1f258490678deaf179be92ccfe10fcd8
Turns out that many methods generated with dbx were not used at all.
Let's remove them.
As a next step we can think about dropping tables like:
* user_credit
* offer
Change-Id: Id6cda81a701348db2a6b8b26daa22ae9c4f87cb4
Pregenerate the database schema we should use for most tests.
Currently, Cockroach is slow with regards to migration and it's
better if it happens in as few transactions as possible.
This reduces test time from ~21min to ~15min.
Change-Id: Ife8117053e6b9ecf3c93fe63677edf15d4d7c254
Testing interfaces is slightly clearer when it's in the package needing
the database rather than each individual implementation.
Change-Id: I10334c214a205f7e510b939b4359a2214c4e060a
This PR removes all back-end related referral program code including the
marketing portal.
We will have a separate PR for the front-end code and the database migration to
drop the `offers` and `usercredits` tables
Change-Id: If59f952cddfe0558a7dc03a0eac7cc1081517f88
These changes are independently tracked on
https://github.com/storj/storj/tree/jt/migration-reorder
The point of this is to make the distributed column
migration, needed for SNO invoice generation, the very
next one, so we can release it as a point release.
Change-Id: I26e1c03629c4f079b9ad12485e2b71a715d82b3b
Limit bucket name lookup to date range of the calling methods since we only need distinct bucket names for that time period.
Adds new index and removes an index specific to project ID since it is no longer needed.
Change-Id: Ic07bbfb1c32280e0c0e39f8da020b284e1e5d974
Fix an issue, caused by a copy-paste problem, that made the Graceful Exit
test flaky.
The test uses a time created at the beginning of the test to avoid
nondeterministic time differences due to variation in DB query response
times. However, some parts of the test were using the current time rather
than this base time; these have been addressed.
Change-Id: I4786f06209e041269875c07798a44c2850478438
Delete satellite order methods and DB tables which aren't used anymore
after the refactoring of the orders to store bucket
information in the orders' encrypted metadata.
There are also configuration parameters and a satellite chore that
aren't needed anymore after the orders refactoring.
Change-Id: Ida3682b95921df70792284b42c96d2508bf8ca9c
Add a command to the satellite for cleaning up the Graceful Exit (a.k.a
GE) transfer queue items of nodes that have exited.
The commit adds to the GE satellite DB a couple of new methods, and its
corresponding test, for performing the operations of the new command.
Change-Id: I29a572a59689d63b24990ac13c52e76d65aaa917
using redash i manually checked that the only times the sum of
the payments does not match the paid column is for 2020-12 and
if it does not match then there are no payments.
Change-Id: I71ce0571de7e38e21548d7d6757b25abc3bfa781
The rollup archiver chore moves bucket bandwidth rollups and
storagenode rollups that are older than a given duration
to two new archive tables.
Change-Id: I1626a3742ad4271bc744fbcefa6355a29d49c6a5
This index is obsolete and duplicates a similar (project_id, name)
index on the same table.
Moreover, it might confuse CockroachDB as to which of the two indexes to use,
which might affect DB performance.
Change-Id: If8d1df8347714942cea9dca82864ba5f4973bed3
When we observed the value for total piece sizes stored in the network,
we were doing it after converting them to byte-hours, rather than using
the actual piece sizes. This fixes that issue.
Change-Id: I1564d21b519f70eb59f298d97dbd777baf127723
Add ProjectsCursor type for pagination.
Add PageCount, CurrentPage, and TotalCount to ProjectsPage.
This allows us to mimic the logic of GetBucketTotals and the
implementation of BucketUsages in graphql for the new ProjectsByOwnerID
functionality.
Change-Id: I4e1613859085db65971b44fcacd9813d9ddad8eb
Full scope:
private/testplanet,satellite/{overlay,satellitedb}
Description:
In most cases, downtime tracking with audits will eventually lead
to DQ for nodes who are unresponsive. However, if a stray node has no
pieces, it will not be audited and will thus never be disqualified.
This chore will check for nodes who have not successfully been contacted
in some set time and DQ them.
There are some new flags for toggling DQ of stray nodes and the timeframes
for running the chore and how long nodes can go without contact.
Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615
This is the first step in the removal of uptime columns on the
nodes table. These columns are no longer used:
uptime_success_count
total_uptime_count
uptime_reputation_alpha
uptime_reputation_beta
In order to avoid breaking backwards compatibility, we need to
remove all references to these columns before removing the columns
themselves from the database. However, since uptime_success_count
and total_uptime_count are NOT NULLABLE, we can't remove them from
the insert statements in the overlay. So we can't remove the columns
because of the references, and we can't remove the references because
the columns can't be null. What a pickle. To remedy this, we will set a
default on the columns. Then we should be able to remove them from the
insert statements.
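A sketch of that interim step (a minimal form; the real migration step wraps this differently):
```
package main

import "database/sql"

// addUptimeDefaults gives the NOT NULL uptime columns defaults so the
// overlay can stop mentioning them in INSERTs before the columns are
// dropped in a later release.
func addUptimeDefaults(db *sql.DB) error {
	for _, stmt := range []string{
		`ALTER TABLE nodes ALTER COLUMN uptime_success_count SET DEFAULT 0`,
		`ALTER TABLE nodes ALTER COLUMN total_uptime_count SET DEFAULT 0`,
	} {
		if _, err := db.Exec(stmt); err != nil {
			return err
		}
	}
	return nil
}
```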
Change-Id: I75f6c56fb7897835bbf29869f86f39de1d9dd345
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system, which has been removed.
These methods should also be removed.
Change-Id: Idb829d742e1f987e095604423fff656fe581183e
* Separate audit history interface into its own file in the overlay
package
* Add overlay.AuditHistory struct so that internalpb.AuditHistory is
only used from within the database layer
* Add overlay.GetAuditHistory function for features that will require
access to detailed audit history information
* Do not return full audit history from UpdateAuditHistory - callers to
that function only need to know the online score and whether a full
tracking period has been completed
* Move audit history tests out of satellite/satellitedb, since they are
independent of database implementation
Change-Id: I35b0c4ac23bbaabd80624f8a9631c3cb1a1f33bd
Now that the deprecated downtime tracking service is removed
(3fc76f4ffe), we can safely remove
the nodes_offline_times table.
Change-Id: Ia7c6efe32ba104dff5a830af5f2beee3337eefe5
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.
Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads have retry logic, so it isn't necessary for the nodes returned to be current.
We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to roll back to Postgres for the repair queue. A couple of things contributed to this problem:
1) crdb doesn't support `FOR UPDATE SKIP LOCKED`
2) the original crdb Select query was doing 2 full table scans and not using any indexes
3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which runs up to 5 goroutines, all trying to select out of the repair queue, and this was causing a ton of contention.
The changes in this PR should help to reduce that contention and improve performance on CRDB.
The changes include:
1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB.
- Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/).
2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability.
- While long term it is very important to `ORDER BY segment_health`, the change here is only supposed to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possibly in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to a RabbitMQ priority queue or something similar.
This PR's improved query uses the index to avoid full scans and also locks the row it's going to update, and CRDB retries for us if there are any lock errors.
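A sketch of the update/set claim pattern (column names and the retry window are illustrative; UPDATE ... LIMIT is a CockroachDB extension):
```
package main

import (
	"context"
	"database/sql"
)

// claimSegment claims one repair queue item in a single statement: the
// UPDATE takes an implicit row lock on CRDB v20.2.2+ and RETURNING
// hands back the claimed row, with no ORDER BY to force a table scan.
func claimSegment(ctx context.Context, db *sql.DB) *sql.Row {
	return db.QueryRowContext(ctx, `
		UPDATE injuredsegments SET attempted = now()
		WHERE attempted IS NULL OR attempted < now() - interval '6 hours'
		LIMIT 1
		RETURNING data
	`)
}
```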
Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23
We need to be able to list all buckets in the DB without knowing the project ID.
This method will be used to list buckets for the metainfo loop
implementation based on metabase.
Change-Id: Iac75af0eee4f31e80a15577575a8249cbca787b2
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.
This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.
Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
Which database is accessed and how it internally does migrations is an
implementation detail and does not belong in the requirements interface.
Change-Id: Ia4a6994f39470063a96a8e5f3a1bd27aa79fe5cd
these are unchanged from storj.io/dbx.
we're importing them because in a later commit we
will change them, and it'd be nice to see that
diff as a separate commit.
Change-Id: I8315130ed6bab397bd65b9a1a90c29d130b8c02d
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.
Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.
Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.
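A minimal sketch of the batched delete loop (the expires_at filter is an assumption; DELETE ... LIMIT is a CockroachDB extension):
```
package main

import (
	"context"
	"database/sql"
)

// deleteExpiredSerials deletes in batches so no single statement
// touches more than batchSize rows; it loops until a batch comes
// back short.
func deleteExpiredSerials(ctx context.Context, db *sql.DB, batchSize int) error {
	for {
		res, err := db.ExecContext(ctx,
			`DELETE FROM serial_numbers WHERE expires_at < now() LIMIT $1`, batchSize)
		if err != nil {
			return err
		}
		n, err := res.RowsAffected()
		if err != nil {
			return err
		}
		if n < int64(batchSize) {
			return nil
		}
	}
}
```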
Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some tests which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.
Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
This fixes a slow query that was taking up to 4 seconds in production
SELECT node_id, path, piece_num, root_piece_id, durability_ratio, queued_at, requested_at, last_failed_at, last_failed_code, failed_count, finished_at, order_limit_send_count
FROM graceful_exit_transfer_queue
WHERE node_id = '[redacted]'
AND finished_at is NULL
AND last_failed_at is NULL
ORDER BY durability_ratio asc, queued_at asc LIMIT 300 OFFSET 0;
Change-Id: Ib89743ca35f1d8d0a1456b20fa08c683ebdc1549
Use a tagsql.DB pointer as the step database, to propagate changes
back and forth between the actual database and the migration.
Adds a CreateDB operation to the migration step to be able to
create new dbs before executing the migration action.
Adjusts the storagenode database migration to use the inner tagsql.DB
pointer of each database as step.DB.
Adjusts the satellite database migration: adds a proxy migrationDB field
to the satellite db that wraps itself as tagsql.DB, a pointer to which
is used as step.DB.
Change-Id: Ifed4de5b01a356cf7b37db64d2eaeb7b61982c5c
This change completes the column migration of
5f6fccc6e8 and
2f648fd981.
It resets the project limits of every user whose limits are at or below our
current production defaults.
Change-Id: Ie041d08bb67b62844f6023190fc00bc2dad5b1cb
Doing it at the ProcessOrders level was insufficient: the endpoints
make multiple database calls. It was a misguided attempt to only
have one spot enter the semaphore. By putting it in the endpoint
we can not only be sure that the concurrency is correctly limited,
but also make it easily configurable.
Change-Id: I937149dd077adf9eb87fce52a1a17dc0afe96f64
we have thundering herds of order submissions that take all of the
database connections causing temporary periodic outages. limit
the amount of concurrent order processing to 2.
Change-Id: If3f86cdbd21085a4414c2ff17d9ef6d8839a6c2b
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.
However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.
Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.
Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.
solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.
Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
Currently Cockroach migration test is the most heavy with regards to
schema changes. This causes other tests to time out. This adds an
alternate cockroach instance that is used for migration tests.
Change-Id: I01fe9313527ff002f0bb0914dd52c3645b8eaf6d
This PR adds the following items:
1) an in-memory read-only cache that stores project limit info for projectIDs
This cache is stored in-memory since this is expected to be a small amount of data. In this implementation we are only storing in the cache projects that have been accessed. Currently for the largest Satellite (eu-west) there are about 4500 total projects. So storing the storage limit (int64) and the bandwidth limit (int64) would end up being about 200kb (including the 32-byte project ID) if all 4500 projectIDs were in the cache. So this all fits in memory for the time being. At some point it may not as usage grows, but that seems years out.
The cache is a read only cache. When requests come in to upload/download a file, we will read from the cache what the current limits are for that project. If the cache does not contain the projectID, it will get the info from the database (satellitedb project table), then add it to the cache.
The only time the values in the cache are modified is when either a) the project ID is not in the cache, or b) the item in the cache has expired (default 10 mins); then the data gets refreshed out of the database. This occurs by default every 10 mins. This means that if we update the usage limits in the database, that change might not show up in the cache for 10 mins, which means it will not be reflected to limit end users uploading/downloading files for that time period.
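A minimal sketch of such a read-through cache (the type names and loader signature are assumptions):
```
package main

import (
	"sync"
	"time"
)

// ProjectLimits holds the two cached values.
type ProjectLimits struct {
	Storage, Bandwidth int64
}

type cacheEntry struct {
	limits  ProjectLimits
	fetched time.Time
}

// LimitCache reads from the map first and falls back to the database
// loader on a miss or when the entry is older than the expiration.
type LimitCache struct {
	mu         sync.Mutex
	entries    map[string]cacheEntry
	expiration time.Duration // e.g. 10 minutes
	load       func(projectID string) (ProjectLimits, error)
}

func NewLimitCache(expiration time.Duration, load func(string) (ProjectLimits, error)) *LimitCache {
	return &LimitCache{entries: make(map[string]cacheEntry), expiration: expiration, load: load}
}

func (c *LimitCache) Get(projectID string) (ProjectLimits, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[projectID]; ok && time.Since(e.fetched) < c.expiration {
		return e.limits, nil
	}
	limits, err := c.load(projectID) // satellitedb projects table
	if err != nil {
		return ProjectLimits{}, err
	}
	c.entries[projectID] = cacheEntry{limits: limits, fetched: time.Now()}
	return limits, nil
}
```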
Change-Id: I3fd7056cf963676009834fcbcf9c4a0922ca4a8f
Our current endpoints bail on us if the column data is null. Thus we need
to take the intermediate step of setting the default to a fixed value, and
reset those with the following release.
It sets the default column value to our current config values of 50GB
for storage and bandwidth and 100 buckets, while still enabling the field to be nullable.
All 0 values are migrated to be the default as well, to ensure those users can
keep using their projects, since with the original change, 0 actually means 0.
Change-Id: I797be80ce2d2105091599dc1b3fc76f74336b66b
Currently we have no way to actually set one
of the following limits to 0 (meaning not usable):
- maxBuckets
- usageLimit
- bandwidthLimit
With the field being nullable,
NULL corresponds to the global default,
0 now actually means 0, and
a set value determines a custom limit.
Change-Id: I92bb77529dcbd0881ae8368921be9d246eb0919e
Another change which is part of the refactoring to replace the path parameter
(string/[]byte) with a key parameter (metabase.SegmentKey).
Change-Id: I617878442442e5d59bbe5c995f913c3c93c16928
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.
Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
metabaseSegmentKey` TransferQueueItem
We are unifying which name (and type) we use for the value
pointing to a segment. We want to use `key` instead of `path`.
The dedicated type `metabase.SegmentKey` was created for this purpose as well.
This change does refactoring around gracefulexit.
Change-Id: I90d51ff087b206179e61d5f1bc95f4709d76f917
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.
Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.
In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.
Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.
Change-Id: I032b60298840fc16e6ef831da750f2d57619a397