Until now, whenever audits were recorded we would try to delete
the node from containment just in case it exists. Since we now
want to treat segment repair downloads as audits, this would
erroneously remove nodes from containment, as repair does not go
through a Reverify step. With this changeset, (Batch)UpdateStats
will not remove nodes from containment. The Reverify method will
remove all necessary nodes from containment.
Change-Id: Iabc9496293076dccba32ddfa028e92580b26167f
Because of recent changes to how coupons for the free tier are handled
(see commit 4c0817bcfb), we no longer want all these $10
non-expiring coupons. After coupons are applied during invoice
generation, if a customer does not have any valid (non expired, non
consumed) coupons, a new promotional coupon is applied.
We could just wait for users to consume all $10 of the non-expiring
coupons, and the new promotional coupon would be applied for the
following billing cycle, but this gets tricky, because if in the final
month, the user is billed for $1 of usage, but only $0.5 of the $10
non-expiring coupon is remaining, the user will be charged for the
remaining $0.5. With the new promotional coupon of $1.65, expiring every
month, this would not be an issue.
So long story short, this commit migrates all non-expiring coupons to
expire within 2 billing periods (all existing non-expiring coupons in
prod were created in early April or later). That way, there is still
enough value in the $10 coupon that we don't have to worry about
customers exceeding it, and the coupons will expire. Then we'll
immediately apply the $1.65 coupon for the next month! And then
hopefully this unfortunate situation will come to a pleasant end.
Change-Id: I8a593948d8876c41a71d886b9a95d4e2c802b4f3
Replace GetProjectAllocatedBandwidth by GetProjectBandwidth which calculates
used bandwidth from allocated and settled bandwidth recorded in the
project_bandwidth_daily_rollups table.
For each day in the month, if the allocated bandwidth is expired, it uses the
settled bandwidth for computing used bandwidth.
Change-Id: Ife723c6d5275338f470619631acb25930d39ac3c
project_bandwidth_daily_rollups table
We want to calculate used bandwidth better so we need to calculate it
from allocated and settled bandwidth. To do this we need first populate
this new table.
https://storjlabs.atlassian.net/browse/PG-56
Change-Id: I308b737bf08ee48ce4e46a3605697ab2095f7257
Improve UpdateCheckIn on a contended row:
name old time/op new time/op delta
UpdateCheckInContended-100x-32 2.29s ±55% 0.17s ±61% -92.45% (p=0.008 n=5+5)
Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.
Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.
Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
Initially metabase was developed separately and it was useful to have a
separate environment flag for tests, however, it's more convenient to
use the same as rest of the testsuite.
Change-Id: Ia4d79be27ce5911cbae68d57cdf0b30f63459444
Currently we were duplicating code for AS OF SYSTEM TIME in several
places. This replaces the code with using a method on
dbutil.Implementation.
As a consequence it's more useful to use a shorter name for
implementation - 'impl' should be sufficiently clear in the context.
Similarly, using AsOfSystemInterval and AsOfSystemTime to distinguish
between the two modes is useful and slightly shorter without causing
confusion.
Change-Id: Idefe55528efa758b6176591017b6572a8d443e3d
Use the 'AS OF SYSTEM TIME' Cockroach DB clause for the Graceful Exit
(a.k.a GE) queries that count the delete the GE queue items of nodes
which have already exited the network.
Split the subquery used for deleting all the transfer queue items of
nodes which has exited when CRDB is used and batch the queries because
CRDB struggles when executing in a single query unlike Postgres.
The new test which has been added to this commit to verify the CRDB
batch logic for deleting all the transfer queue items of the exited
nodes has raised that the Enqueue method has to run in baches when CRDB
is used otherwise CRDB has return the error "driver: bad connection"
when a big a amount of items are passed to be enqueued. This error
didn't happen with the current test implementation it was with an
initial one that it was creating a big amount of exited nodes and
transfer queue items for those nodes.
Change-Id: I6a099cdbc515a240596bc93141fea3182c2e50a9
multiregion satellites have complex database connection strings
largely due to using a different backend for the repair queue than
cockroach.
billing stuff didn't work right with this.
Change-Id: Ie8759a8c47e71347c3a190abfc9d53945d7b8855
When audits are being recorded, we automatically add some SQL to remove
the node from the pending audits table in case it exists. They are
removed from pending audits even if the node was offline for the audit.
This is not the correct behavior.
Add statement to record audit results in reverify tests to ensure no
more false positives.
Change-Id: I186ae68bc5e7962ef6c5defbebc1d95e63596a17
The DBs of our production satellites have some indexes that we didn't
have in the migrations because at that time we weren't able to add them
because our migration test was not able to deal with Cockroach indexes
with the STORING clause.
We have recently modified the storj.io/private/dbutil/pgutil package to
support the CRDB STRORING clause, so we are adding the missing indexes
to our migrations for being able to have them if we have to recover a DB
from scratch or we deploy a new DB satellite.
Change-Id: I686ff84e5b4c02d9615f50fa531261363affefb8
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
The previously configured never-expiring coupon does not refill every
month. Eventually, even though it never expires, it will run out. This
commit makes several small changes to address this issue for the free
tier:
* Change the config for the promotional coupon to be $1.65 for 1 month
(the change from $10 to $1.65 is due to our recent pricing changes)
* Update PopulatePromotionalCoupons (PPC for brevity) to add promotional
coupons to users with expired and consumed coupons (all users with a
project and no active coupons should get a new coupon when PPC is called)
* Call PPC at the end of the `create-invoice-coupons` stage of invoice
generation - after current coupons are processed and expired/exhausted.
* Remove legacy admin functionality for PPC from satellite/console - we
do not currently use it, but if we did, it should be in satellite/admin
instead.
Change-Id: I77727b97bef972df32ebb23cdc05055827076e2a
For business accounts we need to track the sales contact.
It will be a question to business accounts during onboarding.
Change-Id: I8d101ce1b52091478dfb0ddd875e1cc717d765d3
Initially there were pkg and private packages, however for all practical
purposes there's no significant difference between them. It's clearer to
have a single private package - and when we do get a specific
abstraction that needs to be reused, we can move it to storj.io/common
or storj.io/private.
Change-Id: Ibc2036e67f312f5d63cb4a97f5a92e38ae413aa5
cache is really common variable and type name and we have already used
the package name alias in multiple places.
Change-Id: I6435785b7549b541d533de59ec94557b9bd11e04
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.
Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
When nodes check in for the very first time, if the satellite can't ping
them back, they are inserted into the nodes table with
last_contact_success of '0001-01-01 00:00:00+00'. If the stray nodes
chore runs before the node can fix their problem, they are DQd.
Solution: when DQing stray nodes, dont DQ where last_contact_success =
'0001-01-01 00:00:00+00'::timestamptz
Change-Id: I477a02d5ef85b2c930ed6b7d99a4d1995169bca8
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
The new default promotional coupon is $10/month, and doesn't expire.
This change also migrates the coupon.duration column over to the new
coupon.billing_periods, and switches to rely completely on
billing_periods.
Change-Id: Ic3341e9fa4040449bab5e66ca4ee2640b095cf3d
Currently we can have an error about duplicated entry while inserting into value_attributions table. This change is changing simple insert into insert that is doing nothing on conflict.
Change-Id: I3efd8dc0b63115e8e2ed8f4196ccf969ee942295
* Add a nullable billing_periods column in the coupons table
* Add nullable billing_periods column to the currently unused
coupon_codes table
* Drop the duration column from the coupon_codes table
* Replace duration config type so that the default promotional coupon
can be configured to never expire
Zero downtime migration plan:
* Add billing_periods column to coupons and coupon_codes tables (this change)
* After one release, remove all references to the old duration column,
replacing with references to billing_periods. At this point, we can also
change the defult promotional coupon to never expire and migrate over
values from the old duration column.
* After another release, drop the duration column.
Change-Id: I374e8dc9fab9f81b4a5bc681771955662d4c007a
instead of only generating invoices for nodes that had some
activity, we generate it for every node so that we can find
and pay terminal nodes that did not meet thresholds before
we recognized them as terminal.
Change-Id: Ibb3433e1b35f1ddcfbe292c034238c9fa1b66c44
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.
If the flag is set to true, nodes will be suspended if they meet the
requirements.
If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.
Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
Previously we would select a limited number of nodes for DQ in a
CTE and run the update on that set in a single transaction. This
could lead to locking on the table, so instead we select and update
in separate transactions.
Change-Id: I1e802c0845e829eeadcee4fa382f58462515fdb1
This is one step for implementing the free tier:
* Change the default project limit from 10 to 3
* Move storage and bandwidth project usage limits from the metainfo
package to the console package (otherwise there is a cyclical
dependency, and metainfo doesn't use these values anyway)
* Change the default storage usage limit per project from 500gb to 50gb
* Change the default bandwidth usage limit per project from 500gb to 50gb
* Migrate the database so that old users and projects continue to have
the old defaults (10 projects/500gb usage)
Change-Id: Ice9ee6a738bc6410da18c336c672d3fcd0cab1b9
We would like to log Node IDs and last contact successes of nodes DQd
in this manner. We would also like to avoid returning an unbounded list
of items from the db. Therefore we change the query to select a limited
number of nodes that meet the DQ conditions and iterate until 0 rows are
returned. Each column of the query is already indexed.
Change-Id: Iaec2d9b56e7202b7c2028ba21750d40c8dd506ee
The coupon_codes table will allow for administrators to create new promo
codes associated with coupon information (amount, duration, etc...).
A user will be able to enter a promo code (aka coupon code) in order to
apply a new coupon to their account. The coupon in the coupons table is
linked to the template defined in the coupon_codes table.
Change-Id: I50e49fa92afbc6aa9d01d8a895c069efb59e472b
Migration step 148 will cause errors because we missed some
references to the columns being dropped. Removing the step
altogether causes problems with backwards compatibility tests
because the change already exists in the latest release tag.
To circumvent, we change v148 to an empty migration.
Add methods FindTable and RemoveColumn in private/dbutil/dbschema
Change-Id: Ia527e95b88a88c5dc82800928ce6f8cfb879e334
cockroach is having problems with huge transactions and
having them complete before timeouts or whatever, so
do smaller transactions.
because we can have partial recording of payments which
are not unique, we have to do a thing where we read and
check if it already exists before writing. this is not
concurrency safe.
Change-Id: Ia7d59499a43ce6d70cb2a23754edbdd1b643ef1a
The new 'consistency ge-cleanup-orphaned-data' cli command deleted
orphaned transfer queue items, but not entries in the
graceful_exit_progress table. This will delete orphaned entries
from the exit progress table too.
Change-Id: I5f927aac1f258490678deaf179be92ccfe10fcd8
Turns out that many methods generated with dbx where not used at all.
Lets remove them.
As a next step we can think about dropping tables like:
* user_credit
* offer
Change-Id: Id6cda81a701348db2a6b8b26daa22ae9c4f87cb4
Pregenerate the database schema we should use for most tests.
Currently, Cockroach is slow with regards to migration and it's
better if it happens in as few transactions as possible.
This reduces test time from ~21min to ~15min.
Change-Id: Ife8117053e6b9ecf3c93fe63677edf15d4d7c254
Testing interfaces is slightly clearer when it's in the package needing
the database rather than each individual implementation.
Change-Id: I10334c214a205f7e510b939b4359a2214c4e060a
This PR removes all back-end related referral program code including the
marketing portal.
We will have a separate PR for front-end code and database migration to
drop `offers` and `usercredits` table
Change-Id: If59f952cddfe0558a7dc03a0eac7cc1081517f88
These changes are independently tracked on
https://github.com/storj/storj/tree/jt/migration-reorder
The point of this is to make the distributed column
migration, needed for SNO invoice generation, the very
next one, so we can release it as a point release.
Change-Id: I26e1c03629c4f079b9ad12485e2b71a715d82b3b
Limit bucket name lookup to date range of the calling methods since we only need distinct bucket names for that time period.
Adds new index and removes an index specific to project ID since it is no longer needed.
Change-Id: Ic07bbfb1c32280e0c0e39f8da020b284e1e5d974
Fix an issue due to copy-paste problem that made that the Graceful Exit
test to be flaky.
The test uses a time created at the beginning of the test for avoiding
to get undeterministic time differences due to the fact of the response
time variation by the DB queries, however some part of the test were
using a current time rather than this base time, so they have been
addressed.
Change-Id: I4786f06209e041269875c07798a44c2850478438
Delete satellite order methods and DB tables which aren't used anymore
after we have done a refactoring on the orders to stuck bucket
information in the orders' encrypted metadata.
There are also configuration parameters and a satellite chore that
aren't needed anymore after the orders refactoring.
Change-Id: Ida3682b95921df70792284b42c96d2508bf8ca9c
Add a command to the satellite for cleaning up the Graceful Exit (a.k.a
GE) transfer queue items of nodes that have exited.
The commit adds to the GE satellite DB a couple of new methods, and its
corresponding test, for performing the operations of the new command.
Change-Id: I29a572a59689d63b24990ac13c52e76d65aaa917
using redash i manually checked that the only times the sum of
the payments does not match the paid column is for 2020-12 and
if it does not match then there are no payments.
Change-Id: I71ce0571de7e38e21548d7d6757b25abc3bfa781
The rollup archiver chore moves bucket bandwidth rollups and
storagenode rollups that are older than a given duration
to two new archive tables.
Change-Id: I1626a3742ad4271bc744fbcefa6355a29d49c6a5
This index is obsolete and duplicates a similiar (project_id, name)
index on the same table.
Moreover, it might confuse CockroachDB which of the two index to use,
which may might affect DB performance.
Change-Id: If8d1df8347714942cea9dca82864ba5f4973bed3
When we observed the value for total piecesizes stored in the network,
we were doing it after converting them to byte-hours, rather than using
the actual piece sizes. This fixes that issue.
Change-Id: I1564d21b519f70eb59f298d97dbd777baf127723
Add ProjectsCursor type for pagination
Add PageCount, CurrentPage, and TotalCount ProjectsPage
This allows us to mimic the logic of GetBucketTotals and the
implementation of BucketUsages in graphql for the new ProjectsByOwnerID
functionality.
Change-Id: I4e1613859085db65971b44fcacd9813d9ddad8eb
Full scope:
private/testplanet,satellite/{overlay,satellitedb}
Description:
In most cases, downtime tracking with audits will eventually lead
to DQ for nodes who are unresponsive. However, if a stray node has no
pieces, it will not be audited and will thus never be disqualified.
This chore will check for nodes who have not successfully been contacted
in some set time and DQ them.
There are some new flags for toggling DQ of stray nodes and the timeframes
for running the chore and how long nodes can go without contact.
Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615
This is the first step in the removal of uptime columns on the
nodes table. These columns are no longer used:
uptime_success_count
total_uptime_count
uptime_reputation_alpha
uptime_reputation_beta
In order to avoid breaking backwards compatibility, we need to
remove all references to these columns before removing the columns
themselves from the database. However, since uptime_success_count
and total_uptime_count are NOT NULLABLE, we can't remove them from
the insert statements in the overlay. So we can't remove the columns
because of the references, and we can't remove the references because
the columns can't be null. What a pickle. To remedy this, we will set a
default on the columns. Then we should be able to remove them from the
insert statements
Change-Id: I75f6c56fb7897835bbf29869f86f39de1d9dd345
GetSuccessfulNodeNotCheckedInSince and GetOfflineNodesLimited are overlay methods
which were only used by the previous downtime tracking system which has been removed.
These methods should also be removed.
Change-Id: Idb829d742e1f987e095604423fff656fe581183e
* Separate audit history interface into its own file in the overlay
package
* Add overlay.AuditHistory struct so that internalpb.AuditHistory is
only used from within the database layer
* Add overlay.GetAuditHistory function for features that will require
access to detailed audit history information
* Do not return full audit history from UpdateAuditHistory - callers to
that function only need to know the online score and whether a full
tracking period has been completed
* Move audit history tests out of satellite/satellitedb, since they are
independent of database implementation
Change-Id: I35b0c4ac23bbaabd80624f8a9631c3cb1a1f33bd
Now that the deprecated downtime tracking service is removed
(3fc76f4ffe), we can safely remove
the nodes_offline_times table.
Change-Id: Ia7c6efe32ba104dff5a830af5f2beee3337eefe5
Nodes which are offline_suspended will no longer be considered for new
uploads. The current threshold that enters a node into offline
suspension is 0.6. Disqualification for offline suspension is still
disabled.
Change-Id: I0da9abf47167dd5bf6bb21e0bc2186e003e38d1a
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to rollback to Postgres for the repair queue. A couple things contributed to this problem:
1) crdb doesn't support `FOR UPDATE SKIP LOCKED`
2) the original crdb Select query was doing 2 full table scans and not using any indexes
3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which run up to 5 goroutines which all are trying to select out of the repair queue and this was causing a ton of contention.
The changes in this PR should help to reduce that contention and improve performance on CRDB.
The changes include:
1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB.
- Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/).
2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability.
- While long term it is very important to `ORDER BY segment_health`, the change here is only suppose to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possible in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to rabbit MQ priority queue or something similar..
This PRs improved query uses the index to avoid full scans and also locks the row its going to update and CRDB retries for us if there are any lock errors.
Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23
We need to be able to list all buckets in DB without knowing project ID.
This method will be used to list buckets for metainfo loop
implementation based on metabase.
Change-Id: Iac75af0eee4f31e80a15577575a8249cbca787b2
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.
This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.
Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
Which database access and how it internally does migrations is an
implementation detail and does not belong in the requirements interface.
Change-Id: Ia4a6994f39470063a96a8e5f3a1bd27aa79fe5cd
these are unchanged from storj.io/dbx.
we're importing them because in a later commit we
will change them, and it'd be nice to see that
diff as a separate commit.
Change-Id: I8315130ed6bab397bd65b9a1a90c29d130b8c02d
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.
Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.
Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.
Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b