Commit Graph

606 Commits

Author SHA1 Message Date
JT Olio
71e11b27f3 satellite/dbx: only retry with cockroach
Change-Id: Id3630c26dbfda36dcbece2849e2353d5ab2882af
2020-11-29 18:10:07 -07:00
JT Olio
bd23d12bb9 satellite/dbx: add cockroach retries for other QueryContext operations
Change-Id: Ia30fbba55c926892702fa96fb9dd01b75347d351
2020-11-29 18:09:56 -07:00
JT Olio
ea2f39ca7f satellite/dbx: add retries for QueryRowContext-based operations
Change-Id: Ie2527b673dd4ce5250cf5c0cbf8f14921262f665
2020-11-29 18:09:46 -07:00
JT Olio
d3b0691bbd satellite/dbx: import dbx templates
these are unchanged from storj.io/dbx.

we're importing them because in a later commit we
will change them, and it'd be nice to see that
diff as a separate commit.

Change-Id: I8315130ed6bab397bd65b9a1a90c29d130b8c02d
2020-11-29 18:09:33 -07:00
JT Olio
5d8a67a4f7 satellitedb: retry GetBandwidthSince on cockroach
Change-Id: I2bf20f3a19e7f3af97630d8a679410feba70661e
2020-11-29 16:36:15 -07:00
Ethan
5dc013d3bd satellite/overlay: Add retry to all selects in overlaycache
Change-Id: I0356d71a35701f8e0ca04a34b2bb2aea666c1394
2020-11-29 16:46:57 -05:00
JT Olio
6bce907cb0 satellite: try to stream rollups to aggregation function to use less memory
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.

Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
2020-11-29 10:26:32 -07:00
JT Olio
6aae21541f satellitedb: do saverollup in batches
Change-Id: I78278a192cba60541eee2986f54a88d5a479bd3e
2020-11-28 19:26:46 -07:00
JT Olio
0ba516d405 satellite: support pointing db components at different databases
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.

Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
2020-11-28 18:39:16 +00:00
Egon Elbre
55d5e1fd7d satellite/orders: ensure that expired deletion doesn't stall
Add checks to ensure that when somebody uses empty options, the deletion
doesn't loop infinitely.

Change-Id: I1738fb1e7e1f8efbbb954c491cb6489f7bcdc2db
2020-11-23 14:52:40 +02:00
Ethan
2b92bba563 satellite/satellitedb/orders: Handle serial_numbers deletes in smaller increments on CRDB
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours.  This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.

Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
2020-11-20 13:44:52 +00:00
Moby von Briesen
a8b66dce17 satellite/accounting: account for old orders that can be submitted in satellite rollup
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.

A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.

Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
2020-11-18 14:46:00 -05:00
Moby von Briesen
0ec685b173 satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.

This change just adds the new column in a migration. A separate change
will add the new health function.

Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.

Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.

Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
2020-11-16 21:18:09 +00:00
Jessica Grebenschikov
f558cc825e satellite/orders: add storagenode_bw_phase2 table and dont delete tallies for longer
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.

This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.

These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.

Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
2020-11-13 17:15:24 +00:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Egon Elbre
7183dca6cb all: fix defers in loop
defer should not be called in a loop.

Change-Id: Ifa5a25a56402814b974bcdfb0c2fce56df8e7e59
2020-11-02 15:06:38 +02:00
Egon Elbre
11338e9beb satellite/internalpb: move audithistory.pb
Change-Id: I8eee84d49ed90459168ddaf04ae57f790c2a22c4
2020-10-30 15:30:11 +02:00
Egon Elbre
7ce372c686 satellite/internalpb: add inspectors
Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a
2020-10-30 13:28:17 +02:00
Egon Elbre
004e610d0f satellite/internalpb: move datarepair.pb to internal
Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d
2020-10-30 13:28:14 +02:00
Egon Elbre
caefde6b32 private/{dbutil,tagsql}: pass ctx to database opening
Database opening usually dial and hence we should pass ctx to them.

Change-Id: Iaa2875981570d83e65be3710f841cf30349f807b
2020-10-29 10:51:29 +00:00
Egon Elbre
e3985799a1 storage/{cockroachkv,postgreskv}: add ctx to opening
Database opening usually dial and hence we should pass ctx to them.

Change-Id: Iecf41241aaa94d54506cbc80b0e53449848d8819
2020-10-29 10:49:08 +00:00
Egon Elbre
9b2e00a38b satellite: pass ctx into satellitedb.Open
Opening a database requires ctx, this is first step to passing ctx
to the appropriate level.

Change-Id: Ic303e69f868ef3449ae36377937a29670cf635e2
2020-10-29 06:38:37 +00:00
Cameron Ayer
bb7be23115 satellite/{audit,overlay,satellitedb}: enable reporting offline audits
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.

Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
2020-10-27 10:44:46 +00:00
Ethan
9a29ec5b3e Add index to graceful_exit_transfer_queue table
This fixes a slow query that was taking up to 4 seconds in production

SELECT node_id, path, piece_num, root_piece_id, durability_ratio, queued_at, requested_at, last_failed_at, last_failed_code, failed_count, finished_at, order_limit_send_count
	FROM graceful_exit_transfer_queue
	WHERE node_id = '[redacted]'
	AND finished_at is NULL
	AND last_failed_at is NULL
	ORDER BY durability_ratio asc, queued_at asc LIMIT 300 OFFSET 0;

Change-Id: Ib89743ca35f1d8d0a1456b20fa08c683ebdc1549
2020-10-26 14:47:48 +00:00
Moby von Briesen
7c3afe164b satellite/overlay: uncomment dq for offline and disable with feature flag
Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27
2020-10-16 12:55:16 +00:00
Yaroslav Vorobiov
139a7ee959 private/migrate: add ablity to create dbs during migration
Use tagsql.DB pointer as step database, to propagate changes
back and forth between actual database and migration.
Adds CreateDB operation to the migration step to be able to
create new dbs before executing migration action.
Adjusts storagenode database migration to use inner tagsql.DB
pointer of each database as step.DB.
Adjusts satellite dabase migration, adds proxy migrationDB field
to satellite db that wraps itself as tagsql.DB, pointer of which
is used as step.DB.

Change-Id: Ifed4de5b01a356cf7b37db64d2eaeb7b61982c5c
2020-10-15 15:28:04 +03:00
Stefan Benten
0b43b93259 satellite/satellitedb: make limits per default NULL
This change completes the column migration of
5f6fccc6e8 and
2f648fd981.
It resets every users project limits who are below or equal to our
current production defaults.

Change-Id: Ie041d08bb67b62844f6023190fc00bc2dad5b1cb
2020-10-14 20:28:16 +00:00
Egon Elbre
2268cc1df3 all: fix linter complaints
Change-Id: Ia01404dbb6bdd19a146fa10ff7302e08f87a8c95
2020-10-13 15:59:01 +03:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
Jeff Wendling
0f0faf0a9f satellite/orders: do a better job limiting concurrent requests
Doing it at the ProcessOrders level was insufficient: the endpoints
make multiple database calls. It was a misguided attempt to only
have one spot enter the semaphore. By putting it in the endpoint
we can not only be sure that the concurrency is correctly limited
but it can be configurable easily.

Change-Id: I937149dd077adf9eb87fce52a1a17dc0afe96f64
2020-10-09 16:27:15 -04:00
Jeff Wendling
7c303208ff satellite/satellitedb: emergency temporary order processing semaphore
we have thundering herds of order submissions that take all of the
database connections causing temporary periodic outages. limit
the amount of concurrent order processing to 2.

Change-Id: If3f86cdbd21085a4414c2ff17d9ef6d8839a6c2b
2020-10-08 19:16:47 +00:00
Cameron Ayer
b39a99bae6 satellite/{overlay,satellitedb}: always show node's real online score
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.

However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.

Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.

Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
2020-10-02 12:28:11 -04:00
Cameron Ayer
c2525ba2b5 satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.

solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.

Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
2020-09-29 20:38:22 +00:00
Egon Elbre
c23a8e3b81 go.mod: update pgx to v4.9.0
Fix query to use TextArray instead of VarcharArray.
Fix queries to use the correct type.

Change-Id: Ibb7e55adba277d05778118d81ca697470e72c374
2020-09-29 19:03:08 +00:00
Egon Elbre
2d27bc8787 satellite/satellitedb: separate cockroach for migration tests
Currently Cockroach migration test is the most heavy with regards to
schema changes. This causes other tests to time out. This adds an
alternate cockroach instance that is used for migration tests.

Change-Id: I01fe9313527ff002f0bb0914dd52c3645b8eaf6d
2020-09-29 09:31:33 +00:00
Jessica Grebenschikov
4a2c66fa06 satellite/accounting: add cache for getting project storage and bw limits
This PR adds the following items:
1) an in-memory read-only cache thats stores project limit info for projectIDs

This cache is stored in-memory since this is expected to be a small amount of data. In this implementation we are only storing in the cache projects that have been accessed. Currently for the largest Satellite (eu-west) there is about 4500 total projects. So storing the storage limit (int64) and the bandwidth limit (int64), this would end up being about 200kb (including the 32 byte project ID) if all 4500 projectIDs were in the cache. So this all fits in memory for the time being. At some point it may not as usage grows, but that seems years out.

The cache is a read only cache. When requests come in to upload/download a file, we will read from the cache what the current limits are for that project. If the cache does not contain the projectID, it will get the info from the database (satellitedb project table), then add it to the cache.

The only time the values in the cache are modified is when either a) the project ID is not in the cache, or b) the item in the cache has expired (default 10mins), then the data gets refreshed out of the database. This occurs by default every 10 mins. This means that if we update the usage limits in the database, that change might not show up in the cache for 10 mins which mean it will not be reflected to limit end users uploading/downloading files for that time period..

Change-Id: I3fd7056cf963676009834fcbcf9c4a0922ca4a8f
2020-09-25 16:28:49 +00:00
Stefan Benten
38108828ac satellite/satellitedb: enable multiple projects existing users
Change-Id: I2ef77182d5464d72574698c8abfbbfdbda3f5a9e
2020-09-23 18:17:38 +02:00
Stefan Benten
5f6fccc6e8 satellite/satellitedb: makes limits nullable change backwards compatible
Our current endpoints bail on us, if the column data is null. Thus we need
to take the intermediate step and set the default to a fixed value and
reset those with the following release.

It sets the default column value to our current config values of 50GB
for storage and bandwidth and 100 buckets, while still enabling the field to be nullable.
All 0 values are migrated to be the default as well to ensure they can
keep using their projects, as with the original change, 0 actually means 0.

Change-Id: I797be80ce2d2105091599dc1b3fc76f74336b66b
2020-09-23 17:54:42 +02:00
Stefan Benten
2f648fd981 satellite: make limits be nullable
Currently we have no way to actually set one
of the following limits to 0 (meaning not usable):

- maxBuckets
- usageLimit
- bandwidthLimit

With having the field nullable,
NULL corresponds to the global default,
0 now actually 0 and
a set value determines a custom limit.

Change-Id: I92bb77529dcbd0881ae8368921be9d246eb0919e
2020-09-21 19:34:19 +00:00
Qweder93
8182fdad0b storagenode: heldamount renamed to payouts, renamed some methods and structs to more meaningful names. grouped estimated payout with pathouts
satellite: heldamount renamed to SNOpayouts.

Change-Id: I244b4d2454e0621f4b8e22d3c0d3e602c0bbcb02
2020-09-16 14:57:35 +00:00
Cameron Ayer
e7c34a053d satellite/satellitedb: add column and index "updated_at" to injuredsegments
Change-Id: I59e9bb2077885f09e17795375fe98ed31bd83d54
2020-09-14 12:53:04 -04:00
Michal Niewrzal
27a9d14e2a satellite/repair: use metabase.SegmentKey type in repair package
Another change which is a part of refactoring to replace path parameter
(string/[]byte) with key paramter (metabase.SegmentKey)

Change-Id: I617878442442e5d59bbe5c995f913c3c93c16928
2020-09-08 19:35:20 +00:00
Jennifer Johnson
4e2413a99d satellite/satellitedb: uses vetted_at field to select for reputable nodes
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.

Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
2020-09-04 16:45:32 +00:00
Michal Niewrzal
8649a00557 satellite/gracefulexit: replace Path []byte to `Key
metabaseSegmentKey` TransferQueueItem

We are unifying which name (and type) we are using for value we are
using to point to segment. We want to use `key` instead of `path`.
Dedicated type `metabase.SegmentKey` was created for this purposes also.
This change is doing refactoring around gracefulexit.

Change-Id: I90d51ff087b206179e61d5f1bc95f4709d76f917
2020-09-04 11:09:48 +00:00
Egon Elbre
dc48197bd8 satellite/orders: add bucket id to order limit
Change-Id: I9019ec77d692e62ac17b67a1da71dc3535cde50c
2020-09-03 10:50:11 +03:00
Michal Niewrzal
0604a672c1 satellite/metainfo: use metabase in loop
Change-Id: I1bb0c6fe0a762895fde950690b06f7dd9d77e178
2020-09-01 10:06:16 +00:00
Moby von Briesen
2d01dd9732 satellite/satellitedb: Add online_score column to nodes table
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.

Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
2020-08-31 15:07:07 +00:00
Moby von Briesen
60a95d0dc9 satellite/{satellitedb,overlay}: Enable offline suspension and review period
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.

In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.

Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.

Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
2020-08-28 16:35:48 +00:00
Bill Thorp
729079965f satellite/satellitedb : remove migation steps 69-102
Jenkins has been failing a lot lately due to test timeouts with CockroachDB.
TestMigrateCockroach previously took around 5 minutes, now it takes 2.

Why 103?  I couldn't get 100 to work due to an error w/ NOT NULL and PKs.

Change-Id: Iec95d4e25f9d6cd36920e7f43272c486a17fa879
2020-08-27 07:36:05 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00