Commit Graph

1537 Commits

Author SHA1 Message Date
Jessica Grebenschikov
0649d2b930 satellite/repair: improve contention for injuredsegments table on CRDB
We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to rollback to Postgres for the repair queue. A couple things contributed to this problem:
1) crdb doesn't support `FOR UPDATE SKIP LOCKED`
2) the original crdb Select query was doing 2 full table scans and not using any indexes
3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which run up to 5 goroutines which all are trying to select out of the repair queue and this was causing a ton of contention.

The changes in this PR should help to reduce that contention and improve performance on CRDB.
The changes include:
1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB.
- Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/).

2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability.
- While long term it is very important to `ORDER BY segment_health`, the change here is only suppose to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possible in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to rabbit MQ priority queue or something similar..

This PRs improved query uses the index to avoid full scans and also locks the row its going to update and CRDB retries for us if there are any lock errors.

Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23
2020-12-10 09:51:26 -08:00
Michal Niewrzal
c2a97aeb14 satellite/satellitedb: add ListAllBuckets method
We need to be able to list all buckets in DB without knowing project ID.
This method will be used to list buckets for metainfo loop
implementation based on metabase.

Change-Id: Iac75af0eee4f31e80a15577575a8249cbca787b2
2020-12-10 14:19:27 +00:00
Stefan Benten
494bd5db81
all: golangci-lint v1.33.0 fixes (#3985) 2020-12-05 17:01:42 +01:00
Ethan Adams
f90ea10a4a
Allow for DB application names per process. (#3983) 2020-12-04 11:24:39 +01:00
Moby von Briesen
d75e4be11f satellite/{accounting, contact}: Remove periods and spaces from metrics.
Change-Id: I84179c2931293e3a1eb0ff8050416d25e481ce07
2020-12-03 15:33:01 +00:00
Moby von Briesen
3fc76f4ffe satellite/downtime: Remove deprecated downtime tracking service.
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.

This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.

Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
2020-12-02 15:16:13 -05:00
JT Olio
1728c3a992 satellite/dbx: standardize on assignment
Change-Id: I8f87bc8391e765e4480b0590d92d3601248e1f93
2020-12-01 16:10:18 +00:00
Jessica Grebenschikov
b261110352 satellite/orders: get bucketID from encrypted metadata in order instead of serial_numbers table
We want to stop using the serial_numbers table in satelliteDB. One of the last places using the serial_numbers table is when storagenodes settle orders, we look up the bucket name and project ID from the serial number from the serial_numbers table.

Now that we have support to add encrypted metadata into the OrderLimit, this PR makes use of that and now attempts to read the project ID and bucket name from the encrypted orderLimit metadata instead of from the serial_numbers table. For backwards compatibility and to ensure no errors, we will still fallback to the old way of getting that info from the serial_numbers table, but this will be removed in the next release as long as there are no errors.

All processes that create orderLimits must have an orders.encryption-keys set. The services that create orderLimits (and thus need to encrypt the order metadata) are the satellite apiProcess, the repair process, audit service (core process), and graceful exit (core process). Only the satellite api process decrypts the order metadata when storagenodes settle orders. This means that the same encryption key needs to be provided in the config for the satellite api process, repair process, and the core process like so:
orders.include-encrypted-metadata=true
orders.encryption-keys="<"encryptionKeyID>=<encryptionKey>"

Change-Id: Ie2c037971713d6fbf69d697bfad7f8b672eedd66
2020-12-01 15:29:32 +00:00
JT Olio
70b91aac54 satellitedb: remove cruft caused by https://review.dev.storj.io/c/storj/storj/+/3223
Change-Id: I198bb2f869cc7177b9ecafdd8932bbf2b58be5b8
2020-12-01 00:16:26 +00:00
Yingrong Zhao
d8ba7b3057 satellite/console: only allow project member to get all bucket names
Change-Id: I8ceb0b7eb19e221072b4ff3411a4ec1a7817d16f
2020-11-30 15:41:35 -05:00
Egon Elbre
f456d7ce03 satellite: remove implementation detail from DB interface
Which database access and how it internally does migrations is an
implementation detail and does not belong in the requirements interface.

Change-Id: Ia4a6994f39470063a96a8e5f3a1bd27aa79fe5cd
2020-11-30 13:29:20 +02:00
Egon Elbre
28ea63be92 satellite/repair: avoid TestDBAccess
Change-Id: I34adb58cd67fba5917032f2f328d75b1c4afdbbf
2020-11-30 13:29:08 +02:00
VitaliiShpital
0771cdb0b1 web/satellite: create access grant: generate gateway credentials step
WHAT:
generate gateway credentials step for create access grant flow

WHY:
part of the flow

Change-Id: I6496712b43f78a818ba0582b586cfae3a44683e6
2020-11-30 10:36:29 +00:00
VitaliiShpital
bb7677a85f web/satellite: get gateway credentials request using url from config
WHAT:
POST request to get gateway credentials using access grant.
Put request url to config and use it for request.

WHY:
to show gateway credentials on UI

Change-Id: I15ef43ecdeed69b0961d5796aacb47f36d560b1b
2020-11-30 10:36:23 +00:00
Michal Niewrzal
21602e0494 satellite/metainfo: enable commented test
Test was commented to make uplink refactoring possible. Now we can bring
back this test.

Change-Id: I0511b76073efaafed8aac97f8e845dcec93dd059
2020-11-30 10:49:23 +01:00
JT Olio
71e11b27f3 satellite/dbx: only retry with cockroach
Change-Id: Id3630c26dbfda36dcbece2849e2353d5ab2882af
2020-11-29 18:10:07 -07:00
JT Olio
bd23d12bb9 satellite/dbx: add cockroach retries for other QueryContext operations
Change-Id: Ia30fbba55c926892702fa96fb9dd01b75347d351
2020-11-29 18:09:56 -07:00
JT Olio
ea2f39ca7f satellite/dbx: add retries for QueryRowContext-based operations
Change-Id: Ie2527b673dd4ce5250cf5c0cbf8f14921262f665
2020-11-29 18:09:46 -07:00
JT Olio
d3b0691bbd satellite/dbx: import dbx templates
these are unchanged from storj.io/dbx.

we're importing them because in a later commit we
will change them, and it'd be nice to see that
diff as a separate commit.

Change-Id: I8315130ed6bab397bd65b9a1a90c29d130b8c02d
2020-11-29 18:09:33 -07:00
JT Olio
5d8a67a4f7 satellitedb: retry GetBandwidthSince on cockroach
Change-Id: I2bf20f3a19e7f3af97630d8a679410feba70661e
2020-11-29 16:36:15 -07:00
Ethan
5dc013d3bd satellite/overlay: Add retry to all selects in overlaycache
Change-Id: I0356d71a35701f8e0ca04a34b2bb2aea666c1394
2020-11-29 16:46:57 -05:00
JT Olio
6bce907cb0 satellite: try to stream rollups to aggregation function to use less memory
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.

Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
2020-11-29 10:26:32 -07:00
JT Olio
6aae21541f satellitedb: do saverollup in batches
Change-Id: I78278a192cba60541eee2986f54a88d5a479bd3e
2020-11-28 19:26:46 -07:00
JT Olio
0ba516d405 satellite: support pointing db components at different databases
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.

Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
2020-11-28 18:39:16 +00:00
Moby von Briesen
75f0f713a3 satellite/repair/checker/checker.go: Use number of healthy pieces instead of SegmentHealth for injured segments queue.
We did not test the SegmentHealth function with actual production
values, and it turns out that values such as 52 healthy, 35 minimum
result in +Inf segment health - so pretty much all segments put into the
repair queue have the same health, which means we effectively aren't
sorting by health.

This change inserts numHealthy as segment health into the database so
the segments are ordered as they were before. We need to refine the
SegmentHealth function before we can support multi RS.

Change-Id: Ief19bbfee3594c5dfe94ca606bc930f05f85ff74
2020-11-28 12:16:32 -05:00
Ivan Fraixedes
7eb3b2d6d0
satellite/gc: Init map with an aprox size
Because the PieceTracker receives a piece count per nodes which is an
approximation of the number of nodes that they are going to be reported
by the metainfo loop so we can use as a good guess of the map's size and
initialized with it.

Change-Id: I644db40926c03e4c457457fb41d2ec1da059cea6
2020-11-27 10:44:19 +01:00
Ivan Fraixedes
319d2cad11
satellite: Fix typo in a comment
Change-Id: I151b824e868db1cc1e8b8e8af9f35b027db1e6ff
2020-11-26 15:44:49 +01:00
Michal Niewrzal
8ceef9f357 satellite/metainfo: temporary disable one assertion in test
This is need to merge https://review.dev.storj.io/c/storj/uplink/+/3208
, after that this code will be back.

Change-Id: If9f2f1db95c7a1bba64a41c45a39bd3096a519e7
2020-11-25 13:21:41 +00:00
Egon Elbre
3792e2921c satellite/accounting/tally: make test less fragile
MetadataSize can slightly vary and checking for exact value makes
difficult to change what's being encoded in metadata.

Change-Id: I5f1ade41bc26d115e6743367ee35cf1ba74795c9
2020-11-25 13:33:24 +02:00
Kaloyan Raev
53b7fd7b00 satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified
We now have the piece hashes verified for all segments on all production
satellites. We can remove the code that handles the case where piece
hashes are not verified. This would make easier the migration of
services from PointerDB to the new metabase.

For consistency, PieceHashesVerified is still set to true in PointerDB
for new segments.

Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183
2020-11-24 11:09:48 +02:00
Egon Elbre
9de1617db0 satellite/orders: ensure encryption keys handles set twice
Currently flag parsing seems to call Set twice, which causes problems
with encryption keys. We can clear for every set for now.

Change-Id: Id5c695b4020194ac1c50a2da9c7d2a896cb9216f
2020-11-23 19:47:22 +00:00
Moby von Briesen
575f50df84 satellite/repair: Update repair override config to support multiple RS schemes.
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.

The new format for RS override values is
"k/o/n-override,k/o/n-override..."

Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
2020-11-23 18:01:15 +00:00
Egon Elbre
55d5e1fd7d satellite/orders: ensure that expired deletion doesn't stall
Add checks to ensure that when somebody uses empty options, the deletion
doesn't loop infinitely.

Change-Id: I1738fb1e7e1f8efbbb954c491cb6489f7bcdc2db
2020-11-23 14:52:40 +02:00
Jessica Grebenschikov
5beb2f5737 satellite/orders: add factory function to encryption key
Change-Id: I9a1020c63e4ebc6d73683cf1749366e9b9f20f07
2020-11-20 11:40:15 -08:00
Ethan
2b92bba563 satellite/satellitedb/orders: Handle serial_numbers deletes in smaller increments on CRDB
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours.  This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.

Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
2020-11-20 13:44:52 +00:00
Moby von Briesen
a8b66dce17 satellite/accounting: account for old orders that can be submitted in satellite rollup
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.

A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.

Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
2020-11-18 14:46:00 -05:00
Egon Elbre
aeb801604e {satellite,storagenode}/orders: fix flaky tests
Before manipulating order information on storagenodes we need to wait
for the orders to propagate to the database. Some of that happens
async with uplink.

Change-Id: Iaacfd7db0909ab5d2831d06388e5fb27b6d4778f
2020-11-18 13:44:02 +00:00
paul cannon
2b59640f18 cmd/satellite: ignore Canceled in exit from repair worker
Firstly, this changes the repair functionality to return Canceled errors
when a repair is canceled during the Get phase. Previously, because we
do not track individual errors per piece, this would just show up as a
failure to download enough pieces to repair the segment, which would
cause the segment to be added to the IrreparableDB, which is entirely
unhelpful.

Then, ignore Canceled errors in the return value of the repair worker.
Apparently, when the worker returns an error, that makes Cobra exit the
program with a nonzero exit code, which causes some piece of our
deployment automation to freak out and page people. And when we ask the
repair worker to shut down, "canceled" errors are what we _expect_, not
an error case.

Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435
2020-11-17 21:37:59 +00:00
Moby von Briesen
0ec685b173 satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.

This change just adds the new column in a migration. A separate change
will add the new health function.

Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.

Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.

Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
2020-11-16 21:18:09 +00:00
VitaliiShpital
51a712f9e8 satellite/console: get all bucket names endpoint and service method
WHAT:
new endpoint for fetching all bucket names

WHY:
used by new access grant flow

Change-Id: I356a3381359665fd2726120139b34b1e611fe3c4
2020-11-16 17:51:40 +02:00
Jessica Grebenschikov
f558cc825e satellite/orders: add storagenode_bw_phase2 table and dont delete tallies for longer
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.

This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.

These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.

Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
2020-11-13 17:15:24 +00:00
Yaroslav Vorobiov
1b4bfbb9d2 multinode/console: nodes addition and removal
Change-Id: I60c685953a8d0e24f78b1414c34a28d4b87863b0
2020-11-12 20:26:08 +02:00
Jessica Grebenschikov
226e13e616 satellite/cosole: add tests for wasm access code
Change-Id: I78f71b2f0bef03b6e87cd7d79ccaef5f45393b55
2020-11-12 08:03:36 -08:00
paul cannon
3e56403599 satellite/repair: add a repair health function
This will be used to rank segments in need of repair for attention by
the repair workers.

Change-Id: I5b70650cec933696b4c6d73bb7efb97e3efdf24a
2020-11-11 18:48:51 +00:00
Jeff Wendling
31533ed1a1 satellite/console/wasm: remove storj.io/uplink deependency
Change-Id: Iee95389e4ba24618e31aff7be44d05377b2e2419
2020-11-11 16:51:14 +00:00
Cameron Ayer
da9f1f0611 satellite/repair: add monkit counter for segments below minimum required
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.

Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
2020-11-11 12:48:23 +00:00
Yingrong Zhao
2ce3170bb4 satellite/console/wasm: expose method to add caveats in the browser
This PR does the following three things:
    1. Defines a high-level interface for this wasm package
        - All return value from this package will be wrapped with an
          result object that contains a value field and an error field
    2. Exposes two new functions to allow users to add permissions for a
       given API key
        - newPermission()
        - setAPIKeyPermission()
    3. Adds API documentation for the newly added API functions

Change-Id: Id995189702b369bba18fa344bef4ddfb0f3f1f44
2020-11-10 20:10:53 +00:00
Brandon Iglesias
3ba52b25a9
satellite/rewards: update partners to include MAXN 2020-11-10 14:08:32 +02:00
Moby von Briesen
db6bc6503d satellite/metainfo: Update metainfo RS config to more easily support multiple RS schemes.
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.

RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).

Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
2020-11-09 22:16:13 +00:00
Cameron Ayer
d63b7658e8 satellite/repair: fix lastSeenSegmentKey bug in IrreparableProcess
A change was made to use a metabase.SegmentKey (a byte slice alias)
as the last seen item to iterate through the irreparable DB in a
for loop. However, this SegmentKey was not initialized, thus it was
nil. This caused the DB query to return nothing, and healthy segments
could not be cleaned out of the irreparable DB.

Change-Id: Idb30d6fef6113a30a27158d548f62c7443e65a81
2020-11-09 14:48:15 +00:00