A few weeks ago it was discovered that the segment health function
was not working as expected with production values. As a bandaid,
we decided to insert the number of healthy pieces into the segment
health column. This should have effectively reverted our means of
prioritizing repair to the previous implementation.
However, it turns out that the bandaid was placed into the code which
removes items from the irreparable db and inserts them into the repair
queue.
This change inserts the number of healthy pieces into the repair queue in
the RemoteSegment method as well.
Change-Id: Iabfc7984df0a928066b69e9aecb6f615253f1ad2
there were two changes to this package: one that modified
some things and renamed access.go to main.go, and one that
reduced the binary sizes against main.go.
somehow the latter change was included on this branch but
the former was not, which made the folder contain both
main.go and access.go. building the package then fails
with duplicate definitions.
the easiest fix is to remove access.go which makes the folder
match the contents of the current master branch.
Change-Id: I7070a0d25a0399cef3166b8b2189427bc55587e0
There is a new checker field called statsCollector. This contains
a map of stats pointers where the key is a stringified redundancy
scheme. stats contains all tagged monkit metrics. These metrics exist
under the key name, "tagged_repair_stats", which is tagged with the
name of each metric and a corresponding rs scheme.
As the metainfo observer works on a segment, it checks statsCollector
for a stats corresponding to the segment's redundancy scheme. If one
doesn't exist, it is created and chained to the monkit scope. Now we can call
Observe, Inc, etc on the fields just like before, and they have tags!
durabilityStats has also been renamed to aggregateStats.
At the end of the metainfo loop, we insert the aggregateStats totals into the
corresponding stats fields for metric reporting.
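A minimal sketch of the collector shape described above; the field and metric names are illustrative rather than the exact ones in this change, and the real code also chains the stats into the package's monkit scope:

```go
package checker

import monkit "github.com/spacemonkeygo/monkit/v3"

// stats bundles tagged monkit metrics for one redundancy scheme
// (illustrative subset of fields).
type stats struct {
	remoteSegmentsChecked *monkit.IntVal
	remoteSegmentsLost    *monkit.IntVal
}

// newStats creates metrics under "tagged_repair_stats", tagged with the
// metric name and the stringified rs scheme.
func newStats(rs string) *stats {
	return &stats{
		remoteSegmentsChecked: monkit.NewIntVal(monkit.NewSeriesKey("tagged_repair_stats").
			WithTag("name", "remote_segments_checked").WithTag("rs_scheme", rs)),
		remoteSegmentsLost: monkit.NewIntVal(monkit.NewSeriesKey("tagged_repair_stats").
			WithTag("name", "remote_segments_lost").WithTag("rs_scheme", rs)),
	}
}

// statsCollector lazily creates one stats bundle per redundancy scheme.
type statsCollector struct {
	stats map[string]*stats
}

func (c *statsCollector) getStatsByRS(rs string) *stats {
	s, ok := c.stats[rs]
	if !ok {
		s = newStats(rs)
		c.stats[rs] = s
	}
	return s
}
```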
Change-Id: I8aa1918351d246a8ef818b9712ed4cb39d1ea9c6
Make changes so that we only import the necessary files from the console package, keeping the generated wasm code as small as possible.
This change gets the compiled wasm code down to 8.6MB uncompressed and 2MB when compressed with `gzip --best`.
https://review.dev.storj.io/c/storj/storj/+/3396
Change-Id: Ifdd4be285810757b46bbbe43327c0d0139e5f8f7
Remove a declared variable that's set but never read nor passed to any
function, so it's unused code.
Change-Id: I8daf9d1f71d29ab39d7a80011d1b4813ada1c67d
We need to be able to update just remote_pieces column in DB. This is
needed at least for repair process.
Change-Id: I20dcc9b06babfefbbf102f32b1d14946379f26c2
It was designed to detect and remove zombie segments in the PointerDB.
This tool should no longer be relevant with the MetabaseDB.
Change-Id: I112552203b1329a5a659f69a0043eb1f8dadb551
We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was far too much contention for the injuredsegments table, so we had to roll back to Postgres for the repair queue. A couple of things contributed to this problem:
1) crdb doesn't support `FOR UPDATE SKIP LOCKED`
2) the original crdb Select query was doing 2 full table scans and not using any indexes
3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which run up to 5 goroutines which all are trying to select out of the repair queue and this was causing a ton of contention.
The changes in this PR should help to reduce that contention and improve performance on CRDB.
The changes include:
1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB.
- Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/).
2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability.
- While long term it is very important to `ORDER BY segment_health`, the change here is only supposed to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possibly in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602), or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or a migration to a RabbitMQ priority queue or something similar.
This PR's improved query uses the index to avoid full scans and also locks the row it's going to update, and CRDB retries for us if there are any lock errors.
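A rough sketch of what such a single-statement update/set claim can look like; the table, columns, and interval here are illustrative, not the exact query in this PR:

```go
package queue

import (
	"context"
	"database/sql"
)

// claimInjuredSegment claims one queue item in a single UPDATE so CRDB's
// implicit row locking applies; names and the retry interval are illustrative.
func claimInjuredSegment(ctx context.Context, db *sql.DB) ([]byte, error) {
	var data []byte
	err := db.QueryRowContext(ctx, `
		UPDATE injuredsegments
		SET attempted = now()
		WHERE path = (
			SELECT path FROM injuredsegments
			WHERE attempted IS NULL OR attempted < now() - interval '6 hours'
			LIMIT 1
		)
		RETURNING data`).Scan(&data)
	return data, err
}
```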
Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23
We need to be able to list all buckets in DB without knowing project ID.
This method will be used to list buckets for metainfo loop
implementation based on metabase.
Change-Id: Iac75af0eee4f31e80a15577575a8249cbca787b2
- TestBucketNameValidation
- TestBatch
- TestCommitObjectMetadataSize
- TestIDs
TestOverwriteZombieSegments is removed as not relevant to metabase.
Change-Id: I13cf5abe342089960628f185061303fd4f9d09a4
This also removes the
TestEndpoint_DeleteObjectPieces_ObjectWithoutLastSegment test case as it
does not seem relevant to metabase.
Change-Id: I06a0ecaa8232c10c15e433517a7ba056933bf858
We should set the client-requested maxParts to MaxListLimit if it is
greater than that value instead of returning an error.
MinIO's default value for maxParts is 10,000 while the satellite's
MaxListLimit is 1,000. If we return an error, ListParts calls with the
default maxParts will fail.
Change-Id: I06739e1d8d8f96803eba491585395da0443aec04
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.
This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.
Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
We have some issues with the SUBSTRING function on CockroachDB, so for now we
are removing it from the SQL query and replacing it with Go code.
Change-Id: I5be921211067d42e7d1a4997076bcfdbed9617a1
We want to stop using the serial_numbers table in satelliteDB. One of the last places using the serial_numbers table is when storagenodes settle orders, we look up the bucket name and project ID from the serial number from the serial_numbers table.
Now that we have support to add encrypted metadata into the OrderLimit, this PR makes use of that and now attempts to read the project ID and bucket name from the encrypted orderLimit metadata instead of from the serial_numbers table. For backwards compatibility and to ensure no errors, we will still fallback to the old way of getting that info from the serial_numbers table, but this will be removed in the next release as long as there are no errors.
All processes that create orderLimits must have an orders.encryption-keys set. The services that create orderLimits (and thus need to encrypt the order metadata) are the satellite apiProcess, the repair process, audit service (core process), and graceful exit (core process). Only the satellite api process decrypts the order metadata when storagenodes settle orders. This means that the same encryption key needs to be provided in the config for the satellite api process, repair process, and the core process like so:
orders.include-encrypted-metadata=true
orders.encryption-keys="<encryptionKeyID>=<encryptionKey>"
Change-Id: Ie2c037971713d6fbf69d697bfad7f8b672eedd66
iterator
This method replaces `deleteByPrefix`, as at the moment the only function of
that method was to delete objects in a bucket.
Change-Id: I5266103672003fbd64f3847f53760b1ba0016fe2
Which database is accessed and how it internally does migrations are
implementation details and do not belong in the requirements interface.
Change-Id: Ia4a6994f39470063a96a8e5f3a1bd27aa79fe5cd
WHAT:
POST request to get gateway credentials using access grant.
Put the request URL in the config and use it for the request.
WHY:
to show gateway credentials on UI
Change-Id: I15ef43ecdeed69b0961d5796aacb47f36d560b1b
these are unchanged from storj.io/dbx.
we're importing them because in a later commit we
will change them, and it'd be nice to see that
diff as a separate commit.
Change-Id: I8315130ed6bab397bd65b9a1a90c29d130b8c02d
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.
Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.
Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
We did not test the SegmentHealth function with actual production
values, and it turns out that values such as 52 healthy, 35 minimum
result in +Inf segment health - so pretty much all segments put into the
repair queue have the same health, which means we effectively aren't
sorting by health.
This change inserts numHealthy as segment health into the database so
the segments are ordered as they were before. We need to refine the
SegmentHealth function before we can support multi RS.
Change-Id: Ief19bbfee3594c5dfe94ca606bc930f05f85ff74
The PieceTracker receives a piece count per node, which gives a good
approximation of the number of nodes that are going to be reported by the
metainfo loop, so we can use it as a good guess of the map's size and
initialize the map with it.
Change-Id: I644db40926c03e4c457457fb41d2ec1da059cea6
MetadataSize can vary slightly, and checking for an exact value makes it
difficult to change what's being encoded in the metadata.
Change-Id: I5f1ade41bc26d115e6743367ee35cf1ba74795c9
Otherwise, if left to default version 0, the iterator will include the
cursor item in the result, which fails some tests.
Change-Id: I85103a36852477f371ec46c673a82c2e129978b7
We now have the piece hashes verified for all segments on all production
satellites. We can remove the code that handles the case where piece
hashes are not verified. This will make the migration of services from
PointerDB to the new metabase easier.
For consistency, PieceHashesVerified is still set to true in PointerDB
for new segments.
Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183
Currently flag parsing seems to call Set twice, which causes problems
with encryption keys. For now, we clear the existing keys on every Set.
Change-Id: Id5c695b4020194ac1c50a2da9c7d2a896cb9216f
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.
The new format for RS override values is
"k/o/n-override,k/o/n-override..."
Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.
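A sketch of the chunked delete loop under these assumptions: illustrative table and column names, the 1000-row default mentioned above, and CockroachDB's DELETE ... LIMIT syntax:

```go
package dbcleanup

import (
	"context"
	"database/sql"
	"time"
)

// deleteExpiredSerials deletes expired serials in batches of 1000 so a single
// statement never touches too many rows at once.
func deleteExpiredSerials(ctx context.Context, db *sql.DB, now time.Time) error {
	for {
		res, err := db.ExecContext(ctx,
			`DELETE FROM serial_numbers WHERE expires_at < $1 LIMIT 1000`, now)
		if err != nil {
			return err
		}
		deleted, err := res.RowsAffected()
		if err != nil {
			return err
		}
		if deleted < 1000 {
			return nil // last partial batch: nothing more to delete
		}
	}
}
```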
Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
Before manipulating order information on storagenodes we need to wait
for the orders to propagate to the database. Some of that happens
async with uplink.
Change-Id: Iaacfd7db0909ab5d2831d06388e5fb27b6d4778f
Firstly, this changes the repair functionality to return Canceled errors
when a repair is canceled during the Get phase. Previously, because we
do not track individual errors per piece, this would just show up as a
failure to download enough pieces to repair the segment, which would
cause the segment to be added to the IrreparableDB, which is entirely
unhelpful.
Then, ignore Canceled errors in the return value of the repair worker.
Apparently, when the worker returns an error, that makes Cobra exit the
program with a nonzero exit code, which causes some piece of our
deployment automation to freak out and page people. And when we ask the
repair worker to shut down, "canceled" errors are what we _expect_, not
an error case.
Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435
The old iterator returns object keys without prefixes; this helps to reduce
the bandwidth from the database. The endpoint also doesn't send the
prefixes.
Change-Id: I77d85dae671ee3a16abe75db14e19674e80abaf4
to metabase
* EncryptedMetainfoEncryptedKey added to CommitSegment and
UpdateMetadata request
* EncryptedMetainfoEncryptedKey returned with GetObject response and all
delete responses
* EncryptedMetainfoEncryptedKey returned with object iterator results
Change-Id: I917541ab5f3e1863bc8f238d17a15fbf72a23025
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.
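A minimal sketch of the increment-by-zero pattern, assuming the usual monkit package scope; the surrounding checker code is omitted and the helper is illustrative:

```go
package checker

import monkit "github.com/spacemonkeygo/monkit/v3"

var mon = monkit.Package()

// reportBelowMinReq increments the counter by zero for healthy segments so
// the metric always reports data, and by one when a segment is below the
// minimum required piece count.
func reportBelowMinReq(numHealthy, minReq int) {
	below := int64(0)
	if numHealthy < minReq {
		below = 1
	}
	mon.Counter("checker_segments_below_min_req").Inc(below)
}
```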
Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
This change is adjusting metainfo endpoint to use metabase for uploading
and downloading remote objects. Inline segments will be added later.
Change-Id: I109d45bf644cd48096c47361043ebd8dfeaea0f3
This PR does the following three things:
1. Defines a high-level interface for this wasm package
- All return values from this package will be wrapped with a
result object that contains a value field and an error field
(see the sketch after this list)
2. Exposes two new functions to allow users to add permissions for a
given API key
- newPermission()
- setAPIKeyPermission()
3. Adds API documentation for the newly added API functions
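A hedged sketch of the result-object convention from point 1, using syscall/js; the wrapping helper and the stub body are assumptions, only the exported function name comes from the list above:

```go
//go:build js && wasm

package main

import "syscall/js"

// result wraps every return value in a {value, error} object.
func result(value interface{}, err error) js.Value {
	errStr := ""
	if err != nil {
		errStr = err.Error()
	}
	return js.ValueOf(map[string]interface{}{
		"value": value,
		"error": errStr,
	})
}

func main() {
	// newPermission is named in the description; its body here is a stub.
	js.Global().Set("newPermission", js.FuncOf(func(this js.Value, args []js.Value) interface{} {
		return result(map[string]interface{}{"allowDownload": false}, nil)
	}))
	// Keep the Go runtime alive so the exported functions remain callable.
	select {}
}
```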
Change-Id: Id995189702b369bba18fa344bef4ddfb0f3f1f44
While resolving conflicts with `master` I missed this change which is
needed e.g. to run storj-sim.
Change-Id: I56a548ed92b978510526c26c81af03051acfde2f
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.
RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).
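A sketch of what a pflag-compatible RSConfig value can look like in the "k/m/o/n-shareSize" form; field names and validation details are illustrative, not necessarily those merged here:

```go
package metainfo

import (
	"fmt"
	"strconv"
	"strings"
)

// RSConfig holds the four thresholds and the share size.
type RSConfig struct {
	Min, Repair, Success, Total int
	ShareSizeBytes              int64
}

func (c *RSConfig) Type() string { return "metainfo.RSConfig" }

func (c *RSConfig) String() string {
	return fmt.Sprintf("%d/%d/%d/%d-%d", c.Min, c.Repair, c.Success, c.Total, c.ShareSizeBytes)
}

// Set implements the pflag.Value interface so the whole config can be parsed
// from a single "k/m/o/n-shareSize" string flag.
func (c *RSConfig) Set(s string) error {
	schemeAndSize := strings.SplitN(s, "-", 2)
	if len(schemeAndSize) != 2 {
		return fmt.Errorf("rs config %q must be of the form k/m/o/n-shareSize", s)
	}
	parts := strings.Split(schemeAndSize[0], "/")
	if len(parts) != 4 {
		return fmt.Errorf("rs config %q must have exactly 4 thresholds", s)
	}
	values := make([]int, 4)
	for i, p := range parts {
		v, err := strconv.Atoi(p)
		if err != nil {
			return fmt.Errorf("invalid threshold %q: %w", p, err)
		}
		values[i] = v
	}
	size, err := strconv.ParseInt(schemeAndSize[1], 10, 64)
	if err != nil {
		return fmt.Errorf("invalid share size %q: %w", schemeAndSize[1], err)
	}
	c.Min, c.Repair, c.Success, c.Total, c.ShareSizeBytes = values[0], values[1], values[2], values[3], size
	return nil
}
```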
Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
A change was made to use a metabase.SegmentKey (a byte slice alias)
as the last seen item to iterate through the irreparable DB in a
for loop. However, this SegmentKey was not initialized, thus it was
nil. This caused the DB query to return nothing, and healthy segments
could not be cleaned out of the irreparable DB.
Change-Id: Idb30d6fef6113a30a27158d548f62c7443e65a81
WHAT:
an endpoint for changing the user's email and the appropriate service method were implemented
WHY:
make it possible to change a user's email for a temporary FileZilla account
Change-Id: Ieea41bf49819a42b5f433e8dfaeec24c6d5ddc9f
Storage nodes undergoing Graceful Exit have up to now been receiving
hostnames for all other storage nodes they need to contact when
transferring pieces. This adds up to a lot of DNS lookups, which
apparently overwhelm some home routers. There does not seem to be any
need for us to send hostnames for graceful exit as opposed to IP
addresses; we already use IP addresses (as given by the last_ip_port
column in the nodes table) for all the GET and PUT orders we send out.
This change causes IP addresses to be used instead.
I started trying to construct a test to ensure that the behavior
changed, but it was rabbit-holing, so I've begun to feel that maybe this
change doesn't require one; it is a very simple change, and very much of
the same nature as what we already do for IPs in CreateGetOrderLimits
and CreatePutOrderLimits (and others).
Change-Id: Ib2b5ffe7a9310e9cdbe7464450cc7c934fa229a1
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some tests which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
After moving SatStreamID and SatSegmentID from common I missed changing
some methods in metainfo endpoint. This change is a fix for that.
Change-Id: I34e121fce47371ee4cfd92cce03809520b68859f
After moving SatStreamID and SatSegmentID from common I missed changing
some methods in metainfo endpoint. This change is a fix for that.
Change-Id: I3344623dc7acfa73db6c20cd3212301e74335857
We have some types that are only valid for satellite usage. Such types
are SatStreamID and SatSegmentID. This change moves those types to
storj/storj and adds basic infrastructure for generating code.
Change-Id: I1e643844f947ce06b13e51ff16b7e671267cea64
Some of metainfo endpoint methods are not used but we still have
implementation there. This change removes unused code and returns
unimplemented error for those methods.
Change-Id: I74e75e0caff76a4f5d119ee989b687b4e9d6e6f9
This change removes the unused 'createRequests' struct. As far as I remember, it
was used to help validate the old metainfo beginObject/commitObject flow.
Change-Id: I0f139b9934196d73f26eafa347ba5605722f3a55
As part of the Metainfo Refactoring, we need to make the Metainfo Loop
working with both the current PointerDB and the new Metabase. Thus, the
Metainfo Loop should pass to the Observer interface more specific Object
and Segment types instead of pb.Pointer.
After this change, there are still a couple of use cases that require
access to the pb.Pointer (hence we have it as a field in the
metainfo.Segment type):
1. Expired Deletion Service
2. Repair Service
It would require additional refactoring in these two services before we
are able to clean this up.
Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.
Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
This fixes a slow query that was taking up to 4 seconds in production:
SELECT node_id, path, piece_num, root_piece_id, durability_ratio, queued_at, requested_at, last_failed_at, last_failed_code, failed_count, finished_at, order_limit_send_count
FROM graceful_exit_transfer_queue
WHERE node_id = '[redacted]'
AND finished_at is NULL
AND last_failed_at is NULL
ORDER BY durability_ratio asc, queued_at asc LIMIT 300 OFFSET 0;
Change-Id: Ib89743ca35f1d8d0a1456b20fa08c683ebdc1549
Fix the DeleteAccount handler to return the 501 HTTP status code because
it's the code that corresponds to "Not Implemented".
Add a black box test for DeleteAccount to ensure that it always returns
an error response because, at this time, we don't allow deleting
accounts through the API.
This test was not added to the corresponding commit
https://review.dev.storj.io/c/storj/storj/+/2712 due to the rush to
fix it.
Change-Id: Ibcf09e2ec52f182a8a580d606c457328d94c8b60
A year ago we made the audit service delete expired segments.
Meanwhile, we introduced an expired deletion sub-service in the
metainfo service whose sole purpose is deleting expired segments.
Therefore, we are now removing this responsibility from the audit
service. It will continue to avoid reporting failures on expired
segments, but it will not delete them anymore.
We do this to clean up responsibilities in advance of the metainfo
refactoring.
Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4
In production we are seeing ~115 storage nodes (out of ~6,500) are not using the new SettlementWithWindow endpoint (but they are upgraded to > v1.12).
We analyzed data being reported by monkit for the nodes who were above version 1.11 but were not successfully submitting orders to the new endpoint.
The nodes fell into a few categories:
1. Always fail to list orders from the db; never get to try sending orders from the filestore
2. Successfully list/send orders from the db; never get to calling satellite endpoint for submitting filestore orders
3. Successfully list/send orders from the db; successfully list filestore orders, but satellite endpoint fails (with "unauthenticated" drpc error)
The code changes here add the following to address these issues:
- modify the query for ordersDB.listUnsentBySatellite so that we no longer select expired orders from the unsent_orders table
- always process any orders that are in the ordersDB and also any orders stored in the filestore
- add monkit monitoring to filestore.ListUnsentBySatellite so that we can see the failures/successes
Change-Id: I0b473e5d75252e7ab5fa6b5c204ed260ab5094ec
This preserves the last_ip_and_port field from node lookups through
CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later
calls to (*Verifier).GetShare() can try to use that IP and port. If a
connection to the given IP and port cannot be made, or the connection
cannot be verified and secured with the target node identity, an
attempt is made to connect to the original node address instead.
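A minimal sketch of that fallback ordering; dialFunc and dialWithFallback are hypothetical stand-ins for the verifier's actual dialing code, only the try-cached-address-first behavior follows this change:

```go
package audit

import "context"

// dialFunc stands in for dialing a storage node's piecestore at an address.
type dialFunc func(ctx context.Context, addr string) (conn interface{}, err error)

// dialWithFallback tries the cached last_ip_and_port first and falls back to
// the node address supplied by the SNO if that connection cannot be made
// (or cannot be verified against the node identity).
func dialWithFallback(ctx context.Context, dial dialFunc, lastIPPort, nodeAddress string) (interface{}, error) {
	if lastIPPort != "" {
		if conn, err := dial(ctx, lastIPPort); err == nil {
			return conn, nil
		}
	}
	return dial(ctx, nodeAddress)
}
```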
A similar change is not necessary to the other Create*OrderLimits
functions, because they already replace node addresses with the cached
IP and port as appropriate. We might want to consider making a similar
change to CreateGetRepairOrderLimits(), though.
The audit situation is unique because the ramifications are especially
powerful when we get the address wrong. Failing a single audit can have
a heavy cost to a storage node. We need to make extra effort in order
to avoid imposing that cost unfairly.
Situation 1: If an audit fails because the repair worker failed to make
a DNS query (which might well be the fault on the satellite side), and
we have last_ip_and_port information available for the target node, it
would be unfair not to try connecting to that last_ip_and_port address.
Situation 2: If a node has changed addresses recently and the operator
correctly changed its DNS entry, but we don't bother querying DNS, it
would be unfair to penalize the node for our failure to connect to it.
So the audit worker must try both last_ip_and_port _and_ the node
address as supplied by the SNO.
We elect here to try last_ip_and_port first, on the grounds that (a) it
is expected to work in the large majority of cases, and (b) there
should not be any security concerns with connecting to an out-of-date
address, and (c) avoiding DNS queries on the satellite side helps
alleviate satellite operational load.
Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200
This change allows the creation and deletion of api keys via the admin API.
It adds two methods for deletion, one via the name and projectID and the
second one via the serialized apikey directly.
Change-Id: Ida8aa729e716db58c671a901e5f7e39253e89a0d
We are moving an error into rejectErr since it's preventing storage nodes from being able to settle other orders.
Change-Id: I3ac97c340e491b127f5e0024c5e8bd9f4df8d5c3
Use tagsql.DB pointer as step database, to propagate changes
back and forth between actual database and migration.
Adds CreateDB operation to the migration step to be able to
create new dbs before executing migration action.
Adjusts storagenode database migration to use inner tagsql.DB
pointer of each database as step.DB.
Adjusts satellite database migration, adds proxy migrationDB field
to satellite db that wraps itself as tagsql.DB, pointer of which
is used as step.DB.
Change-Id: Ifed4de5b01a356cf7b37db64d2eaeb7b61982c5c
This change completes the column migration of
5f6fccc6e8 and
2f648fd981.
It resets the project limits of every user whose limits are below or
equal to our current production defaults.
Change-Id: Ie041d08bb67b62844f6023190fc00bc2dad5b1cb
WHAT:
Now the project count limit is taken from the users DB instead of the config. But if the DB value is 0, then the default config value will be used instead.
WHY:
this will allow us to change a user's project limit by changing the DB value.
Change-Id: I9edcd0bf9eaae5fe40e90a44cac82d9ce8519274
This makes it possible to remove this obsolete flag from the
multi-tenant gateway.
As a consequence, displaying the GATEWAY_0_ACCESS env var will always
require a running storj-sim. Until now, it was required only the first
time. Then the value was stored in the 'access' config. But this is now
not possible anymore.
The changes in StripeMock are required to fix failures in integration
tests. StripeMock is in-memory and its data does not survive restarts of
storj-sim. The second and following starts of storj-sim had invalid
state of StripeMock, which failed requests that were required to
populate the GATEWAY_0_ACCESS env var. The changes in StripeMock make
it repopulate the Stripe customers from the database.
Change-Id: I981a208172b76577f12ecdaae485f5ae4ea269bc
In the same way that our Admin API currently handles project and account deletions, we would like
to have the same checks on the user-facing API. This PR adds those checks to the console service.
More generally applicable checks have been moved directly into the payments service.
In addition, it adds the BucketsDB to the console DB, to have easier access and to avoid import cycles with
the metainfo package.
A small cleanup around our unnecessary monkit imports made it in as well.
Change-Id: I8769b01c2271c1687fbd2269a738a41764216e51
holding it during node i/o means slow nodes can hold up order
processing for everyone else. this dramatically increases
the amount of time spent handling orders.
Change-Id: Iec999b7ed0817c921a0fd039097a75bdd3c70ea2
Doing it at the ProcessOrders level was insufficient: the endpoints
make multiple database calls. It was a misguided attempt to only
have one spot enter the semaphore. By putting it in the endpoint
we can not only be sure that the concurrency is correctly limited
but also make it easily configurable.
Change-Id: I937149dd077adf9eb87fce52a1a17dc0afe96f64
we have thundering herds of order submissions that take all of the
database connections causing temporary periodic outages. limit
the amount of concurrent order processing to 2.
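A sketch of such a concurrency limit using a channel-based semaphore; the type and constructor are illustrative, only the limit of 2 comes from this change:

```go
package orders

import "context"

// endpointSemaphore bounds how many order-processing calls run concurrently.
type endpointSemaphore chan struct{}

func newEndpointSemaphore(limit int) endpointSemaphore {
	return make(endpointSemaphore, limit)
}

// Acquire blocks until a slot is free or the context is canceled.
func (s endpointSemaphore) Acquire(ctx context.Context) error {
	select {
	case s <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Release frees a slot acquired earlier.
func (s endpointSemaphore) Release() { <-s }

// Usage: sem := newEndpointSemaphore(2); acquire before touching the
// database, release when the settlement call finishes.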
Change-Id: If3f86cdbd21085a4414c2ff17d9ef6d8839a6c2b
Adds membership checks for the following calls:
- GetProject
Adds ownership checks for the following calls:
- DeleteProject
It also disables the API endpoint to delete a project.
Furthermore it adds tests for the console service.
Change-Id: I1ffc8dcb44746a74ad06a7dbd064a29c57c25272
The VerifyPieceHashes method has a sanity check for the number of pieces to
be removed from the pointer after the audit for verifying the piece
hashes.
This sanity check failed when we executed the command on the production
satellites because the Verify command removes Fails and PendingAudits
nodes from the audit report if piece_hashes_verified = false.
A new temporary UsedToVerifyPieceHashes flag is added to
audits.Verifier. It is set to true only by the verify-piece-hashes
command. If the flag is true then the Verify method will always include
Fails and PendingAudits nodes in the report.
Test case is added to cover this use case.
Change-Id: I2c7cb6b12029d52b2fc565365eee0826c3de6ee8
To avoid further name collisions, the very broadly named package gets moved into
the consoleauth package, where it's also mainly being used.
Change-Id: Ie563c9700adbf0553baca2b7b8ba4a1d9c29d144
This change adds the capability to adjust a user's project limit via the Admin API.
Adds a test for the newly added API function and updates the existing tests.
It renames the JSON field on the user struct to be more consistent.
Change-Id: I9018acd80dae0af68d1d50526f20987132c654f3
Previously if a node did not have audit history data for each of the
windows over the tracking period, we would give them the benefit of
the doubt and set their score to 1. This was to prevent nodes from
being suspended right out the gate. We need a minimum amount of data
to evaluate them.
However, a node who is actually failing at being online will have no
idea until they have received enough audits and we suspend them.
Instead, we will always use their real score, but use a flag to determine
whether they are eligible for suspension/dq.
Change-Id: I382218f12e8770f95d4bcddcf101ef348940cadf
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.
solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.
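A sketch of that cleanup step, assuming the injuredsegments table and updated_at column from the description; the query itself is illustrative:

```go
package checker

import (
	"context"
	"database/sql"
	"time"
)

// cleanRepairQueue removes any injured segment that was not inserted or
// updated during the last checker iteration, i.e. it is considered healthy
// again.
func cleanRepairQueue(ctx context.Context, db *sql.DB, checkerStart time.Time) (int64, error) {
	res, err := db.ExecContext(ctx,
		`DELETE FROM injuredsegments WHERE updated_at < $1`, checkerStart)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}
```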
Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
Jira: https://storjlabs.atlassian.net/browse/PG-69
There are a number of segments (with piece_hashes_verified = false in
their metadata) on US-Central-1, Europe-West-1, and Asia-East-1
satellites. Most probably, this happened due to a bug we had in the
past. We want to verify them before executing the main migration to
metabase. This would simplify the main migration to metabase with one
less issue to think about.
Change-Id: I8831af1a254c560d45bb87d7104e49abd8242236
We have configlock_test for checking changes so the separate script and
makefile target are not needed.
Change-Id: I2bbc1c21ad849c9b7ec8bba43c0e11e94e04f6a6
Currently Cockroach migration test is the most heavy with regards to
schema changes. This causes other tests to time out. This adds an
alternate cockroach instance that is used for migration tests.
Change-Id: I01fe9313527ff002f0bb0914dd52c3645b8eaf6d
Currently a user is only able to create a project if either
a STORJ deposit or CC was added to their account. With this change, an existing
coupon is also valid to let the user proceed.
Change-Id: I7be8d2d9ec58a15c50755b3fe33af04d2fd64ea2
This PR adds the following items:
1) an in-memory read-only cache that stores project limit info for projectIDs
This cache is stored in-memory since this is expected to be a small amount of data. In this implementation we are only storing in the cache projects that have been accessed. Currently for the largest Satellite (eu-west) there is about 4500 total projects. So storing the storage limit (int64) and the bandwidth limit (int64), this would end up being about 200kb (including the 32 byte project ID) if all 4500 projectIDs were in the cache. So this all fits in memory for the time being. At some point it may not as usage grows, but that seems years out.
The cache is a read only cache. When requests come in to upload/download a file, we will read from the cache what the current limits are for that project. If the cache does not contain the projectID, it will get the info from the database (satellitedb project table), then add it to the cache.
The only time the values in the cache are modified is when either a) the project ID is not in the cache, or b) the item in the cache has expired (default 10 mins); then the data gets refreshed from the database. This occurs by default every 10 mins. This means that if we update the usage limits in the database, that change might not show up in the cache for 10 mins, which means it will not be reflected when limiting end users uploading/downloading files for that time period.
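A rough sketch of such a read-through cache; the type names, uuid dependency, and locking details are illustrative, only the TTL-based refresh behavior follows the description:

```go
package accounting

import (
	"context"
	"sync"
	"time"

	"github.com/google/uuid"
)

// ProjectLimits holds the cached per-project values.
type ProjectLimits struct {
	StorageLimit   int64
	BandwidthLimit int64
}

// limitsDB stands in for the satellitedb project table lookup.
type limitsDB interface {
	GetProjectLimits(ctx context.Context, projectID uuid.UUID) (ProjectLimits, error)
}

type cacheEntry struct {
	limits    ProjectLimits
	fetchedAt time.Time
}

// ProjectLimitCache is a read-through cache with a TTL (10 minutes in the
// description).
type ProjectLimitCache struct {
	db  limitsDB
	ttl time.Duration

	mu      sync.Mutex
	entries map[uuid.UUID]cacheEntry
}

func NewProjectLimitCache(db limitsDB, ttl time.Duration) *ProjectLimitCache {
	return &ProjectLimitCache{db: db, ttl: ttl, entries: make(map[uuid.UUID]cacheEntry)}
}

// GetProjectLimits returns the cached limits, refreshing from the database
// when the project is missing from the cache or the entry has expired.
func (c *ProjectLimitCache) GetProjectLimits(ctx context.Context, projectID uuid.UUID) (ProjectLimits, error) {
	c.mu.Lock()
	entry, ok := c.entries[projectID]
	c.mu.Unlock()

	if ok && time.Since(entry.fetchedAt) < c.ttl {
		return entry.limits, nil
	}

	limits, err := c.db.GetProjectLimits(ctx, projectID)
	if err != nil {
		return ProjectLimits{}, err
	}

	c.mu.Lock()
	c.entries[projectID] = cacheEntry{limits: limits, fetchedAt: time.Now()}
	c.mu.Unlock()
	return limits, nil
}
```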
Change-Id: I3fd7056cf963676009834fcbcf9c4a0922ca4a8f
Our current endpoints bail on us if the column data is null. Thus we need
to take the intermediate step of setting the default to a fixed value and
resetting those with the following release.
It sets the default column value to our current config values of 50GB
for storage and bandwidth and 100 buckets, while still enabling the field to be nullable.
All 0 values are migrated to be the default as well to ensure they can
keep using their projects, since with the original change, 0 actually means 0.
Change-Id: I797be80ce2d2105091599dc1b3fc76f74336b66b
Currently we have no way to actually set one
of the following limits to 0 (meaning not usable):
- maxBuckets
- usageLimit
- bandwidthLimit
With having the field nullable,
NULL corresponds to the global default,
0 now actually means 0, and
a set value determines a custom limit.
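A minimal sketch of that three-way interpretation on the Go side, assuming the nullable column is scanned into a *int64; the helper name is illustrative:

```go
package console

// effectiveLimit resolves a nullable limit column:
// NULL falls back to the global default, 0 really means 0 (not usable),
// and any other value is a custom limit.
func effectiveLimit(columnValue *int64, globalDefault int64) int64 {
	if columnValue == nil {
		return globalDefault
	}
	return *columnValue
}
```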
Change-Id: I92bb77529dcbd0881ae8368921be9d246eb0919e
Jira: https://storjlabs.atlassian.net/browse/PG-67
There are a number of old-style objects (without the number of segments
in their metadata) on US-Central-1, Europe-West-1, and Asia-East-1
satellites. We want to migrate their metadata to contain the number of
segments before executing the main migration to metabase. This would
simplify the main migration to metabase with one less issue to think
about.
Change-Id: I42497ae0375b5eb972aab08c700048b9a93bb18f
WHAT:
added functionality for user to update project name. Logic only, without actual GUI updates.
WHY:
better user experience
Change-Id: I1e38e33ba827b0bdf2c89e29de24e4e87edb474a
nodes are submitting using both the legacy and windowed endpoints
and thus having their legacy submissions rejected.
it is legal to use both the legacy and windowed endpoints
in phase1 since they use the same backend. the legacy endpoint
is disabled in phase2 and phase3.
therefore, if we wait an order expiration period (2 days) after
we determine enough nodes have started using the windowed
endpoint, we can be sure that any orders they did have to
submit with the legacy endpoint will have expired.
Change-Id: I4418a881bf8bb9377efaef4c651e6103a5dc6ed0
Another change which is part of the refactoring to replace the path
parameter (string/[]byte) with a key parameter (metabase.SegmentKey).
Change-Id: I617878442442e5d59bbe5c995f913c3c93c16928
objectdeletion.ObjectIdentifier with metabase.ObjectLocation
Another change to use metabase.ObjectLocation across satellite codebase
to avoid duplication and provide better type safety.
Change-Id: I82cb52b94a9107ed3144255a6ef4ad9f3fc1ca63
WHAT:
notification bar added to project dashboard page. It is shown when projects count limit is reached.
Create project button is removed after creating last available project
WHY:
inform user that their projects count limit was reached
Change-Id: If0d67148003be40cc9eb4d8b25cc17f8204008d4
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.
Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
`metabase.SegmentKey` in TransferQueueItem
We are unifying which name (and type) we use for the value that points to a
segment. We want to use `key` instead of `path`.
The dedicated type `metabase.SegmentKey` was created for this purpose as well.
This change does the refactoring around gracefulexit.
Change-Id: I90d51ff087b206179e61d5f1bc95f4709d76f917
This PR updates `uplink rb --force` command to use the new libuplink API
`DeleteBucketWithObjects`.
It also updates `DeleteBucket` endpoint to return a specific error
message when a given bucket has concurrent writes while being deleted.
Change-Id: Ic9593d55b0c27b26cd8966dd1bc8cd1e02a6666e
This PR fixes a deadlock that can happen when the number of piece
deletion requests is different from the distinct node count from those
requests. The success threshold should be based on the number of nodes
instead of the number of requests.
Change-Id: I83073a22eb1e111be1e27641cebcefecdc16afcb
This change forces the test of GetObjectIPs to use multiple remote
segments (earlier versions of the test were accidentally using inline
segments). This change also revealed a small bug in the for loop code,
which is fixed.
Change-Id: Ic486b079d221952ba13553acf0ca41a8873f3f21
* The audit worker wants to get items from the queue and process them.
* The audit chore wants to create new queues and swap them in when the
old queue has been processed.
This change adds a "Queues" struct which handles the concurrency
issues around the worker fetching a queue and the chore swapping a new
queue in. It simplifies the logic of the "Queue" struct to its bare
bones, so that it behaves like a normal queue with no need to understand
the details of swapping and worker/chore interactions.
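A hedged sketch of the Queues/Queue split described here; element types, method names, and the swap rule shown are illustrative:

```go
package audit

import (
	"errors"
	"sync"
)

// Queue is the bare-bones FIFO the worker consumes.
type Queue struct {
	items []string
}

// Pop returns the next item, or false when the queue is empty.
func (q *Queue) Pop() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}

var ErrQueueInProgress = errors.New("audit: previous queue not finished")

// Queues handles the concurrency between the worker fetching the active
// queue and the chore swapping a new one in.
type Queues struct {
	mu    sync.Mutex
	queue *Queue
}

// Fetch returns the currently active queue for the worker.
func (q *Queues) Fetch() *Queue {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queue == nil {
		q.queue = &Queue{}
	}
	return q.queue
}

// Swap installs a new queue, refusing if the old one still has items.
func (q *Queues) Swap(next *Queue) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queue != nil && len(q.queue.items) > 0 {
		return ErrQueueInProgress
	}
	q.queue = next
	return nil
}
```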
Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b
Add online score used for the new audit history offline tracking system
to the nodes table. This allows us easy access to the node's online
score for the storagenode dashboard as well as for data analysis.
Change-Id: Ie99be1192e5236862a5b3dbed2e5ef03b9169410
We were seeing an error on the last day of the month with TestProjectAllocatedBandwidthRetainTwo.
This is because AddDate normalizes its result in the same way that Date does; for example,
adding one month to October 31 yields December 1, the normalized form for November 31.
I also fixed a minor UTC issue with this test as well.
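A small illustration of the AddDate normalization and one way to avoid it by anchoring to day 1 with time.Date; the workaround shown is illustrative, not necessarily the exact fix:

```go
package accounting_test

import (
	"testing"
	"time"
)

func TestAddDateNormalization(t *testing.T) {
	oct31 := time.Date(2020, time.October, 31, 0, 0, 0, 0, time.UTC)

	// AddDate normalizes: October 31 + 1 month = November 31 = December 1.
	if got := oct31.AddDate(0, 1, 0); got.Month() != time.December {
		t.Fatalf("expected December, got %s", got.Month())
	}

	// Building the date with day 1 avoids the end-of-month normalization.
	firstOfNext := time.Date(oct31.Year(), oct31.Month()+1, 1, 0, 0, 0, 0, time.UTC)
	if firstOfNext.Month() != time.November {
		t.Fatalf("expected November, got %s", firstOfNext.Month())
	}
}
```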
Change-Id: I0157873e7befa57810e5f264a922b188890fa46a
satellite.DB.Console().Projects().GetAll database query
can be replaced with planet.Uplinks[0].Projects[0].ID
Change-Id: I73b82b91afb2dde7b690917345b798f9d81f6831
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.
In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.
Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.
Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
Currently there is confusion between responsibilities of
metainfo.Endpoint, metainfo.Service, PointerDB.
Separating the database "service" and its types into a separate package
allows us to disentangle them.
This gives us responsibilities:
1. metainfo.Endpoint - translates requests and permissions
2. metainfo.Service - handles requests and coordinates with
objectdeletion, piecedeletion, metabase
3. metabase.Service - communication with the database interface and invariants
Currently metabase will contain the types necessary to coordinate
information.
Change-Id: If8c992b4b9d9e70a56bbd8a378a5af6b1a2ec34e