This change adds the ability to download byte ranges
to uplinkng.
Extended the uplinkng Filesystem interface with Stat
method and an OpenOptions struct as parameter for the
Open method.
Also added a few tests for the ranged download
Change-Id: I89a7276a75c51a4b22d7a450f15b3eb18ba838d4
Gateway-ST frequent release cycle has been resurrected, which means it's
safer to use the latest release tag in the storj repository's CI now.
Change-Id: I9df1c789a9b9418ba7cceaec9cfec3cc6c448284
Currently slower storagenodes can slow down deletion queue.
To make piece deletion faster reduce the maximum time spent in
either dialing or piece deletion requests.
With this change:
* dial timeout is 3s
* request timeout is 15s
* fail threshold is set to 10min
Similarly, we'll mark storage node as failed when the timeout occurs.
The timeout usually indicates that the storagenode is overwhelmed.
Garbage collection will ensure that the pieces get deleted eventually.
Change-Id: Iec5de699f5917905f5807140e2c3252088c6399b
To make our free tier limits more clear, we will reduce the number of projects allowed from 3 to 1, and increase the storage and bandwidth limit of the free tier from 50 Gb to 150 GB. The total allotments across all projects for a given user are unchanged, just reduced to a single project.
Change-Id: Ic8dddb135f2b83a3f36e2b9fdcb477e351ec137b
Multipart upload limits added. Last part has no size limit.
Max number of parts: 10000, min part size: 5 MiB
Change-Id: Ic2262ce25f989b34d92f662bde720d4c4d0dc93d
https://storjlabs.atlassian.net/browse/PG-305
we should extend method move of cmd/uplink/cmd/mv.go
if both parameters end with slash -
i should list all files and call move method in loop
parameters:
uplink mv sj://bucket/a/prefix/ sj://new-bucket/a/new-prefix/
Change-Id: Ic24c2af83153ea60ec74393e65736af094877151
Removes database tables and functionality related to our custom
coupon implementation because it has been superseded by the Stripe
coupon and promo code system. Requires implementations of the
payments Invoices interface to return coupon usages along with
invoices.
Change-Id: Iac52d2ff64afca8cc4dbb2d1f20e6ad4b39ddfde
New command for cli to move object to different
location.
uplink mv sj://bucket/your-object sj://bucket/moved-object
Change-Id: I85a4961aa59f250819954e78f20363ac3c570938
Make bucket command was using full location
specified in command line instead only bucket name.
As an addition change contains basic integration tests
with storj-sim.
Change-Id: Ie3b5283468b7fbde0b1333f01dc4fc2a2952e1a1
Jenkins uses the same folder for different PR-s. Sometimes other runs do
not cleanup after themselves (e.g. timeout), hence add a cleanWs step to
ensure we delete files in the workspace.
gateway-st introduced a replace directive in go.mod, which does not work
with go install. Hardcode to the last version without the directive.
Using this fix to unblock ci builds.
Change-Id: I5e5d75bf47e30a5a8b6d835867c0c9176f25e08a
Use multipart upload to upload single object in parts in
parallel. Its using parallelism flag added earlier.
Change-Id: I45b531a5db43c86f0112a5e3bb4a83bc1d65650f
Adds support for new uplink method DownloadObjectAt which
gives ability to download single object in parallel.
Change-Id: I8388653429992b0d24c383d17d7e90904203fe77
Rate limits application of coupon codes by user ID to prevent
brute forcing. Refactors the rate limiter to allow limiting based
on arbitrary criteria and not just by IP.
Change-Id: I99d6749bd5b5e47d7e1aeb0314e363a8e7259dba
MFA is complete and we are good to enable it in production. This change
removes the flag that disables MFA by default.
Change-Id: I2f985ae501171bdab505d664b43c8cfc248bad8d
package in audit
This PR implements reputation store and replace overlay in audit service
to use such store for storing node's audit stats.
In order to keep the changeset smaller, most of the changes in this PR is for copying audit logic in overlay to
reputation package. In a following PR, the duplicating code will be
removed from overlay.
Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
Full path: satellite/{payments,console},web/satellite
* Adds the ability to apply coupon codes from the billing page in the
satellite UI.
* Flag for coupon code UI is split into two flags - one for the billing
page and one for the signup page. This commit implements the first, but
not the second.
* Update the Stripe dependency to v72, which is necessary to
use Stripe's promo code functionality.
Change-Id: I19d9815c48205932bef68d87d5cb0b000498fa70
Bucket tally calculation will be removed from metaloop and will
use metabase objects iterator directly.
At the moment only bucket tally needs objects so it make no sense
to implement separate objects loop.
Change-Id: Iee60059fc8b9a1bf64d01cafe9659b69b0e27eb1
Added feature flag for MFA
Added new client-side api call to enable MFA returning secret
Updated users Vuex module to include new API call
Change-Id: Ia9e10f68c4a7da39b4f7c1073e657c2de98fb0db
The user must complete a reCAPTCHA in order to register.
ReCAPTCHA verification failure results in rejection of the
registration attempt.
Change-Id: I34ba7db414d756fd1aaebdc3d19cccbfc7fc1ea3
When a user adds a credit card, switch them to the paid tier and update
their projects with new bandwidth/storage limits. New projects for the
paid tier user will also have the updated limits.
The new limits are:
* storage per project - 50 GB free/25 TB paid
* bandwidth per project - 50 GB free/100 TB paid
Change-Id: I7d6467d077e8bb2bbe4bcf88ab8d75490f83165e
Because of our free/paid tier plan, we do not need a paywall anymore. We
have not used it in a while, but still have leftover code laying around.
Change-Id: Iaea8c39faf042a2f7a6b837727bb135c8bdf2907
Adding AS OF SYSTEM TIME to query that is calculating project bandiwdth.
As an addition method for setting interval is added as test doesn't
work well with default interval.
Change-Id: Id1e15be4f6afff13b9dc2b7f595e2edb6de28db9
We used this to reduce initial load on the core to avoid OOM. However,
this is not a problem anymore with garbage collection running
separately.
Change-Id: Ifd62c822a74974bc21a5913199334469a4bc0130
This adds verification for the processed count and before and after
segment/objects table counts.
This adds new flag:
metainfo.segment-loop.suspicious-processed-ratio: 0.03
This defaults to 3%, which at 100M segments is 3M segments.
Change-Id: I5ee03e913ddc4e67e94010ced126a2a9ea51f41b
This adds verification for the processed count and before and after
segment/objects table counts.
This adds new flag:
metainfo.loop.suspicious-processed-ratio: 0.03
This defaults to 3%, which at 100M objects is 3M objects.
Change-Id: Ife5522ecc97bcc5a55667f36868a0f1fc8e4c561
This is part of metaloop refactoring. We plan to remove
irreparable at some point but there was not time for it.
Now instead refatoring it for segmentloop its just easier
to drop it.
Later we still need to drop table with migration step.
Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
Currently we did not limit the "as of system time" for iterating over
objects table. Using just an interval would cause problems with the
tests. That could be overcome skipping that interval for tests
altogether, however, we should probably test those more to ensure that
GC stays working as intended.
This is a safer code, however, maybe not as straigthforward as it could
be.
Change-Id: I374f77783b2af42bb6da846735ceea20a7ce5e60
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.
As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.
This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.
In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.
Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
We want to move some of current metainfo loop observers to
segment loop. This change adds new service, similar to metainfo
loop but which is iterating only over segments.
Change-Id: I67f7f461781723a4476e2b83377f31736d7c4870
Rather than applying our internal satellite implementation of coupons
when new accounts are created, use a configured Stripe coupon instead.
If no configuration is set, no coupon will be applied.
This change also removes logic for adding coupons to customers who pay
with crypto - they will already have the free tier coupon applied
anyway.
We will be phasing out our internal coupon implementation.
Change-Id: Ieb87ddb3412acbc74986aa9d18a4cbd93c29861a
Use the 'AS OF SYSTEM TIME' Cockroach DB clause for the Graceful Exit
(a.k.a GE) queries that count the delete the GE queue items of nodes
which have already exited the network.
Split the subquery used for deleting all the transfer queue items of
nodes which has exited when CRDB is used and batch the queries because
CRDB struggles when executing in a single query unlike Postgres.
The new test which has been added to this commit to verify the CRDB
batch logic for deleting all the transfer queue items of the exited
nodes has raised that the Enqueue method has to run in baches when CRDB
is used otherwise CRDB has return the error "driver: bad connection"
when a big a amount of items are passed to be enqueued. This error
didn't happen with the current test implementation it was with an
initial one that it was creating a big amount of exited nodes and
transfer queue items for those nodes.
Change-Id: I6a099cdbc515a240596bc93141fea3182c2e50a9
The previously configured never-expiring coupon does not refill every
month. Eventually, even though it never expires, it will run out. This
commit makes several small changes to address this issue for the free
tier:
* Change the config for the promotional coupon to be $1.65 for 1 month
(the change from $10 to $1.65 is due to our recent pricing changes)
* Update PopulatePromotionalCoupons (PPC for brevity) to add promotional
coupons to users with expired and consumed coupons (all users with a
project and no active coupons should get a new coupon when PPC is called)
* Call PPC at the end of the `create-invoice-coupons` stage of invoice
generation - after current coupons are processed and expired/exhausted.
* Remove legacy admin functionality for PPC from satellite/console - we
do not currently use it, but if we did, it should be in satellite/admin
instead.
Change-Id: I77727b97bef972df32ebb23cdc05055827076e2a
Allows us to remove the following files from satellite branding
repo, with an up-to-date single source of truth now in storj/storj:
* web/satellite/src/common/registrationSuccess.html
* web/satellite/src/common/registrationSuccess.scss
* web/satellite/src/views/register/registerArea.html
* web/satellite/src/views/register/registerArea.scss
The registrationSuccess files have been removed from all satellites in
the branding repository. The registerArea files have been removed only
from production satellites in the branding repository.
Importantly, this change enables the "resend email" functionality on
production satellites - previously, this functionality was available in
storj/storj, but not our branding repository.
Removes the config for VerificationPageURL, which redirected users away
from the satellite app to storj.io after creating an account. In order
for the email resend button to work, we cannot leave the app.
Adds a new config value for partner satellites, which replaces the
partner satellite names config. The new config includes name and
address. It is validated on setup/run to ensure it can be parsed.
Change-Id: I67db0702d9b9641f1a37b599f2929d56f3c33aca
Co-authored-by: littleskunk <jens.heimbuerge@googlemail.com>
Co-authored-by: JT Olio <hello@jtolio.com>
Co-authored-by: Igor <38665104+ihaid@users.noreply.github.com>
We can be more precise and conservative by using the backend
satellite/analytics service. We also no longer need client-side Segment
scripts.
Change-Id: Ic5fb18bea2d388b586ad773e26027d69bde87294
We already merged the multipart-upload branch to main. These two tools
make only sense if we are migrating a satellite from Pointer DB to
Metabase. There is one remaining satellite to migrate, but these tools
should be used from the respective release branch instead of from main.
Removing these tools from main will:
1) Avoid the mistake to use them from the main branch instead of from
the respective release branch.
2) Allow to finally remove any code related to the old Pointer DB.
Change-Id: Ied66098c5d0b8fefeb5d6e92b5e0ef5c6603df5d
The new default promotional coupon is $10/month, and doesn't expire.
This change also migrates the coupon.duration column over to the new
coupon.billing_periods, and switches to rely completely on
billing_periods.
Change-Id: Ic3341e9fa4040449bab5e66ca4ee2640b095cf3d
* Add a nullable billing_periods column in the coupons table
* Add nullable billing_periods column to the currently unused
coupon_codes table
* Drop the duration column from the coupon_codes table
* Replace duration config type so that the default promotional coupon
can be configured to never expire
Zero downtime migration plan:
* Add billing_periods column to coupons and coupon_codes tables (this change)
* After one release, remove all references to the old duration column,
replacing with references to billing_periods. At this point, we can also
change the defult promotional coupon to never expire and migrate over
values from the old duration column.
* After another release, drop the duration column.
Change-Id: I374e8dc9fab9f81b4a5bc681771955662d4c007a
* integration tests have now a trap that will display where script failed, line and message
* functions from uplink tests moved to utils.sh to reuse them
Change-Id: Ib2311775fd70ce784aa986328969d75eefc5ac36
This change introduces a new config flag,
--overlay.audit-history.offline-suspension-enabled,
to toggle suspending nodes for offline audits.
If the flag is set to true, nodes will be suspended if they meet the
requirements.
If the flag is false, nodes will not be suspended. If they are already
suspended and/or under review, these will be cleared.
Change-Id: Ibeba759c42d6e504f6b7598120d4fd4dab85ca74
- add Credit History table to billing acount page and set up ui for a user adding promo codes
- implement promo codes ui in registration form
- add feature flag to handle if coupon code ui should be rendered
Change-Id: I9fdeef7cffc7901958d3f9be335e1115b2471a2e
A recent change made the default usage/storage limits for projects 50gb
rather than 500gb. This increases the default limit for testversions.
Change-Id: Ibea05c0d0760662e447b6455d560a2a640801c6c
* Set up basic structure of new service.
* Implement a basic analytics track event for user creation.
Change-Id: Ica8c785540b1ef9d848404af307a22f21d33c6aa
For time of transition from pointerdb to metabase we need add migration step to rollingupgrade tests and comment few cases.
Change-Id: Ib12ae6aa14be35f9bf4ff3efb55cfc6957d4ceba
This is one step for implementing the free tier:
* Change the default project limit from 10 to 3
* Move storage and bandwidth project usage limits from the metainfo
package to the console package (otherwise there is a cyclical
dependency, and metainfo doesn't use these values anyway)
* Change the default storage usage limit per project from 500gb to 50gb
* Change the default bandwidth usage limit per project from 500gb to 50gb
* Migrate the database so that old users and projects continue to have
the old defaults (10 projects/500gb usage)
Change-Id: Ice9ee6a738bc6410da18c336c672d3fcd0cab1b9
We are implementing the free tier, which will give all new users 3
projects, 50gb storage, and 50gb bandwidth per project. All users will
receive a recurring coupon to cover this amount of usage.
With the free tier, we no longer need a paywall. Users will not need to
enter a payment method unless they want to increase their project or
usage limits.
Change-Id: If3b026e91858e5f557a2758e366616cecc8f21c7
We would like to log Node IDs and last contact successes of nodes DQd
in this manner. We would also like to avoid returning an unbounded list
of items from the db. Therefore we change the query to select a limited
number of nodes that meet the DQ conditions and iterate until 0 rows are
returned. Each column of the query is already indexed.
Change-Id: Iaec2d9b56e7202b7c2028ba21750d40c8dd506ee
Do not use Docker for running a Redis server in the test-sim that we
check that Satellite can still operate when Redis is unavailable because
we have to run this test in Jenkins and we don't want to empower a
container to be able to connect to the Docker socket of the host machine
for running sibling containers.
Change-Id: I6180e8ed804968c8ccb0783ed334acab38af9a0f
WHAT:
enter passphrase step for users who has already created passphrase
WHY:
to let users proceed to upload step
Change-Id: I084aec5b863981978cf190f99ee95154fbed9aab
WHAT:
beta satellite top banner's copy is changed to include support/feedback URLs
WHY:
so users using our beta satellite will be able to report feedback somewhere
Change-Id: Ibc349c8b3354b577275fcf1d2b75bfdd267729d9
The previous default FlushBatchSize of 10000 was causing major
slow down in select and insert statements on bucket_bandwidth_rollups.
We saw on the saltlake satellite that a FlushBatchSize of 1000 helped
reduce contention and query latency.
Change-Id: Ib95e73482219bc5aedc11925b1849fa5999774ba
WHAT:
config flag to indicate if satellite is in beta
WHY:
to avoid using hardcoded satellite names which may cause issues
Change-Id: If92eb7417c340bf343a9a91e2f6b11f0349020c5
This PR removes all back-end related referral program code including the
marketing portal.
We will have a separate PR for front-end code and database migration to
drop `offers` and `usercredits` table
Change-Id: If59f952cddfe0558a7dc03a0eac7cc1081517f88
Delete satellite order methods and DB tables which aren't used anymore
after we have done a refactoring on the orders to stuck bucket
information in the orders' encrypted metadata.
There are also configuration parameters and a satellite chore that
aren't needed anymore after the orders refactoring.
Change-Id: Ida3682b95921df70792284b42c96d2508bf8ca9c
The rollup archiver chore moves bucket bandwidth rollups and
storagenode rollups that are older than a given duration
to two new archive tables.
Change-Id: I1626a3742ad4271bc744fbcefa6355a29d49c6a5
Create a storj-sim test that checks that uplinks operations works when
satellite runs and can connect to Redis and when it cannot connect to
simulate a Redis downtime. Also verifies that the satellite can start
despite of Redis being downtime.
This test currently doesn't pass and it will be the one used to verify
the work that has to be done to make sure that the satellite allow the
clients to perform their operations despite of Redis being unavailable.
We require these changes before we deploy any customer face satellite on
a multi-region architecture.
NOTE that this test will be added later on to Jenkins to run this test
every time that we apply changes and at that time we'll see if it has to
be adjusted for being able to run on Jenkins because as it's now it may
not work because the scripts start and stop a Redis docker container.
Change-Id: I22acb22f0ca594583e36b45c88f8c03bac73b329
Full scope:
private/testplanet,satellite/{overlay,satellitedb}
Description:
In most cases, downtime tracking with audits will eventually lead
to DQ for nodes who are unresponsive. However, if a stray node has no
pieces, it will not be audited and will thus never be disqualified.
This chore will check for nodes who have not successfully been contacted
in some set time and DQ them.
There are some new flags for toggling DQ of stray nodes and the timeframes
for running the chore and how long nodes can go without contact.
Change-Id: Ic9d41fdbf214736798925e728245180fb3c55615
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
It turns out that alpine dropped support/updates for the aarch64 image.
Instead they have been using the arm64v8 notation for quite a while,
which resulted in breaking our recent aarch64 builds due to missing
dependencies/updates.
Both arches are exactly the same, aarch64 was created originally by Apple
and arm64 by GNU. The backends have been merged by now and the arm64 became
the de facto standard.
Since the Satellite now requires the order encryption functionality (since serial_number table is deprecated) to properly function, we can remove the config flag to turn on/off the feature.
Change-Id: Ie973f72a9a05a81cef9e53dc9c99d22c940c2488
This PR contains the minimum changes needed to stop inserting into the serial_numbers table. This is the first step in completely deprecating that table.
The next step is to create another PR to remove the expiredSerial chore, fix more tests, and remove any other methods on the serial_number table.
Change-Id: I5f12a56ebf3fa4d1a1976141d2911f25a98d2cc3
WHAT:
added brotli compression for wasm files and added copying of those files to static/wasm folder in Dockerfile
WHY:
those files are a part of web worker webpack bundle and I didn't find a way to compress them separately using webpack.
I'm open to any other ideas if they come up
Change-Id: I105cc1582e9816fd9b63052ba48358525c85a164
version
The test-versions test currently takes 1h 40min to run each time. By
running each installation concurrently, hopefully, it will reduce the execution
time for the whole test.
Change-Id: I680c7d9945e982894b11825c9075c167f754e087
We are no longer planning on implementing downtime penalization using
the method described in
docs/blueprints/archive/storage-node-downtime-tracking-deprecated.md.
Now, we are implementing the design described in
docs/blueprints/storage-node-downtime-tracking-with-audits.md.
This change removes the downtime estimation chores from the satellite
core as well as the package satellite/downtime. A future change will
remove the database table.
Change-Id: I1a1d3cf9dceeba36255d25243294865b89925518
WHAT:
POST request to get gateway credentials using access grant.
Put request url to config and use it for request.
WHY:
to show gateway credentials on UI
Change-Id: I15ef43ecdeed69b0961d5796aacb47f36d560b1b
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.
Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.
The new format for RS override values is
"k/o/n-override,k/o/n-override..."
Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
CRDB doesn't like large deletes. While testing in the POC environment we found that deletes on the serial_numbers table could take hours. This change limits deletes to 1000 at a time (configurable) to avoid blocking other queries.
Change-Id: I08455e25db1574579dd4d7b7125a08e9c913dff1
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
Currently our code is only using github version of the code, so there
shouldn't be need for the exception.
Change-Id: I0c6e8a8465ab7b525d4b5d1b29e4e5298384286d
This PR does follwing changes:
1. Change oldest uplink version in the test to v0.35.3
When the test is first created, we decided to support uplink
version starting from v0.17.1, however with many API changes,
older uplinks are not usable with latest version of the network
anymore. One of the reasons being older uplinks uses deprecated
endpoint. Therefore, we will change the oldest uplink version to
the one that's using only new endpoints.
2. Disable tls certificate verification in uplink
3. Use storj-sim version control server instead of production one
4. Skip uplink version v1.3.x due to bug in that release
Change-Id: I926a6bb9829cb7181ee752437cdcb67e59197fe0
This PR fixes below issues:
1. remove concurrent installation for various versions
We were doing this to decrease the amount of execution time the versions test.
However, it's returning incorrect exit code when there's an
installation failure.
Right now, we are only installing two versions of `storj-sim` and
the rest are only doing uplink cli installation. The performance of
this test should be hugely impacted by the setup step now.
2. only remove release settings instead of deleting the entire file
uplink CLI is referrencing `private/version` package. Therefore, we
cannot delete it
3. add back `GATEWAY_0_API_KEY` in storj-sim
In order to set up older version of uplink cli, we need access to
the gate way api key.
Change-Id: Ia3c37c197bd007b6e1f7c2bd71adde42181d46f0
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.
RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).
Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
branch
Currently, we are testing previous release version upgrading to latest
master on each master build
However, this behavior is only desired when the test is running on a
release branch.
Change-Id: Iaeb66f44951c9e4934ca3c8316d1e490d7958239
We are moving an error into rejectErr since its preventing storage nodes from being able to settle other orders.
Change-Id: I3ac97c340e491b127f5e0024c5e8bd9f4df8d5c3
We've had bizarre crashes for satellite and it's difficult to debug
stripped binaries. The only binary where stripping is useful, is uplink
cli. This change will increase uplink by 5MB.
Change-Id: I4d1dfd36452063c22e8471d99eec97f6de6167b8
Doing it at the ProcessOrders level was insufficient: the endpoints
make multiple database calls. It was a misguided attempt to only
have one spot enter the semaphore. By putting it in the endpoint
we can not only be sure that the concurrency is correctly limited
but it can be configurable easily.
Change-Id: I937149dd077adf9eb87fce52a1a17dc0afe96f64
Remove usage of --non-interactive flag. It is not provided (and
necessary) by the multitenant S3 gateway anymore.
ACCESS_KEY and SECRET_KEY env vars are not provided anymore as they are
not generated by the multitenant S3 gateway.
Change-Id: I3ecfb92110e31ae13977de3899dad273daae6c1e
We have configlock_test for checking changes so the separate script and
makefile target are not needed.
Change-Id: I2bbc1c21ad849c9b7ec8bba43c0e11e94e04f6a6
This PR adds the following items:
1) an in-memory read-only cache thats stores project limit info for projectIDs
This cache is stored in-memory since this is expected to be a small amount of data. In this implementation we are only storing in the cache projects that have been accessed. Currently for the largest Satellite (eu-west) there is about 4500 total projects. So storing the storage limit (int64) and the bandwidth limit (int64), this would end up being about 200kb (including the 32 byte project ID) if all 4500 projectIDs were in the cache. So this all fits in memory for the time being. At some point it may not as usage grows, but that seems years out.
The cache is a read only cache. When requests come in to upload/download a file, we will read from the cache what the current limits are for that project. If the cache does not contain the projectID, it will get the info from the database (satellitedb project table), then add it to the cache.
The only time the values in the cache are modified is when either a) the project ID is not in the cache, or b) the item in the cache has expired (default 10mins), then the data gets refreshed out of the database. This occurs by default every 10 mins. This means that if we update the usage limits in the database, that change might not show up in the cache for 10 mins which mean it will not be reflected to limit end users uploading/downloading files for that time period..
Change-Id: I3fd7056cf963676009834fcbcf9c4a0922ca4a8f
WHAT:
notification bar added to project dashboard page. It is shown when projects count limit is reached.
Create project button is removed after creating last available project
WHY:
inform user that their projects count limit was reached
Change-Id: If0d67148003be40cc9eb4d8b25cc17f8204008d4
This PR updates `uplink rb --force` command to use the new libuplink API
`DeleteBucketWithObjects`.
It also updates `DeleteBucket` endpoint to return a specific error
message when a given bucket has concurrent writes while being deleted.
Change-Id: Ic9593d55b0c27b26cd8966dd1bc8cd1e02a6666e
To prevent longlived unused connections, set the maximum time to 30 minutes to
prevent proxies and loadbalancers forcefully cutting the connection.
This helps in scenarios with low load/requests to a DB.
Change-Id: I7dba15ef97f6f6541e872a6fb1d3a9bbbfe5bb50
services
This PR adds a limiter on the amount of concurrent objects deletion can be handled so
we don't run out of memory.
Change-Id: Id2ce368af6f86845fcdfd34cb2f5e460efe9b272
Adds AuditHistory{WindowSize, TrackingPeriod, GracePeriod,
OfflineThreshold}. These values will be used to track offline audits over
time, and to suspend/disqualify nodes for being offline for too long.
Change-Id: I05f7dbc3c034bdc53c4fbd7719c71a44f37ec6a5
This adds a config flag orders.window-endpoint-rollout-phase
that can take on the values phase1, phase2 or phase3.
In phase1, the current orders endpoint continues to work as
usual, and the windowed orders endpoint uses the same backend
as the current one (but also does a bit extra).
In phase2, the current orders endpoint is disabled and the
windowed orders endpoint continues to use the same backend.
In phase3, the current orders endpoint is still disabled and
the windowed orders endpoint uses the new backend that requires
much less database traffic and state.
The intention is to deploy in phase1, roll out code to nodes
to have them use the windowed endpoint, switch to phase2, wait
a couple days for all existing orders to expire, then switch
to phase3.
Additionally, it fixes a bug where a node could submit a bunch
of orders and rack up charges for a bucket.
Change-Id: Ifdc10e09ae1645159cbec7ace687dcb2d594c76d
Jira: https://storjlabs.atlassian.net/browse/USR-822
This the last step of dropping these 2 db tables. It also deletes all
code associate with them.
Change-Id: I8be840dc2a7be255cf6308c9434b729fe4d9391e
Add a config so that some percent of users require credit cards /
account balances
in order to create a project or have a promotional coupon applied
UI was updated to match needed paywall status
At this point we decided not to use a field to store if a user is in an
A/B
test, and instead just use math to see if they're in a test. We decided
to use MD5 (because its in Postgres too) and User UUID for that math.
Change-Id: I0fcd80707dc29afc668632d078e1b5a7a24f3bb3
Removes old project_bandwidth_rollups records that are no longer used.
Uses a retain months configuration to determine how many months to save. Current month cannot be removed.
Tests retainMonths=-1, 0, 2
Change-Id: Ia4be2546cdb28802427acf41ecd85ad66df3e62c
WHAT:
GTM added for partnered satellites sign up pages
csp values were extended to make GTM work at all:
1. googletagmanager.com for GTM script
2. google-analytics.com for GA script
3. hash was added to avoid using 'unsafe-inline' value in 'script-src' directive
Also config flag for GTM id was added
WHY:
Marketing team needs GTM and GA for their campaigns
Change-Id: Ibb2ace737feb971dda6c191599d479fe4a7af332
When a request comes in on the satellite api and we validate the
macaroon, we now also check if any of the macaroon's tails have been
revoked.
Change-Id: I80ce4312602baf431cfa1b1285f79bed88bb4497
As the tables that get cleaned up by this job get a lot of inserts and deletes over the course of a day, the autovacuum process on PostgreSQL struggles fairly easily/quickly.
Due to its limitation, it can only delete 180,000,000 tuples in one go, before it has to rescan the entire table/index.
With the current load, the most busy satellites accumulate about 1,000,000,000 tuples per day (consumed_serials). With our current 24h interval that results in ~6-7 scans, slowing the entire database down for a quite long time.
This PR reduces the interval to 4 hours, which under a constant load, results in less than 180,000,000 entries per run.
That way, we do not scan twice for only a small gain over said amount. Reducing the interval further would also increase the DB load unnecessary, as each run scans the entire tables at least once.
For future reference, we might need to adjust the interval, if the load is significantly changing.
Change-Id: I18fdd45d93d468cff126e719c8380c29a49f43dd
We removed the ability to change satellite address in for uplink cli v1.6.3. Therefore we will skip that version for now
Change-Id: I0f78cfe27b51fd6cad571636ba38266f5f672d58
also remove the continuation support from the queue, otherwise
we may end up sequential scanning the entire table to get
a few rows at the end.
then, in the core, instead of looping both to get a big enough
batch inside of the queue, as well as outside of it to ensure
we consume the whole queue, just get a single batch at a time.
also, make the queue size configurable because we'll need to
do some tuning in production.
Change-Id: If1a997c6012898056ace89366a847c4cb141a025
WHAT:
1. updated verification page URL in config
2. added list of partnered satellites to config
3. added logic for satellites dropdown on new signup/login pages
WHY:
1. signup/login flow was reworked in tardigrade.io repo (iframe removed, new pages etc.)
2. new config flag was added to check if satellite name matches at least one member of partnered satellites list to redirect user to verification page
3. new pages will have dropdown with partnered satellites list. Appropriate logic was added.
Change-Id: I33399ab66ca31f07b297a433f6b1f41da4cb6e66
The current satellite config lock code relies on bash scripts and
gnu diff, it must be run as root and hence it typically requires
docker. The old version will be removed at a later date..
I tried for several hours to run directly against cmdSetup() in
cmd/satellite/main.go, to avoid the ctx.Compile() call. I had no
luck.
Change-Id: I0a4888421e743b436d32b6af69d04759d7816751
Also distinguish the purpose for selecting nodes to avoid potential
confusion, what should allow caching and what shouldn't.
Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61
Add a flag that allows us to easily switch disqualification from
suspension mode on or off. A node will only be disqualified from
suspension mode if it has been suspended for longer than the grace
period AND the SuspensionDQEnabled flag is true.
Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792
This catches the bug that hit prod where we missed that a migration
was necessary for a column, but only on uploads, and only for nodes
that had not checked in yet.
So, this does more uploads, and stops a node from starting so that
we trigger that specific scenario. Hopefully it will catch other
similar ones in the future.
I confirmed that this locally caught the bug when the release under
test was v0.34.10 and HEAD did not include the fix for it.
Change-Id: If7d41e8241d6a042fa524b4aff956b0264ecb128
Previously we are using tracing.sampled to be the switch for turning on/off tracing.
However we would like to separate sampling rate from being the switch,
so we can set sampling rate to be 0 but still intialize tracing for
satellite and storagenodes
Change-Id: I27e6ba25ea6f6b612b4e1a57cf1301889ded41ec
Added a per IP rate limiter to the console web.
Cleaned up password check to leak less bcyrpt info.
Change-Id: I3c882978bd8de3ee9428cb6434a41ab2fc405fb2
If a node is suspended and receives an unknown or failing audit,
disqualify them if the grace period (default 1w in production) has
passed.
Migrate the nodes table so any node that is currently suspended gets
unsuspended when the satellite starts up.
Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue
Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
Alpha=1 and beta=0 are the expected first values for any alpha/beta
reputation system we are using in the codebase. So we are removing the
configurability of these values.
Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3
We want to increase our throughput for downtime estimation. This commit
adds the ability to reach out to multiple nodes concurrently for downtime
estimation. The number of concurrent routines is determined by a new config
flag, EstimationConcurrencyLimit. It also increases the default
EstimationBatchSize to 1000.
Change-Id: I800ce7ec1035885afa194c3c3f64eedd4f6f61eb
BeginObject response
We want to control inline segment size and segment size on satellite
side. We need to return such information to uplink like with redundancy
scheme.
Change-Id: If04b0a45a2757a01c0cc046432c115f475e9323c
Previous split to a storj.io/private repository broke tag-release.sh
script. This is the minimal temporary fix to make things work.
This links the build information to specified variables and sets them
inline. This approach, of course, is very fragile.
Change-Id: I73db2305e6c304146e5a14b13f1d917881a7455c
* add script for easy rolling upgrade test local execution
* remove unneeded binaries building for rolling upgrade and versions
tests
* unify build process for Jenkins and local execution for rolling
upgrade and versions tests
Change-Id: Ic11211b83f3f447494bbd5827d2af77ea4b20dfe
Add flag to satellite repairer, "InMemoryRepair" that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read from in the
upload phase of repair.
This should help with handling high repair traffic on satellites that
cannot afford to spend 64mb of memory per repair worker.
Updates tests to test repair for both in memory and to disk.
Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194
this is going to make all the tests slower but it is what it is
test-sim-aws.sh is removed because it was moved to storj/gateway repo.
Change-Id: I10727e747a4c3740b1c9054ce7d17313b4fa310b
1. only run release tags that don't contain 'rc'
2. install gateway version that's the same as satellite
3. update gateway access to contain satellite id
Change-Id: I8ca1418302c3aafdf0c4eaaf8361422a1eec2bd4
Now that we are trying to identify the root cause of the satellite load limitations (i.e. currently the satellite has a max ability of 400 rps for uploads and we need this to be higher), we are using the golang diagnostic tools to collect insight into what the bottlenecks are. We currently have a debug endpoint to gather some cpu and mem data, but it could be useful to have continuous profiling. GCP stackdriver has support for continuous profiling so lets set that up and see if it is helpful to gather more data.
This PR adds support for [GCP continuous profiler](https://cloud.google.com/profiler) which allows enabling continuous cpu/mem profiling and the stats are sent to stackdriver in google cloud console.
To enable the continuous profiling for a storj component, do the following:
- prereq: the workload must be running in GKE and have Stackdriver Profiling IAM role permissions
- provide the config flag `debug.profilename` in the config.yaml file for the workload (i.e. satellite api process, etc). The profilename should be the workload name, for example "satellite-api".
- once the above config flag is provided, the profiler will be initialized and profiling stats will automatically be sent to GCP project where the workload is running and viewable in the Stackdriver Profile page in the console
The current implementation assumes the workload is running in GKE, however if we find if useful we can add support to enable this from anywhere. But for simplicity, its configured this way assuming the main goal is to enable in production systems.
Change-Id: Ibf8ebe2df7bf06fdd4951ee6a1e48854dd36ad47
This new repair timeout (configured as TotalTimeout) will include both
the time to download pieces and the time to upload pieces, as well as
the time to pop the segment from the repair queue.
This is a move from Github PR #3645.
Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8
This change adds two new tables to process orders as fast as we used
to but in an asynchronous manner and with hopefully less storage
usage. This should help scale on cockroach, but limits us to one
worker. It lays the groundwork for the order processing pipeline to
be queue rather than database driven.
For more details, see the added fast billing changes blueprint.
It also fixes the orders db so that all the timestamps that are
passed to columns that do not contain a time zone are converted to
UTC at the last possible opportunity, making it less likely to use
the APIs incorrectly. We really should migrate to include timezones
on all of our timestamp columns.
Change-Id: Ibfda8e7a3d5972b7798fb61b31ff56419c64ea35
rationale: if GC kills the satellite, it would be nice to make
it through a repair checker sweep first
Change-Id: Id56171dc8e13940cfb6481e36a910bad077a01ed
rate
Graceful exit is very slow at the moment. Over the last couple days we
increase the batch size on Stefans satellite to 1000 but as a side
effect the error rate was increased. With a batch size of 500 the error
rate looks stable.
This PR will increase the default to batch size to 300. Graceful exit
will still be painful slow but at least it will be a bit faster. At the
same time this PR also increases the number of errors we tolerate. We
don't want to DQ slow storage nodes just because they didn't finish all
300 transfers in time. We want to give them more retries.
Change-Id: I92e3f99e116d4988457d8b902a88e85ed1bcc1a7
Currently SNs report their free disk space once per hour. If a node
becomes full, it has to wait until the next contact cycle begins to
report; all the while receiving and failing upload requests. By increasing
the minimum required disk space, we can give the storage nodes more time
to report their space before the completely fill up. This change goes
hand-in-hand with another change we want to implement: trigger capacity
report on SN immediately upon falling below threshold.
Change-Id: I12f778286c6c3f582438b0e2949765ac43325e27
satellite api during rolling upgrade test
The old api is using the same config file as the new satellite in the
rolling upgrade test, so we need to set it to something different so
that there is no conflict when we spin up a new storj-sim instance while
the old api is running concurrently.
Change-Id: Ia4ec2db4953f36f43275495710992831ad3916a2
Control Panel allows to control different chores and services.
Currently this adds controlling of cycles.
Change-Id: I734f1676b2a0d883b8f5ba937e93c45ac1a9ce21
Allow rate limit project cache to expire so we can make project level rate limit changes without restarting the satellite process.
Change-Id: I159ea22edff5de7cbfcd13bfe70898dcef770e42
For the last few month we had no issues with order submission. I would
call it stable and now it is time to risk a lower expire time. This will
increase the database performance on the satellite and it will reduce
the delay for billing.
The long term goal is 6h but for that step we need to change graceful
exit first. At the moment storage nodes would get disuqlaified for not
transfering alle pieces in less than 6 hours.
Change-Id: I421a2c2421c5374c4e706e2338f1c2161fedc14c
With this change RS configuration will be set on satellite. Uplink with
get RS values with BeginObject request and will use it. For backward
compatibility and to avoid super large change redundancy scheme stored
with bucket is not touched. This can be done in future.
Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff
Updates config migration to occur for any v0.30.x release rather than
specifically 30.4
Also updates the config for the rolling upgrade test to use 64 kib
segments, and use smaller files for the final upload of rolling upgrade.
Change-Id: I941f77fe2b9011b45f28a5f3a2430e882d2ae6b3
Limits how many times metainfo APIs can be called per second by project ID. If limit is exceeded, the API will return Unauthorized/Too Many requests.
Limit per second and the size of the limiter cache per project are configurable, as well as whether the limiter is enabled.
Tests added/updated for the new rate_limit field in projects table.
Tests added for exceeding limits and disableing limiter.
Change-Id: Ic8ad102de3b690a475809d4f684156d5715f20fa
Fix uplink setup step for uplink versions that requires an access field.
Update how script selects uplink versions to test.
Use significantly smaller remote files for test (performance).
Change-Id: If590b8798767e2a0621fb84cd3b8852d02f6d1da
live accounting used to be a cache to store writes before they are picked up during
the tally iteration, after which the cache is cleared. This created a window in which
users could potentially exceed the storage limit. This PR refactors live accounting to
hold current estimations of space used per project. This should also reduce DB load
since we no longer need to query the satellite DB when checking space used for limiting.
The mechanism by which the new live accounting system works is as follows:
During the upload of any segment, the size of that segment is added to its respective
project total in live accounting. At the beginning of the tally iteration we record
the current values in live accounting as `initialLiveTotals`. At the end of the tally
iteration we again record the current totals in live accounting as `latestLiveTotals`.
The metainfo loop observer in tally allows us to get the project totals from what it
observed in metainfo DB which are stored in `tallyProjectTotals`. However, for any
particular segment uploaded during the metainfo loop, the observer may or may not
have seen it. Thus, we take half of the difference between `latestLiveTotals` and
`initialLiveTotals`, and add that to the total that was found during tally and set that
as the new live accounting total.
Initially, live accounting was storing the total stored amount across all nodes rather than
the segment size, which is inconsistent with how we record amounts stored in the project
accounting DB, so we have refactored live accounting to record segment size
Change-Id: Ie48bfdef453428fcdc180b2d781a69d58fd927fb
this commit introduces the reported_serials table. its purpose is
to allow for blind writes into it as nodes report in so that we have
minimal contention. in order to continue to accurately account for
used bandwidth, though, we cannot immediately add the settled amount.
if we did, we would have to give up on blind writes.
the table's primary key is structured precisely so that we can quickly
find expired orders and so that we maximally benefit from rocksdb
path prefix compression. we do this by rounding the expires at time
forward to the next day, effectively giving us storagenode petnames
for free. and since there's no secondary index or foreign key
constraints, this design should use significantly less space than
the current used_serials table while also reducing contention.
after inserting the orders into the table, we have a chore that
periodically consumes all of the expired orders in it and inserts
them into the existing rollups tables. this is as if we changed
the nodes to report as the order expired rather than as soon as
possible, so the belief in correctness of the refactor is higher.
since we are able to process large batches of orders (typically
a day's worth), we can use the code to maximally batch inserts into
the rollup tables to make inserts as friendly as possible to
cockroach.
Change-Id: I25d609ca2679b8331979184f16c6d46d4f74c1a6
We want to make using uplink as easy as possible. That's why we wan't to
avoid requiring setup or import command before normal usage if user
specified --access flag. If this flag is set then rest flags should be
set as defaults.
https://storjlabs.atlassian.net/browse/V3-3490
Change-Id: I95a7bd77a3f00b8d9981fee513e9e77aef298bca