This change requires less work from the user of the piecedeletion
service by moving the overlay database call into the package.
Change-Id: I14a150ab71fe885780e7a7a74db006a779507ae5
This adds the unimplemented GetObjectIPs method to the metainfo endpoint so
we can import new common protobuf definitions.
Change-Id: I154f26baccb6bb3c66de3eb25611930545c9754b
When investigating a gap in storage usage data in the SN dashboard, I noticed that there were 2 entries in the accounting_rollups table on the date of the gap.
This change accounts for multiple entries in the accounting_rollups table for a given day.
Change-Id: Ibf2b5d0455117cb0417163e8fcfb7e509d594171
It's an obsolete table from an earlier state of the Stripe invoices
implementation. No code is currently using it. It is confirmed that this
table is currently empty across all satellites.
Change-Id: I12d2756578faf8418ea8f3b09088e885694b8925
Small extension to the test case where one partner uploads/downloads
to/from the same bucket created by another partner.
Change-Id: Ib674fe5f95f868b71341e30aba5e2440847738f4
Use new objectdeletion package for deleting pointers.
In the best case scenario, it will make one database call to fetch
information about the number of segments, and another request to delete
and fetch information about the other segments.
This PR also changes our object deletion API to return no error when an
object is not found and to instead consider such an operation a success.
This behavior is aligned with the S3 API and makes the code less complex.
Change-Id: I280c56e8b5d815a8c4dafe8227689467e899775a
Adds AuditHistory{WindowSize, TrackingPeriod, GracePeriod,
OfflineThreshold}. These values will be used to track offline audits over
time, and to suspend/disqualify nodes for being offline for too long.
Change-Id: I05f7dbc3c034bdc53c4fbd7719c71a44f37ec6a5
This change removes the overlay function FindStorageNodesForRepair,
which skips using the node selection cache and hits the database
directly. Otherwise, it is functionally identical to
FindStorageNodesForUpload, which checks the node selection cache first.
When selecting nodes for PUT_REPAIRs, we now call
FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce
database load.
Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e
This adds a config flag orders.window-endpoint-rollout-phase
that can take on the values phase1, phase2 or phase3.
In phase1, the current orders endpoint continues to work as
usual, and the windowed orders endpoint uses the same backend
as the current one (but also does a bit extra).
In phase2, the current orders endpoint is disabled and the
windowed orders endpoint continues to use the same backend.
In phase3, the current orders endpoint is still disabled and
the windowed orders endpoint uses the new backend that requires
much less database traffic and state.
The intention is to deploy in phase1, roll out code to nodes
to have them use the windowed endpoint, switch to phase2, wait
a couple days for all existing orders to expire, then switch
to phase3.
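For illustration, a minimal sketch of how such a three-valued flag could be
modeled in Go (type and constant names here are assumptions, not the actual
implementation):

    package orders

    import "fmt"

    // WindowEndpointRolloutPhase models the rollout phase as a
    // flag-friendly string type. (Hypothetical sketch.)
    type WindowEndpointRolloutPhase string

    const (
        WindowEndpointRolloutPhase1 WindowEndpointRolloutPhase = "phase1"
        WindowEndpointRolloutPhase2 WindowEndpointRolloutPhase = "phase2"
        WindowEndpointRolloutPhase3 WindowEndpointRolloutPhase = "phase3"
    )

    // Set validates and stores the flag value.
    func (p *WindowEndpointRolloutPhase) Set(s string) error {
        switch v := WindowEndpointRolloutPhase(s); v {
        case WindowEndpointRolloutPhase1, WindowEndpointRolloutPhase2, WindowEndpointRolloutPhase3:
            *p = v
            return nil
        default:
            return fmt.Errorf("invalid rollout phase %q", s)
        }
    }

    // String implements flag.Value together with Set.
    func (p WindowEndpointRolloutPhase) String() string { return string(p) }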
Additionally, it fixes a bug where a node could submit a bunch
of orders and rack up charges for a bucket.
Change-Id: Ifdc10e09ae1645159cbec7ace687dcb2d594c76d
Jira: https://storjlabs.atlassian.net/browse/USR-822
This is the last step of dropping these 2 db tables. It also deletes all
code associated with them.
Change-Id: I8be840dc2a7be255cf6308c9434b729fe4d9391e
* Do not swap the active audit queue with the pending audit queue until
the active audit queue is empty.
* Do not begin creating a new pending audit queue until the existing
pending audit queue has been swapped to the active queue.
Change-Id: I81db5bfa01458edb8cdbe71f5baeebdcb1b94317
Add a config so that some percent of users require credit cards /
account balances in order to create a project or have a promotional
coupon applied.
UI was updated to match the needed paywall status.
At this point we decided not to use a field to store whether a user is
in an A/B test, and instead just use math to see if they're in a test.
We decided to use MD5 (because it's in Postgres too) and the user UUID
for that math.
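For illustration, a minimal sketch of that bucketing, assuming the Go side
mirrors the Postgres expression (the digest byte and modulus used here are
assumptions):

    package console

    import (
        "crypto/md5"

        "storj.io/common/uuid"
    )

    // inPaywall deterministically places a user into the paywalled
    // percentage using the MD5 of the user UUID. (Hypothetical sketch;
    // the real bucketing math may differ.)
    func inPaywall(userID uuid.UUID, percent int) bool {
        sum := md5.Sum(userID[:])
        // map the first digest byte (0..255) onto 0..99 and compare
        return int(sum[0])*100/256 < percent
    }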
Change-Id: I0fcd80707dc29afc668632d078e1b5a7a24f3bb3
It feels weird having a repairer configuration as part of the orders
service. Let's have a single source of truth for it.
Change-Id: I24f7c897aec80f3293f8af24876cbb6733d85a0b
Inside CreateGetRepairOrderLimits we pass in a list of healthy pieces,
but when we query node info from this list we apply the "reliable" filter
again. We sometimes end up with nodes which at first were healthy but then
became unhealthy, and whose pieces can thus be repaired, but we do not
update the 'unhealthyPieces' list with these nodes.
This causes an error, 'piece to add already exists', as we fail to remove these
pieces from the pointer before replacing them with repaired pieces.
Change-Id: I6e2445f342ac117ded30351fa7e5e523c9ec26bd
Jira: https://storjlabs.atlassian.net/browse/USR-822
The balance history in the Satellite GUI displays the deposit bonuses as
separate rows. These bonuses used to be stored in the satellite DB. We
recently started depositing the bonus directly to the Stripe balance and
migrated old bonuses to Stripe metadata.
This change displays all billing history entirely from Stripe, so we can
remove the `credits` and `credits_spendings` DB tables in a next step.
Change-Id: I14c304c66ec47c6a51f5b8508f11470cf36c4e24
There's still a possibility of tests clashing due to the shared mock,
however it's slightly better, because it avoids the race.
Change-Id: I80eedf1ca50b6114ebe69ea3c4d61176452f4df0
Removes old project_bandwidth_rollups records that are no longer used.
Uses a retain-months configuration to determine how many months to keep. The current month cannot be removed.
Tests retainMonths=-1, 0, 2
Change-Id: Ia4be2546cdb28802427acf41ecd85ad66df3e62c
Jira: https://storjlabs.atlassian.net/browse/USR-968
We want to keep track of the STORJ amount and exchange rate in the
metadata of Stripe Customer Balance Transaction to be able to generate
reports without the need of requesting CoinPayments for this info.
Change-Id: Ia93af95706cd2312cf688f044874495279fe8fa2
I introduced a bug with https://review.dev.storj.io/c/storj/storj/+/2216,
because the log change allowed insert to be called multiple times.
This changes the insert logic to do nothing if the PK already exists.
Change-Id: I90d192a0f6619bfbb360ea104066f00a3348f6dd
Improve our delete logic to require fewer database requests.
This PR creates a new objectdeletion package.
Change-Id: I0500173bb9b8c771accb350f076329ede6dbb42c
request
We are no longer using `BeginDeleteSegment` or `ListSegments` so we can
avoid generating StreamID as a result of `BeginDeleteObject`.
StreamID from `BeginDeleteObject` is also not used on Uplink side.
Change-Id: I3b068deab17068459849b5cf05811cad4b8a9034
We are adding a monkit evaluation for the total sum of data stored on
the nodes before it is inserted into the database. This will give us a
time-series history of total data stored so we can see it change over
time.
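Sketched with monkit (the metric name below is an assumption):

    package tally

    import (
        monkit "github.com/spacemonkeygo/monkit/v3"
    )

    var mon = monkit.Package()

    // saveTallies records the total bytes stored across all nodes
    // before persisting the per-node values. (Sketch; names assumed.)
    func saveTallies(totalStored int64, persist func() error) error {
        mon.IntVal("total_nodes_data_stored").Observe(totalStored)
        return persist()
    }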
Change-Id: I41145a2d7a09c8e63b42ae578bd081035b60e529
To prevent creating multiple users with the same email via the API, we should check for an existing user with the given email.
Change-Id: Ie35b85c4f94a7ca72d42951dab8ff475d7f0dd7c
Currently a customer created via the API does not get a payment account until they sign in.
That causes issues if the account needs to be deleted again.
Change-Id: I393c8f301e426301bb713c423d6ce011138d4ae4
This change switches the backend logic to use the new DB column on the users table to restrict project creation.
Furthermore it backfills the existing limits from registration tokens to the new column to ensure no users are reset to the new default.
UI is updated to reflect the ability to create several projects.
Change-Id: Ie29157430ae6b065411ca4c4557c9f1be69cdc4f
the flush batch size was set to 1 which means that a flush was
async scheduled after the first write. the explicit trigger wait
was then always flushing nothing, and the test would only
pass if the async flush was scheduled before the read.
remove that async flush and pause the flush loop so that we are
in full control of when the flushes happen so there are no races.
the tests are still disabled but that's because the endpoint is
still disabled.
Change-Id: I2b7b07fd5525388c30be8efbf4af7105087228da
We passed in revocationDB and metainfoDB for no reason.
Let's remove them from the dependency list to further reduce the footprint.
Change-Id: Ic0317bb92670fbd305d4a8b0ed1cb82858e2f6d3
Why: We need a way to cut down on database traffic due to bandwidth
measurement and tracking.
What: This changeset is the Satellite side of settling orders in 1 hr windows.
See design doc for more details: https://review.dev.storj.io/c/storj/storj/+/1732
Change-Id: I2e1c151e2e65516ebe1b7f47b7c5f83a3a220b31
What:
Use the github.com/jackc/pgx postgresql driver in place of
github.com/lib/pq.
Why:
github.com/lib/pq has some problems with error handling and context
cancellations (i.e. it might even issue queries or DML statements more
than once! see https://github.com/lib/pq/issues/939). The
github.com/jackc/pgx library appears not to have these problems, and
also appears to be better engineered and implemented (in particular, it
doesn't use "exceptions by panic"). It should also give us some
performance improvements in some cases, and even more so if we can use
it directly instead of going through the database/sql layer.
Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044
STORJ_POSTGRES_TEST naming was not consistent with STORJ_SIM_POSTGRES.
This allows using STORJ_TEST_POSTGRES for clarity; it still has a
fallback to STORJ_POSTGRES_TEST.
Change-Id: I6f294c66c80fcfd6750fea2a89795f3b7f5dd691
This runs each benchmark for one iteration to ensure that they are
valid. Unfortunately, it does not give any useful metrics as output.
Change-Id: I68940398c8dd849aed656bd12656f48d5df10128
This system tracks an abstract "api version" from nodes based on
their usage, allowing us to have latching behavior where if a node
ever uses a new api, it can be blocked from using the old api.
This is better than using self-reported semver version information
because the node cannot lie, there's no confusion about what semver
version implies which features, no questions about dev and ci
environments, and no dependencies between reporting the version
and using the new api.
Change-Id: Ifeced5c9ae8e0a16102d79635e176a7d3bdd8ed4
Apply the coin payments when CoinPayments.net receives the funds,
instead of when Storj gets them from CoinPayments.net.
Based on 7/1/20 User Growth standup guidance by JG
Relates to: https://storjlabs.atlassian.net/browse/USR-801
Change-Id: I174ca23a585010f39464c45525e1dfe0179b7c1a
Allow more than 25 coin payment statuses to be queried from coinpayments.net.
Add some slight logging, a coinpayments duration metric, and a disabled test.
These changes are small changes in support of https://storjlabs.atlassian.net/browse/USR-801
Change-Id: I5b176cdd5e513d8bd88b92f9b22a8bd2456bbdd5
Since we increased the number of concurrent audit workers to two, there are going
to be instances of a single node being audited simultaneously for different segments.
If the node times out for both, we will try to write them both to the pending audits
table, and the second will return an error since the path is not the same as what
already exists. Since with concurrent workers this is expected, we will log the
occurrence rather than return an error.
Since the release default audit concurrency is 2, update testplanet default to run with
concurrent workers as well.
Change-Id: I4e657693fa3e825713a219af3835ae287bb062cb
This adds EncryptionKey definition that can be used as a flag.
These order.EncryptionKey values will be used to encrypt data in
order limits.
This helps to avoid storing lots of transient data in the
main database.
This code doesn't yet contain encryption itself.
Change-Id: I2efae102a89b851d33342a0106f8d8b3f35119bb
Use a field to distinguish migration steps that need to use a
different transaction from previous steps. This is clearer than
using a func.
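A sketch of what such a step could look like (the SeparateTx field name is
taken from this change; the surrounding shape and SQL are illustrative):

    package satellitedb

    import "storj.io/storj/private/migrate"

    // exampleStep returns a migration step that runs in its own
    // transaction, separate from the previous step. (Sketch.)
    func exampleStep() migrate.Step {
        return migrate.Step{
            Description: "backfill new column from historical data",
            Version:     104,
            SeparateTx:  true,
            Action: migrate.SQL{
                `UPDATE nodes SET vetted_at = NULL WHERE vetted_at IS NULL`,
            },
        }
    }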
Change-Id: I2147369d05413f3e8ddb50c71a46ab1ba3ab5114
Jira: https://storjlabs.atlassian.net/browse/USR-821
The `migrate-credits` billing command checks the available credits
balance for all users and moves it to the Stripe balance by creating a
new credit balance transaction.
Change-Id: Iafc7b95a4edad47f7c145a22e210f8c821ac183d
WHAT:
GTM added for partnered satellites sign up pages
csp values were extended to make GTM work at all:
1. googletagmanager.com for GTM script
2. google-analytics.com for GA script
3. hash was added to avoid using 'unsafe-inline' value in 'script-src' directive
Also config flag for GTM id was added
WHY:
Marketing team needs GTM and GA for their campaigns
Change-Id: Ibb2ace737feb971dda6c191599d479fe4a7af332
When a request comes in on the satellite api and we validate the
macaroon, we now also check if any of the macaroon's tails have been
revoked.
Change-Id: I80ce4312602baf431cfa1b1285f79bed88bb4497
Make some minimal improvements in the README about the API documentation
regarding the required requests and successful response bodies.
Change-Id: If7832f3c40166a55d9baefdb2395211ff9e8dc04
add new columns `offline_suspended` and `under_review` to the nodes table.
`unknown_audit_suspended` is a new column which will replace `suspended`.
Change-Id: I22ddeb338ea0ff63f14332a7ebd0f3e9e4c06cdc
We should not be sending any type of orders to nodes that have completed
graceful exit with the current satellite. In particular, we should not
be trying to audit them, because that would be silly.
Change-Id: Ie2153e5739914ab696feefcdef28545ed70f84e4
WHAT:
credit history page implemented.
can be visited by clicking a specific button in the free credits dropdown.
WHY:
UI didn't display remaining coupon value.
coupons and referral items (in future) are displayed in the same place.
Change-Id: I495fd7a99f2ea5117152aaf8f495bd5322f02588
As the tables that get cleaned up by this job get a lot of inserts and deletes over the course of a day, the autovacuum process on PostgreSQL struggles fairly easily/quickly.
Due to its limitation, it can only delete 180,000,000 tuples in one go, before it has to rescan the entire table/index.
With the current load, the most busy satellites accumulate about 1,000,000,000 tuples per day (consumed_serials). With our current 24h interval that results in ~6-7 scans, slowing the entire database down for a quite long time.
This PR reduces the interval to 4 hours, which under a constant load, results in less than 180,000,000 entries per run.
That way, we do not scan twice for only a small gain over said amount. Reducing the interval further would also increase the DB load unnecessarily, as each run scans the entire tables at least once.
For future reference, we might need to adjust the interval, if the load is significantly changing.
Change-Id: I18fdd45d93d468cff126e719c8380c29a49f43dd
This test failed due to a timeout on a download which is supposed to
succeed. The testplanet default for the value is 5 seconds, but here
it is 500 milliseconds.
It looks like this is due to the fact that later in the test we need to
wait for a slow node to time out, so we cut the timeout shorter to reduce
test time.
This PR increases the timeout to 1 second. Still not too long to wait, but
gives us twice as much time to download, decreasing the likelihood that we
see the timeout error.
Change-Id: I504db39ab5dc4d3c505520337b258265d6da7020
Since we increased the number of audit workers from 1 to 2, we need to make sure
concurrent updates do not trample each other. We can do this by serializing the
transactions.
Change-Id: If1b2f71cabe3c779c12ffa33c0c3271778ac3ae0
WHAT:
1. Deposit & Billing history view was divided to be shown separately as Deposit History and Billing History
2. Datepicker was removed from billing page
WHY:
billing UX enhancements
Change-Id: Ie183849ef0965169997674ce37b71db38a562fc2
This ensures that rows are closed to avoid leaks.
Also verifies that Err() is called, to ensure that no
error is left behind.
Change-Id: Idd1bec9bf479f40021da67b2c80ce83033149469
by project members
This is a fix for listing the same project twice when a project has
more than one member.
Change-Id: I3f6fe3456a6753d6d091a64436c22027dcbe2520
This change adjusts the label for STORJ deposit bonuses in billing
history to be more consistent with other labels.
Change-Id: I5e7179ae3ac52dafb0dcef084e9a7c4742491f9e
The DB query in GetAllocatedBandwidthTotal uses an exclusive range:
'WHERE interval_start > ?'
The value that is used for this condition is the first day of current the month,
00:00:00 UTC.
By using the exclusive '>', we exclude the entire first hour of the month from the
result set.
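The fix, in sketch form (table and column names are illustrative, not the
exact schema):

    // before: WHERE interval_start > ?   -- drops rows at exactly 00:00:00 UTC
    // after:  an inclusive lower bound keeps the first hour of the month
    const getAllocatedTotal = `
        SELECT COALESCE(SUM(allocated), 0)
        FROM bandwidth_usage
        WHERE interval_start >= ?
    `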
Change-Id: I3ed300f5230c7514dc9495a85e8166213cd0842e
this way we don't have to do 2 steps, and by using the ctid, postgres
is going to do two very efficient prefix scans.
Change-Id: Ia9d0546cdf0a1af67ceec9cd508d336a5fdcbdb9
also remove the continuation support from the queue, otherwise
we may end up sequentially scanning the entire table to get
a few rows at the end.
then, in the core, instead of looping both to get a big enough
batch inside of the queue, as well as outside of it to ensure
we consume the whole queue, just get a single batch at a time.
also, make the queue size configurable because we'll need to
do some tuning in production.
Change-Id: If1a997c6012898056ace89366a847c4cb141a025
In Jaeger, it shows that this function gets called repeatedly in
a single request. Most of the time, it takes less than 1ms. Therefore, it
doesn't add much value to our trace but creates noise.
Change-Id: I20234f36bbcf0fc22f91e5e1a5634c0cad577ed0
Jira issue: https://storjlabs.atlassian.net/browse/USR-820
The bonus for depositing STORJ tokens is now added to the Stripe
balance instead of to the `credits` DB table on the satellite.
Existing unspent bonuses in the `credits` DB table are still processed
as usual when generating invoices. They will be migrated to the Stripe
balance with a separate change.
The bonus is added to the Stripe balance with a separate Credit
transaction. The balance transactions for the deposit and the bonus can
be differentiated by their descriptions.
The billing history is modified to list the bonus from the Stripe
transactions list.
The workflow for depositing STORJ tokens to the Stripe balance is
improved to survive failures in the middle of the process.
Change-Id: I6a1017984eae34e97c580f9093f7e51ca417b962
Jira issue: https://storjlabs.atlassian.net/browse/USR-110
When generating invoices, if a coupon is not fully spent, but this is
its last valid month, it will be marked as expired in the database.
Change-Id: Ib1b2375f6703f9514d1ae089a387aa029b62d1fb
struct
This change removes ProjectID from the code. The next change will be
about dropping this column from the DB table.
Change-Id: Idb949e2829e2c304a2b6b011259c7cc7667082e1
the test as written fails if it's the last day of the month because
the two tallies that are added span two different periods.
Change-Id: I294397278508035ce640500fc3b1b0ffd82b9b71
projects
Currently, to generate invoices we are listing all projects. This is
problematic because we decided that elements like coupons should be
calculated per user; we want more user-oriented processing. With this
change, to prepare invoice elements we are:
* listing Stripe customers
* listing projects for each customer
* processing projects to create project records
* calculating coupons per customer
* calculating credit spending per customer
Change-Id: I48b0063e44e4d301f97c87ecffb342649359968a
WHAT:
1. updated verification page URL in config
2. added list of partnered satellites to config
3. added logic for satellites dropdown on new signup/login pages
WHY:
1. signup/login flow was reworked in tardigrade.io repo (iframe removed, new pages etc.)
2. new config flag was added to check if satellite name matches at least one member of partnered satellites list to redirect user to verification page
3. new pages will have dropdown with partnered satellites list. Appropriate logic was added.
Change-Id: I33399ab66ca31f07b297a433f6b1f41da4cb6e66
Adds monkit tracing for ecrepairer.downloadAndVerifyPiece and
ecrepairer.putPiece so we can get more accurate estimates of node
performance during repair.
Change-Id: Ic05025bf3c493bb3d6f5d325d090c5b7c9e5465d
This will speed up the Put step of repair by not waiting to time out for
a handful of slow nodes, at the expense of a slightly less durable
pointer. It will still repair to the optimal threshold, but not every
node that is selected will end up in the pointer.
Change-Id: I02a0658e3fe6fc0383f26af0f50a065b8b11a651
the initial calculations for the historical values of comp_at_rest
were wrong. because our historical data only included total amounts
as well as compensation for bandwidth, the at rest value was
calculated as
at_rest = total - bandwidth
unfortunately, that calculation did not take surge pricing into
account correctly. the at rest and bandwidth values do not
include surge pricing, but the total that was used did. so what
we actually calculated was
no_surge_at_rest = surge_total - no_surge_bandwidth
which will create a value that is too large. this migration
fixes the calculation for imports that are old enough and
of a non-negligible difference.
Change-Id: I61eb0b670510f6d7fb8fc3de39ba79150fac10eb
..before they are transferred to another node and submitted to the
satellite as successful piece transfers, because if we submit an invalid
signature, the node will be marked as a cheater and disqualified
immediately.
These signatures should have been validated when the piece was
originally stored, but bitrot does happen and needn't be cause for an
immediate DQ.
Change-Id: I8b0ebd5812ea8a2e60766005b7251fbb74ef7857
Some satellites do not have payment configured (e.g. Salt Lake, Europe North). In this case the StripeMock is used, which returns nil on the invoice and charges List methods. This causes a panic.
https://storjlabs.atlassian.net/browse/SM-978
Change-Id: Iec1b0bfd9b383e6f793d03dd63a3dec60e0fd9f3
Jira issue: https://storjlabs.atlassian.net/browse/USR-719
Invoices now show the amount of used resources and their unit price.
This change also makes proper rounding to the nearest cent in a few
places to resolve the "off-by-one-cent" issue observed in a few invoices.
Change-Id: I2d70d6916b5cf4a9a9138c99c422f5f4d2deb35b
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration
Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
Every now and then we see the repair error, "piece to add already exists".
With these new logs we should be able to verify if it is due to a change in
the pointer. These logs are only temporary.
Change-Id: I029390cc4816668707546df14ed2cfe7ca192b0b
This attempts to add a README.md to help create consistent migrations
that maximize our test coverage and do not include unnecessary
statements.
It also adds a feature to have an `-- OLD DATA --` section as well
as a `-- NEW DATA --` section so that we can fix mistakes made in
previous snapshots (like a row that was forgotten to be added when a
table was created) without editing them going forward.
Change-Id: I28a786f8ef163cae1de1bb08f61af1e1104b0a88
What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count
are greater than the configured thresholds), we set vetted_at to the current timestamp.
Why: We may want to use this timestamp in future development to select new vs vetted nodes.
It also allows flexibility in node vetting experiments and allows for better metrics around
vetting times.
Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately
Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and
commits another variable to the db (vetted_at), but this should be negligible.
Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d
Most places now need the NodeURL rather than the ID and Address
separately. This simplifies code in multiple places.
Change-Id: I52621d8ca52296a8b5bf7afbc1001cf8bfb44239
Currently the node selection cache is biased towards the same subnet. This
implements static node selection for distinct subnets, such that it selects
subnets with equal probability rather than node IDs.
This is mostly a copy paste + modifications from previous node selection
state.
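A rough sketch of picking one node per subnet with equal probability across
subnets (shapes assumed; the real selection state is more involved):

    package nodeselection

    import "math/rand"

    type Node struct {
        ID      string
        LastNet string // /24 subnet of the node's IP
    }

    // selectDistinct picks up to count nodes, at most one per subnet,
    // choosing subnets (not node IDs) with equal probability.
    func selectDistinct(nodes []*Node, count int) []*Node {
        bySubnet := map[string][]*Node{}
        for _, n := range nodes {
            bySubnet[n.LastNet] = append(bySubnet[n.LastNet], n)
        }
        subnets := make([]string, 0, len(bySubnet))
        for s := range bySubnet {
            subnets = append(subnets, s)
        }
        rand.Shuffle(len(subnets), func(i, j int) {
            subnets[i], subnets[j] = subnets[j], subnets[i]
        })
        if count > len(subnets) {
            count = len(subnets)
        }
        selected := make([]*Node, 0, count)
        for _, s := range subnets[:count] {
            group := bySubnet[s]
            // one random node within the chosen subnet
            selected = append(selected, group[rand.Intn(len(group))])
        }
        return selected
    }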
Change-Id: Ia5c0aaf68e7feca78fbbd7352ad369fcb77c3a05
This allows seeing logs in the output of the invoice commands.
The existing ensure-stripe-customer command is moved from the 'reports' to
the new 'billing' root command.
Change-Id: I752c7ab6ca59bfac8e0f174a45d2ab45fc18e467
InvoiceApplyProjectRecords
The ListUnapplied method in the listing loop was using the next offset as
the starting point, but the applyProjectRecords method was changing project
record state to applied, so we were missing some records to apply.
Test will be added as a separate change.
Change-Id: Id1ca33eeb66ec7f6ff1f05b45615a8935185568e
Ensuring that they have less randomness means that they can be
compressed better. Using a timestamp should be a good improvement here.
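A sketch of the idea, assuming 16-byte serials with the expiration packed
into the prefix (the exact layout is an assumption):

    package orders

    import (
        "crypto/rand"
        "encoding/binary"
        "time"
    )

    // newSerial builds a serial number whose prefix is the expiration
    // timestamp, so serials created around the same time share a prefix
    // and sort/compress better than fully random values.
    func newSerial(expiration time.Time) (serial [16]byte, err error) {
        binary.BigEndian.PutUint64(serial[:8], uint64(expiration.Unix()))
        _, err = rand.Read(serial[8:])
        return serial, err
    }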
Change-Id: Ic4dabb53335a744ff1c332dd279f37ae2cd79357
able to cover more testing scenarios
Currently, it's hard to implement a test suite for payments because
mockpayments is at too high a level and we cannot emulate many things, e.g.
adding a credit card. This change is the first step towards being able to
mock the Stripe client and do more granular tests.
Change-Id: Ied85d4bd0642debdffe1161657c1e475202e9d23
To avoid including multiple months in a single invoice, we need all
the invoice commands to run for a specific period.
See https://storjlabs.atlassian.net/browse/USR-725
Change-Id: I3637dc189234f02350daca8d897c21765762ea55
There is a subtle problem when one does a cast with `::date`. Observe:
teststorj=# set timezone = 'US/Eastern';
SET
teststorj=# select (timestamp with time zone '2020-02-01 00:00:00+00')::date;
date
------------
2020-01-31
(1 row)
teststorj=# set timezone = 'UTC';
SET
teststorj=# select (timestamp with time zone '2020-02-01 00:00:00+00')::date;
date
------------
2020-02-01
(1 row)
In order to correctly determine the date a timestamp is in, one has to
explicitly pick the time zone that the date truncation should use
otherwise postgres will use whatever setting the client has. These
tests were failing for me locally, because I run my postgres in
the US/Eastern time zone to try to tickle these bugs out. So it
should be `(x at time zone 'UTC')::date` instead of just `x::date`.
Change-Id: I4e9e32d4b53abc6165a4d0474f4702f8b9f801c7
The current satellite config lock code relies on bash scripts and
gnu diff; it must be run as root and hence it typically requires
docker. The old version will be removed at a later date.
I tried for several hours to run directly against cmdSetup() in
cmd/satellite/main.go, to avoid the ctx.Compile() call. I had no
luck.
Change-Id: I0a4888421e743b436d32b6af69d04759d7816751
See https://storjlabs.atlassian.net/browse/SM-752
These changes allow us to change the log level at runtime through a handler off of the debug endpoint.
Examples of changing the log level on storj-sim
To get the current level for the satellite api process:
curl -XGET 'http://127.0.0.1:10009/logging' --header 'Content-Type: text/plain'
To change the log level:
curl -XPUT 'http://127.0.0.1:10009/logging' --header 'Content-Type: text/plain' --data-raw '{"level":"error"}'
Change-Id: I05d164b290929fa06b6d78c01075ee41f8238044
We have 3 types of discounts:
1) Promotional credits/coupons
2) Bonus from depositing STORJ tokens
3) Stripe discounts (e.g. 100% off for Storjlings, 30% off for Early
Adopters, etc.)
So far the discounts were applied in the above order. But because the
Stripe discount is applied on all of the project usage fees, this could
sometimes lead to a negative total in the invoice, especially if the
Stripe discount is 100% or all of the project fees are covered by
coupons and bonuses.
To resolve this issue, before applying promotional coupons and deposit
bonuses, the Stripe discount will be applied first to the project fees.
Change-Id: I5dcbec04ec3a04e7f76b11e0a228ccb3195f2db0
Before:
- Discount from coupon: Promotional credits (limited time -
2 billing periods)
- Discount from credits
After:
- Promotional credits (limited time - 2 billing periods)
- Credits from STORJ deposit bonus
This way we don't mix the terms coupon and credit. And it is clearer
when the credit comes from a deposit bonus.
Change-Id: I4bba76a5501147f9de399eac41c4f157d6bda032
The billing history currently shows the Total amount from the Stripe
invoice. In fact, this value is just the amount deducted from the Stripe
balance. It does not reflect any deduction from promotional coupon or
bonus credits.
This patch adds the deducted amount from the promotional coupons and
bonus credits to the displayed amount in billing history. This way
customers have a better understanding of the total amount deducted from
their account balance on the satellite.
Change-Id: Ibd7f611a6cea0143a3059f39dd1d9ef21c264d8c
GetNodes returned references to nodes in the immutable state, however
some parts of the code expect to modify them.
Change-Id: I5be1866f95e0dbe062a6b6be60e29f2365c35faa
Also distinguish the purpose for selecting nodes to avoid potential
confusion about what should allow caching and what shouldn't.
Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61
Add a flag that allows us to easily switch disqualification from
suspension mode on or off. A node will only be disqualified from
suspension mode if it has been suspended for longer than the grace
period AND the SuspensionDQEnabled flag is true.
Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792
Currently uploads can cause a lot of IOPS; reduce this by introducing an
in-memory buffer on top of the file.
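The general shape, sketched with the standard library (the actual buffer
type and size are implementation details, not shown here):

    package pieces

    import (
        "bufio"
        "io"
        "os"
    )

    // writePiece copies the upload into the file through an in-memory
    // buffer, turning many small writes into a few large ones. (Sketch.)
    func writePiece(path string, upload io.Reader) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        buf := bufio.NewWriterSize(f, 256*1024) // size illustrative
        if _, err := io.Copy(buf, upload); err != nil {
            _ = f.Close()
            return err
        }
        if err := buf.Flush(); err != nil { // push buffered bytes to disk
            _ = f.Close()
            return err
        }
        return f.Close()
    }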
Change-Id: I5f4e3e01c0a36258271d180b922107de447bcb59
TestVerifierSlowDownload would sometimes not have enough nodes finish in
the allotted deadline period. This increases the deadline and also does
not assert that exactly 3 have finished. Instead, in keeping with the
purpose of the test, it asserts that the slow download is never counted
as a success and is always counted as a pending audit in the final
report.
Change-Id: I180734fcc4a499420c75164bad6253ed155d87de
CreateTables hasn't been quite true for a while now; rename it to
MigrateToLatest to be clearer about its behavior.
Change-Id: Ida48e95122a5d9b7a814e922d3698e00024a2ba7
The UpdateAddress method used to be used when storage nodes checked in with the Satellite, but once the contact service was created this method was no longer used. This PR finally removes it.
Change-Id: Ib3f83c8003269671d97d54f21ee69665fa663f24
time.Now() called twice in a short amount of time can return the exact same value.
Ensure that the test uses times that are distinct.
Change-Id: Ia653ce0af4bfcf7b5da133a9cf98b823033d9592
Before, the deleter would close its done channel only once, so if additional
tests shared a storagenode, even if not in parallel, the later waits
would not work properly. This fixes that problem.
Change-Id: I7dcacf6699cef7c2c2948ba0f4369ef520601bf5
CustomerRepositoryList relied on the created-at time for the output sorting,
however the granularity of the timer may cause multiple customers
to have the same "created at" time.
Add a sleep so that every customer does get a unique time.
Change-Id: If2923174f304fa5f41260d500f6139e3fa7c3ba5
A delay of 100ms could happen due to other things happening on the test
server. Increase the time to 1s.
Change-Id: I2c7c21f966101771633d73e84cf9850d28089e71
Sometimes nodes who have gracefully exited will still be holding pieces
according to the satellite. This has some unintended side effects
currently, such as nodes getting disqualified after having successfully
exited.
* When the audit reporter attempts to update node stats, do not update
stats (alpha, beta, suspension, disqualification) if the node has
finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated)
* Treat gracefully exited nodes as "not reputable" so that the repairer
and checker do not count them as healthy (overlay/statdb_test.go
TestKnownUnreliableOrOffline, repair/repair_test.go
TestRepairGracefullyExited)
Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272
When running testplanet tests, mark storagenode peer PieceDeleter as in
testing mode so that you don't have to do it on each test.
Change-Id: I2592e02c63f8bcc9152ecf436bac4e798b08bccf
Currently Cockroach isn't performant for concurrent database setup and
tear-down. Instead of a single instance allow setting multiple potential
connection strings and let the tests pick one connection string
randomly.
This improves test duration by ~10 minutes.
While we are at significantly changing how pgtest works, introduce
helper PickPostgres and PickCockroach for selecting the database to
reduce code duplications in multiple places.
Change-Id: I8ad171d5c4c8a4fc081ec2ae9bdd0cc948a80619
In cases like the segment reaper script connecting to the metainfodb,
we don't want a db migration to happen automatically when we call
metainfo.NewStore. This adds MigrateToLatest method for postgreskv
and cockroachkv, and calls MigrateToLatest in places where NewStore used
to create tables.
Change-Id: I682d0f26d609af0601dfdb32a24866cdf5d32a7e
There was a race in the test code for piece deleter, which made it
possible to broadcast on the condition variable before anyone was
waiting. This change fixes that and has Wait take a context so it times
out with the context.
Change-Id: Ia4f77a7b7d2287d5ab1d7ba541caeb1ba036dba3
A/B indicates that B is a subtest of A, however in this case they
represent a configuration of the test, not a subtest.
Change-Id: I64eed5d5bcb12759e54fe4b5373f8e88488e50f7
Added a per IP rate limiter to the console web.
Cleaned up password check to leak less bcrypt info.
Change-Id: I3c882978bd8de3ee9428cb6434a41ab2fc405fb2
To improve delete performance, we want to process deletes asynchronously
once the message has been received from the satellite. This change makes
it so that storagenodes will send the delete request to a piece Deleter,
which will process a "best-effort" delete asynchronously and return a
success message to the satellite.
There is a configurable number of max delete workers and a max delete
queue size.
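A condensed sketch of that shape (names and signatures are assumptions):

    package pieces

    import "storj.io/common/storj"

    // Deleter fans deletes out to a bounded set of workers through a
    // bounded queue so the satellite gets an immediate response.
    type Deleter struct {
        queue chan storj.PieceID
        del   func(storj.PieceID) error
    }

    func NewDeleter(maxWorkers, maxQueue int, del func(storj.PieceID) error) *Deleter {
        d := &Deleter{queue: make(chan storj.PieceID, maxQueue), del: del}
        for i := 0; i < maxWorkers; i++ {
            go func() {
                for id := range d.queue {
                    _ = d.del(id) // best effort: errors logged, not returned
                }
            }()
        }
        return d
    }

    // Enqueue never blocks; false means the queue was full and the
    // delete was dropped.
    func (d *Deleter) Enqueue(id storj.PieceID) bool {
        select {
        case d.queue <- id:
            return true
        default:
            return false
        }
    }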
Change-Id: I016b68031f9065a9b09224f161b6783e18cf21e5
Update unknown_audit_reputation_alpha and unknown_audit_reputation_beta.
Add a test to verify that BatchUpdateStats properly modifies unknown audit
alpha/beta.
Change-Id: I0d5f9cac96a99f64905cf575b772402db0756a9d
If a node is suspended and receives an unknown or failing audit,
disqualify them if the grace period (default 1w in production) has
passed.
Migrate the nodes table so any node that is currently suspended gets
unsuspended when the satellite starts up.
Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82
Currently it was possible that PopAll returned 1010 items, then
made one RPC call with 1000 items, then an RPC call with 10 items. Meanwhile,
500 new items may have been added to the queue.
This change ensures that we pull items from the queue early and
try to make rpc batches as large as possible.
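The intended loop shape, sketched (PopN and settle are hypothetical helper
names):

    // Pull a fresh, maximally full batch from the queue on every
    // iteration, so items added while an RPC was in flight join the
    // next batch instead of straggling in a tiny trailing call.
    for {
        batch := queue.PopN(maxBatchSize)
        if len(batch) == 0 {
            break
        }
        if err := settle(ctx, batch); err != nil {
            return err
        }
    }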
Change-Id: I1a30dde9164c2ff7b90c906a9544593c4f1cf0e9
This reverts commit 105dc7acc6.
Reason for revert: Recent changes to the Postgres query plan seems to want to use this index now. Reverting until we have time to analyze what's happening.
Change-Id: I74b4b5a8f15c3850d8a958a29f51dbc80e7c282c
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue
Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
Replace most of the old libuplink usages in testplanet. 100% migration will
be possible when we are able to implement UploadWithClientConfig
with the new libuplink.
Change-Id: I432d7d4917c7b67d46a058abd0a2a6a13f565ac4
Automatically attach attribution information to bucket during
BeginObject or CreateBucket when the UserAgent is set.
Change-Id: I405cb26c5a2f7394b30e3f2cf5d2214c8781eb8b
We want to avoid net/http dependency in errs2 package, hence we removed
http.ErrServerClosed from IgnoreCanceled and IsCanceled check. Now we
need to add that check explicitly to every http endpoint.
Change-Id: I62b1cc0a0a2d3b43301d713a7951e5022145f88f
This adds support for observers to join the loop together. This ensures
that when multiple observers join, they will be part of the
same loop iteration.
Change-Id: Ie887d4cedfb074b65c782690a2c09c1704f56dfe
During testing it's possible to get into a scenario where all nodes are
offline and the list of requests is empty.
Change-Id: I271c0ca2c72009244df13e8bc1441fcd5f3da9e0
* satellite: update log levels
Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f
* log version on start up for every service
Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de
* Use monkit for tracking failed ip resolutions
Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2
* fix compile errors
Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf
* add request limit value to storage node rpc err
Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5
* we can't track storage node ids in monkit metrics so let's use logging to track that for expired orders
Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887
* fix build errs
Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47
Currently storj-sim relies on the log lines being exactly the same;
when they change it cannot find the necessary information in the log.
Change-Id: Ia039915ef3375a7cf60f107b2c05c958de15b6d5
Currently ListV2 loads the whole data set into memory, even when all the
data isn't being used, using up more memory than needed.
Change-Id: I5846d979344729b447c108a6cc9f4227229ec981
Alpha=1 and beta=0 are the expected first values for any alpha/beta
reputation system we are using in the codebase. So we are removing the
configurability of these values.
Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3
We want to increase our throughput for downtime estimation. This commit
adds the ability to reach out to multiple nodes concurrently for downtime
estimation. The number of concurrent routines is determined by a new config
flag, EstimationConcurrencyLimit. It also increases the default
EstimationBatchSize to 1000.
Change-Id: I800ce7ec1035885afa194c3c3f64eedd4f6f61eb
until we do a good job of cleaning them up, we should at least
not charge or pay people for them. nodes already locally delete
expired segments.
subsumes the tests in 1112.
Change-Id: I5961185764e02f6136b3231b44ecc75a9a8832c9
Whenever the node's reputation is updated, if its unknown audit
reputation is below the suspension threshold, its suspension field
is set to the current time. This could overwrite the previous
"suspendedAt" value resulting a node that never reaches the end of
its suspension.
Also log whenever a node is disqualified or its suspension status
changes
Change-Id: I5e8c8f1c46f66d79cb279b5b16a84fe03f533deb
Reduce the number of non-methods to reduce funcs in the namespace; also
combine a func to slightly condense the code more.
Change-Id: Ifbe728eb8c8ca4c981df648decd259c2097b6b40
* Add migration to storagenode reputation table to add suspended
timestamp
* Send suspended info to storagenode from satellite nodestats endpoint
* Add suspended status to storagenode api
* Add an indicator on the storagenode dashboard informing operator of
the satellites the node is suspended on
Change-Id: Ie3669f6069cc0258ba76ec99d17006e1b5fd9c8a
potential encryption overhead.
This is the same approach we have for validating remote segment size.
https://storjlabs.atlassian.net/browse/USR-619
Change-Id: I2597ee734313a3068fd986001680bbedbf1bed2a
Adds a test to make sure that the correct amount of total nodes,
reputable nodes, and new nodes are returned by SelectStorageNodes
in different cases.
Change-Id: I0939159600afde8a46c35735f1edf0576fcdb4cd
This adds new endpoint /api/user/{user-email} which allows to get the
projects where the user is a member.
It also moves existing endpoint:
/project/{projectid}/limit -> /api/project/{projectid}/limit
To avoid future conflicts for displaying pages.
Change-Id: I5efe3e1c8f79894c136f92ed815f635a34ba6f98
size with BeginObject
Such a solution would add one round trip to the satellite during upload, so
for now we are reverting this until we have a solution for it.
Change-Id: Ic2d826448ab7b0318cd6922df05deee9167cf2f0
We have been using the SQL expression `name='(*Verifier).Verify' AND
error_name='not enough shares for successful audit'` thus far to detect
cases of this problem and alert on them. Unfortunately, since this
rarely (hopefully never) happens, influxdb has no data for most of the
auditor instances, and when it has no data for a time series, it returns
no columns either. This makes Redash upset when it tries to perform a
query for an alert and can't find the column whose value it expects to
check.
This change should make it so zero values are reported when the problem
has not happened, and higher values when it has.
Change-Id: I79e5e000f879678b661dac88caae1e2915b39ab1
there are a subset of storagenodes hammering the satellite with
expired orders. if we check for expiration first, we don't have
to do a bunch of pointless signature verification. since a && b
is equal to b && a, we can order these checks in any way we want
and have it still be correct.
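In sketch form (helper names hypothetical; the expiration field follows the
common protobufs):

    // cheap check first: expired orders are rejected before any
    // cryptographic work is done
    func checkOrder(limit *pb.OrderLimit, order *pb.Order) error {
        if limit.OrderExpiration.Before(time.Now()) {
            return errExpired // hypothetical sentinel
        }
        // only unexpired orders pay for signature verification
        return verifySignature(limit, order) // hypothetical helper
    }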
Change-Id: I6ffc8025c8b0d54949a1daf5f5ea1fed9e213372
BeginObject response
We want to control the inline segment size and segment size on the satellite
side. We need to return such information to the uplink, as we do with the
redundancy scheme.
Change-Id: If04b0a45a2757a01c0cc046432c115f475e9323c
all permissions
Without read and list permissions, BeginObjectDelete won't return an error
if one occurs. This was breaking Batch processing because there was an
assumption that, without an error, the response will always be non-nil.
https://storjlabs.atlassian.net/browse/SM-590
Change-Id: I0fc9539e429110a660eb28725b266d5e4771d198
uuid.UUID implements driver.Value so it can be directly used as a
scannable result.
Replace uses of dbutil.BytesToUUID with uuid.FromBytes.
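Sketch of both directions (query and variable names illustrative, assuming
the matching sql.Scanner is implemented alongside driver.Value):

    // scanning a UUID straight out of a query result
    var id uuid.UUID
    err := db.QueryRowContext(ctx,
        `SELECT id FROM projects WHERE name = $1`, name).Scan(&id)

    // parsing raw bytes without dbutil.BytesToUUID
    id, err = uuid.FromBytes(raw)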
Change-Id: I51a670185ceb3cc2199d5aa2b76bc3fc191ca8fe
Instead of providing the database to testplanet from outside, create it
inside and then allow wrapping and modifying it. This is more convenient
to use.
Change-Id: I9b8f69e6e0a19ff984b4e2bfe927c9100c77bc6c
Add flag to satellite repairer, "InMemoryRepair" that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read from in the
upload phase of repair.
This should help with handling high repair traffic on satellites that
cannot afford to spend 64mb of memory per repair worker.
Updates tests to test repair for both in memory and to disk.
Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194
we still need to come up with a better plan to get storage nodes
to stop doing this, but in the meantime, we know this is happening,
just stop logging it and keep some stats instead.
Change-Id: Icb6bcba275e0e955c54b1a90da2b37219fff2349
storagenodes have like 10 or more databases. without this
tag they all get sent as the same value, stomping on each
other.
Change-Id: Ib12019684d6ea8f2a5b83df584056dfa79e3c4b3
This will help to determine how many grpc calls are made to the
satellite.
Also remove the grpc funcs that have been added to upstream.
Change-Id: I91878f4fd10f9bfe601c94222c102eaaf4d35963
* debug
* traces
* cfgstruct
* process
Package `storj/private/version` will be removed as a separate change.
Change-Id: Iadc40faa782e6225513b28218952f02d9c240a9f
The goal of this change is to improve the storagenode_storage_tallies table by removing the unneeded id column that is not being used but only taking up space, and also to add an index on a different column that needs it. Removing and adding a column seems simple, but ended up being more complicated because of some cockroachdb limitations.
The cockroachdb limitations when trying to remove a column from a table and create a new primary key are:
1. only allows primary key creation at table creation time (docs: https://www.cockroachlabs.com/docs/stable/primary-key.html)
2. table drop or rename is performed async and cannot be done in a transaction (issue: https://github.com/cockroachdb/cockroach/issues/12123, https://github.com/cockroachdb/cockroach/issues/22868)
To address these differences between cockroachdb and Postgres, this PR performs different migrations for the two databases. The Postgres migration is straightforward and what you would expect, but the cockroach migration has two main changes:
1. To change a primary key, use the recommended process from the cockroachdb docs to create a new table with the new primary key you want and then migrate the data.
2. In order to do 1, we needed to do the new table renaming in a separate transaction from the data migration.
Ref: SM-65
Change-Id: Idc9aee3ab57aa4d5570e3d2980afea853cd966bf
This implements a service for pointer verification. This makes the
logic slightly clearer, because it's not part of metainfo.
It also adds a peer identity cache which reduces database calls and peer
identity decoding.
Change-Id: I45da40460d579c6f5fd74c69bccea215157aafda
step 1 in https://review.dev.storj.io/c/storj/uplink/+/1236
Now the old libuplink uses the temporary DeleteBucketReturnDeleted and
DeleteObjectReturnDeleted methods. This way, in the next step, we will
be able to change the DeleteBucket and DeleteObject methods to return
the deleted bucket/object.
Change-Id: I2e638be1960bca6ce1456c92849fcdd6d93e5252
by doing an indexed anti-join we're able to reduce the time to
select the pending orders by over 10x on postgres. this should
help us process pending orders much more quickly.
it probably won't do as good a job on cockroach because it does
not do an indexed anti-join and instead does a hash join after
scanning the entire consumed serials table. we should either
remove orders entirely or try to make that more efficient
when necessary.
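The anti-join shape, sketched (schema abbreviated and illustrative):

    // select pending serials that have not been consumed yet; with the
    // right index, postgres can satisfy NOT EXISTS via an indexed
    // anti-join instead of scanning consumed_serials for a hash join
    const pendingQuery = `
        SELECT p.storage_node_id, p.serial_number, p.settled
        FROM pending_serial_queue p
        WHERE NOT EXISTS (
            SELECT 1 FROM consumed_serials c
            WHERE c.storage_node_id = p.storage_node_id
              AND c.serial_number = p.serial_number
        )
    `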
Change-Id: I8ca0535acd21c51e74955b24c9b86d20e4f2ff9c
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair
This commit also removes unused overlay functionality.
Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).
Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states
Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
Initial change for checking bucket existence on the satellite side for
requests like BeginObject and ListObjects. This is a simple implementation
that just checks the bucket in the DB, but it should be improved in the
future to avoid DB calls as much as possible.
Part of https://storjlabs.atlassian.net/browse/USR-365
Change-Id: I9076acddc44d7dbfa7612a1c24a007de01621583
This adds a piece deletion handler that has debounce for failed dialing
and batching multiple jobs into a single request.
Change-Id: If64021bebb2faae7f3e6bdcceef705aed41e7d7b
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.
Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
To handle concurrent deletion requests we need to combine them into a
single request.
To implement this we introduce a few concurrency ideas, sketched after
this list:
* Combiner, which takes a node id and a Job and handles combining
multiple requests to a single batch.
* Job, which represents deleting of multiple piece ids with a
notification mechanism to the caller.
* Queue, which provides communication from Combiner to Handler.
It can limit the number of requests per work queue.
* Handler, which takes an active Queue and processes it until it has
consumed all the jobs.
It can provide limits to handling concurrency.
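Rough shape of those pieces (signatures assumed, not the actual API):

    package piecedeletion

    import (
        "context"

        "storj.io/common/storj"
    )

    // Job represents deleting multiple piece ids, with a way to notify
    // the caller when the work is done.
    type Job struct {
        Pieces  []storj.PieceID
        Resolve func(err error)
    }

    // Queue carries jobs from the Combiner to a Handler and limits the
    // number of requests per work queue.
    type Queue interface {
        TryPush(Job) bool // false when the queue is at its limit
        PopAll() ([]Job, bool)
    }

    // Combiner merges jobs headed to the same node into a single batch.
    type Combiner interface {
        Enqueue(node storj.NodeID, job Job)
    }

    // Handler consumes an active Queue until all jobs are processed.
    type Handler interface {
        Handle(ctx context.Context, node storj.NodeID, queue Queue)
    }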
Change-Id: I3299325534abad4bae66969ffa16c6ed95d5574f
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address
This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base
Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
We missed this in the migration that added the num_healthy_pieces
column. It exists in dbx, but not on the actual satellite table.
Change-Id: If16b5ec2325d56406250298531b3285215188bf3
Metainfo method validateAuth checks things like API key, user permission
and rate limit but at the end all errors were returned as
rpcstatus.Unauthenticated.
Old Metainfo is not touched to avoid backward compatibility issues.
Change-Id: I78eb276210fc50151da58a5c84e13ecd0961da29
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).
Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.
This change also better differentiates some error cases from Repair()
for monitoring purposes.
Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
New API has limited number of options to configure at the moment. We
should remove unused flags from Uplink CLI and add if needed in the
future.
Change-Id: Icf3f3dadd43cb61a3b408b02d0762aef34425dbf
In production, the satellite is overriding the default repair threshold
(35) to a higher value (52). In some places in the checker and
irreparable processes, the repair threshold on the redundancy scheme is
used in place of the override value. This fixes those cases.
Change-Id: Ie7387217d9fb3886f050b5e5b67be51f276196de
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.
Of the 14 fields that lacked time zones, only the 3 named `interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE').
Fields with good data are migrated by adding AT TIME ZONE 'UTC'.
Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.
https://storjlabs.atlassian.net/browse/SM-104
Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce
- Previously, checkSegmentAltered only checked for segments that were replaced
but we want to detect all changes to a segment that occurred while an audit was being conducted.
- Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified
segments were not being removed from containment mode.
- Filled in gaps in reverify testing to ensure nodes are properly removed from containment.
Change-Id: Icd96d369278987200fd28581395725438972b292
The billing tests were flaky because some assertions ran before the
storage nodes finish their work.
A new helper function has been added to testplanet that allows waiting for
the storage node endpoints to finish their work. This function is now used
in the billing tests to avoid their flakiness.
This commit closes the ticket:
https://storjlabs.atlassian.net/browse/SM-403
A part of fixing other billing tests flakiness.
Change-Id: Iacb750af435f515c04b1e1d3510a218d184c9abc
uplinks
New libuplink does not store encryption values with the bucket, but old
uplinks use those values for configuration. If the bucket was created
with new libuplink we will send back the satellite defaults.
Change-Id: Ie1bf3682847e07b302270b4c4bf1a7219f4bf011
Submit an order limit with a high amount but the order has a low amount of traffic.
Make sure the order amount is used for billing.
Change-Id: I6b6ae26e9b8896f4a3acf530b2f48510b6df89cc
On satellite, remove all references to free_bandwidth column in nodes table.
On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated.
Protobuf message, NodeCapacity, is left intact for backwards compatibility.
Once this is released to all satellites, we can drop the column from the DB.
Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838
Add a test for checking that the billing:
* it doesn't include upload traffic
* it includes download traffic
Change-Id: I1655c15c1fad642f77dd210f2014b2586ae10104
This change is a special case for batch processing. If in a batch request
CommitSegment and CommitObject come one after another, we can execute
these requests as one. This avoids the current logic where we save the
pointer for CommitSegment and later delete this pointer and save it once
again under the last segment path for CommitObject.
This change should handle the issue we have in older uplinks with an
incorrect order of storing pointers.
Change-Id: I86514c95df169e6fbc91b52e5117472cae70cb8b
these tables are used in future commits with respect to the new
storagenode payments code. if we create them now, it will make
backfilling them with historical data easier.
Change-Id: I3c08c9770ec5b2baa38b4f2fd18c2f07746a61c2
Remove the check around consuming an expired serial so that we
have more time to run the migration. It does open a small race
of double spends for entries already counted and then added to
the pending queue right around when they're going to expire and
the consumed serials have already been removed, but that should
be rare if we keep the pending queue empty.
Change-Id: I000b15979b09c67751281ff675ea6c81fc9d22dc
common/pb moved grpc to a separate package common/pb/pbgrpc.
This updates this repository to use it.
Change-Id: I2de2a190688871cf9cb61f7ea511f8a01e264e4e
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.
The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.
We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from one hour to allow us time to
try a wider variety of segments when the repair queue is very large.
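The selection rule in sketch form (illustrative SQL; the real query also
marks the selected row as attempted):

    const selectSegment = `
        SELECT path FROM injuredsegments
        WHERE attempted IS NULL
           OR attempted < now() - interval '6 hours'
        ORDER BY num_healthy_pieces ASC
        LIMIT 1
    `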
Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
This new repair timeout (configured as TotalTimeout) will include both
the time to download pieces and the time to upload pieces, as well as
the time to pop the segment from the repair queue.
This is a move from Github PR #3645.
Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8
This change adds two new tables to process orders as fast as we used
to but in an asynchronous manner and with hopefully less storage
usage. This should help scale on cockroach, but limits us to one
worker. It lays the groundwork for the order processing pipeline to
be queue rather than database driven.
For more details, see the added fast billing changes blueprint.
It also fixes the orders db so that all the timestamps that are
passed to columns that do not contain a time zone are converted to
UTC at the last possible opportunity, making it less likely to use
the APIs incorrectly. We really should migrate to include timezones
on all of our timestamp columns.
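As a sketch of the "convert at the last possible opportunity" rule (the
table and column here are made up):

    package sketch

    import (
        "context"
        "database/sql"
        "time"
    )

    // insertInterval normalizes to UTC immediately before handing the
    // value to a timestamp-without-time-zone column, so callers may
    // pass timestamps in any zone without corrupting stored values.
    func insertInterval(ctx context.Context, db *sql.DB, at time.Time) error {
        _, err := db.ExecContext(ctx,
            `INSERT INTO bandwidth_rollups ( interval_start ) VALUES ( $1 )`,
            at.UTC())
        return err
    }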
Change-Id: Ibfda8e7a3d5972b7798fb61b31ff56419c64ea35
we want to return to the user as quickly as possible but also keep
deleting the remaining pieces on the storagenodes
Change-Id: I04e9e7a80b17a8c474c841cceae02bb21d2e796f
Currently, storage nodes only report their capacity to satellites
once per hour. If a node fills up, it will fail all uploads until
the next contact cycle begins. With these changes, at the end of an
upload we check whether the MinimumDiskSpace threshold has been
passed. If so, we trigger the monitor chore to update the node's
capacity, then trigger the contact chore to report the new
capacity to the satellites.
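A minimal sketch of that trigger flow, assuming chores that expose a
non-blocking Trigger (the real chore API may differ):

    package sketch

    // chore is a stand-in for a background worker with a wake-up channel.
    type chore struct{ wake chan struct{} }

    // Trigger wakes the chore loop early without blocking the upload path.
    func (c *chore) Trigger() {
        select {
        case c.wake <- struct{}{}:
        default: // a wake-up is already pending
        }
    }

    // afterUpload runs at the end of an upload: if free space fell below
    // the threshold, refresh the capacity figure and report it.
    func afterUpload(free, minimumDiskSpace int64, monitor, contact *chore) {
        if free < minimumDiskSpace {
            monitor.Trigger() // recompute the node's capacity
            contact.Trigger() // then report it to the satellites
        }
    }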
Change-Id: Ie6aadaade1e2c12c87e03f8ff9059a50121380a0
Enhance the documentation of the UseSerialNumber method (interface and
implementation) and add several missing dots in doc comments of the
methods of the same interface and implementation.
Change-Id: I792cd344f0d2542e060fa2ec288b71231cae69de
at the end of tally iteration, in order to set the new live
accounting totals, we were iterating over all live accounting
projects. We found a bug with this when running storj-sim. If
we restarted the satellite, live accounting would be cleared
because storj-sim was running the live accounting redis instance.
Since live accounting was cleared, at the end of tally, even if
it found data in projects, we would not update the live accounting
totals because we were iterating over the projects from live
accounting to do so. We now iterate over projects found from tally
in order to update live accounting.
We also found that if a user deleted everything from their project,
tally would not find it and the live accounting would not be updated.
For this reason, we merge live accounting projects into tally projects.
Change-Id: If0726ba0c7b692d69f42c5806e6c0f47eecccb73
rationale: if GC kills the satellite, it would be nice to make
it through a repair checker sweep first
Change-Id: Id56171dc8e13940cfb6481e36a910bad077a01ed
Trace the calls to the DeletePiecesService.DeletePieces method and add
metrics for statistics about the rate at which specific storage nodes
are dialed and the time spent dialing them.
These statistics will help us find out if we should implement
connection queues to storage nodes to reduce the deletion time in case
we see that we're spending too much time dialing frequent storage
nodes.
Ticket: https://storjlabs.atlassian.net/browse/SM-85
Change-Id: I9601676c3a8ad96c73c93833145929e4817755e2
Graceful exit is very slow at the moment. Over the last couple of days
we increased the batch size on Stefan's satellite to 1000, but as a side
effect the error rate increased. With a batch size of 500 the error
rate looks stable.
This PR will increase the default batch size to 300. Graceful exit
will still be painfully slow but at least it will be a bit faster. At
the same time this PR also increases the number of errors we tolerate.
We don't want to DQ slow storage nodes just because they didn't finish
all 300 transfers in time. We want to give them more retries.
Change-Id: I92e3f99e116d4988457d8b902a88e85ed1bcc1a7
Fixes https://storjlabs.atlassian.net/browse/USER-240
- Adds UnsynchronizedPut method to metainfo service that overwrites any
existing pointer under the same path
- Uses UnsynchronizedPut in the metainfo endpoint for committing the
segments
Change-Id: Icb43f31ea33f14066ca9dfdcf226eb3079b90948
if redis crashed in the middle of tally we could have a situation
where we erroneously subtract from a project total. Currently,
`latest` should never be less than `initial`
Change-Id: Ibb5ab724ac0ad4d684f7954fad7a9e061104b7df
Currently SNs report their free disk space once per hour. If a node
becomes full, it has to wait until the next contact cycle begins to
report; all the while receiving and failing upload requests. By increasing
the minimum required disk space, we can give the storage nodes more time
to report their space before they completely fill up. This change goes
hand-in-hand with another change we want to implement: trigger capacity
report on SN immediately upon falling below threshold.
Change-Id: I12f778286c6c3f582438b0e2949765ac43325e27
This change resolves all the storage node addresses to their IP addresses
before giving them to the uplink so that the uplink doesn't have to resolve
a hundred hosts and can connect immediately, improving uplink performance.
Change-Id: Idb834351e0fece409d74c8a1c29b0b8c9b09c9ff
This peer will contain our administrative panels.
It's completely separated from our other satellite
processes because it allows better control for restricting
access to it.
Change-Id: Ifca473bee82ff6c680b346918ba32b835a7a6847
Add a monkit metric for when the rate limit is hit.
Logs a warning with the projectID.
https://storjlabs.atlassian.net/browse/SM-165
Change-Id: I352dc40006021990d1bc66a999f62bbf8deb54db
Adds ExcludedIPs to the NodeCriteria for selecting new storage
nodes. Previously, ExcludedIPs was only added to the NodeCriteria
for selecting reputable storage nodes. Now that both are included
in the FindStorageNodesWithPreferences call, it should no longer
be possible to repair pieces to nodes that are on the same IP as
nodes already storing pieces from that segment.
Adds TestSelectNewStorageNodesExcludedIPs to make sure that
SelectNewStorageNodes returns nodes with different IP addresses.
https://storjlabs.atlassian.net/browse/V3-3011
Change-Id: Ic2d5e607cadeba6e8d5c40f9717149cb30880335
more trustworthy downtime tracking
Detection chore: Do not update downtime at all from the detection chore.
We only want to include downtime between two explicitly failed ping attempts
(the duration between last contact success and the first failed ping is no longer
included in downtime calculation)
Estimation chore: If the satellite started after the last failed ping for a node,
do not include offline time since the last failed ping time - only
estimate based on two failed pings with no satellite downtime in
between.
This protects us from including satellite downtime in our storagenode downtime calculations.
Change-Id: I1fddc9f7255a7023e02474255d70c64faae75b8a
Sometimes the upload that is supposed to fail due to excess usage
would pass. This looks to be because it's overwriting another object
uploaded earlier in the test and deleting the old pointer. If tally
happened to run after the pointer is deleted but before the current
upload reaches the live accounting check, it might pass through.
The solution is to upload to a different path each time.
Change-Id: Ie6c825b9c6eab9ed53426ae262e7997bcb6beb7f
In the methods we use to retrieve a user's chargeable BW, we were summing GET, GET_AUDIT,
and GET_REPAIR. We only want to charge for GET.
Change-Id: Icead7695494b22c7c835482cf8b1512a980d59f1
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
it was noticed that if you had a long lived transaction A that
was blocking some other transaction B and A was being aborted
due to retriable errors, then transaction B was never given
priority. this was due to using savepoints to do lightweight
retries.
this behavior was problematic because we had some queries blocked
for over 16 hours, so this commit addresses the issue with two
prongs:
1. bound the amount of time we will retry a transaction
2. create new transactions when a retry is needed
the first ensures that we never wait for 16 hours, and the value
chosen is 10 minutes. that should be long enough for an ample
amount of retries for small queries, and huge queries probably
shouldn't be retried, even if possible: it's preferable to
find a way to make them smaller.
the second ensures that even in the case of retries, queries that
are blocked on the aborted transaction gain priority to run.
between those two changes, the maximum stall time due to retries
should be bounded to around 10 minutes.
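a sketch of the two prongs (the retryable check is a stand-in for real
cockroachdb retry-error detection):

    package sketch

    import (
        "context"
        "database/sql"
        "time"
    )

    func withTxRetry(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
        deadline := time.Now().Add(10 * time.Minute) // prong 1: bound retry time
        for {
            tx, err := db.BeginTx(ctx, nil) // prong 2: fresh transaction per attempt
            if err != nil {
                return err
            }
            if err = fn(tx); err == nil {
                err = tx.Commit()
            }
            if err == nil {
                return nil
            }
            _ = tx.Rollback()
            if !retryable(err) || time.Now().After(deadline) {
                return err
            }
            // the aborted transaction is gone, so queries blocked on it
            // gain priority before we begin the next attempt
        }
    }

    // retryable is a placeholder for detecting retryable errors
    // (e.g. cockroachdb serialization failures, SQLSTATE 40001).
    func retryable(err error) bool { return false }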
Change-Id: Icf898501ef505a89738820a3fae2580988f9f5f4
Allow rate limit project cache to expire so we can make project level rate limit changes without restarting the satellite process.
Change-Id: I159ea22edff5de7cbfcd13bfe70898dcef770e42
Core shouldn't be handling any repair load and we have already disabled it in production.
Let's make it official and remove it.
Change-Id: I46e236692a9164421648cfc974dd3246416b2e00
The satellite now keeps RS values for the uplink, but old uplinks were
using default bucket settings. Because of that we need to override the
bucket settings with the satellite settings to avoid breaking older
uplinks.
Change-Id: Ia1068db70e4adbf741c5e81d27d9e39799049c22
before dbx would generate a complicated blob of conditions that
encoded a row comparison, which only optimized to an index seek
on cockroachdb. this means that sqlite and postgres both had
quadratic behavior on paged queries of this form. instead, use
the implicit row construction feature supported in all of the
databases to do paged support so that they all optimize well.
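the implicit row construction form looks roughly like this (the table
and columns are made up):

    package sketch

    import (
        "context"
        "database/sql"
    )

    // nextPage pages with a row-value comparison, which postgres,
    // cockroachdb, and sqlite can all optimize into an index seek.
    func nextPage(ctx context.Context, db *sql.DB, afterProject, afterName string, limit int) (*sql.Rows, error) {
        return db.QueryContext(ctx, `
            SELECT project_id, name
            FROM bucket_metainfos
            WHERE ( project_id, name ) > ( $1, $2 )
            ORDER BY project_id, name
            LIMIT $3`,
            afterProject, afterName, limit)
    }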
Change-Id: Iac8703929ba2a59ee3ffa619b916d12663422887
This reverts commit 8772867855.
for uplink versions v0.25.0 through v0.30.7, there's a bug with
multi-segment upload where the last segment is inline, caused by this
commit.
Change-Id: If375e186b23265586caf08991c25980e99f3cc1a
This change adds object deletion to the BeginObject request (before
upload). Now that the satellite controls deletion, we can move deletion
before upload to the satellite. This change improves two things:
* no need for an additional request to delete an object before upload
(needs one more change to storj/uplink)
* fixes an issue with lack of permission to upload if a caveat allows
only writing (e.g. disallows deletes but allows writes)
https://storjlabs.atlassian.net/browse/V3-3362
Change-Id: Ic453146298cdd302df290c532123731a3f99e38e
The old migration was not working. It was updating pending (status 0)
and failed (status -1) to completed (status 100).
Change-Id: I808ff3cc692fe6c698ce26a8b411b134e67b752b
For the last few months we had no issues with order submission. I would
call it stable and now it is time to risk a lower expire time. This will
increase the database performance on the satellite and it will reduce
the delay for billing.
The long term goal is 6h, but for that step we need to change graceful
exit first. At the moment storage nodes would get disqualified for not
transferring all pieces in less than 6 hours.
Change-Id: I421a2c2421c5374c4e706e2338f1c2161fedc14c
paths are organized as follows:
project_id/segment_index/bucket_name/encrypted_key
so by picking parts[0] and parts[1], we were using the segment
index instead of the bucket name, causing bandwidth to be
accounted for incorrectly. additionally, we were using the
PUT action instead of the PUT_GRACEFUL_EXIT action, causing
the data to be charged incorrectly. we use PUT_REPAIR for
now because nodes won't accept uploads with PUT_GRACEFUL_EXIT
and our tables need migrations to handle rollups with it.
Change-Id: Ife2aff541222bac930c35df8fcf76e8bac5d60b2
Enable a new golangci-lint linter that has been added to the latest
release. It reports very few issues, so they are fixed in this commit.
Change-Id: I74fef4779c3f592aae19103fd9f70103586fe24e
Change DeleteObjectPieces to delete the segments' pointers of an
object in reverse order.
With L being the last segment and N the total number of segments,
deleting in reverse order means: L first, then segments N-1 down to 0.
Deleting in reverse order makes BeginDeleteObject usable to delete
partially uploaded objects that were interrupted (e.g. upload
cancellation).
With this change, the uplink upload cancellation can be changed to use
BeginDeleteObject to clean up already uploaded segments without having
to retrieve orders and dial every single node which stored a piece.
Ticket: https://storjlabs.atlassian.net/browse/V3-3525
Change-Id: Ieca6fd3801c4b71671811cb5f08a99d5146928a6
A uuid.UUID is an array of bytes, and slicing it refers to the
underlying value, much like taking the address. Because range
in Go reuses the same value for every loop iteration, this means
that later iterations would overwrite earlier stored project
ids. We fix that by making a copy of the value before slicing it
for every loop iteration.
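A self-contained demonstration of the aliasing (note that Go 1.22 later
made loop variables per-iteration, so the bug only reproduces under the
loop semantics in effect at the time):

    package main

    import "fmt"

    type UUID [16]byte // stand-in for uuid.UUID

    func main() {
        uuids := []UUID{{1}, {2}, {3}}

        var buggy [][]byte
        for _, id := range uuids {
            buggy = append(buggy, id[:]) // slices the shared loop variable
        }

        var fixed [][]byte
        for _, id := range uuids {
            id := id // copy the value before slicing it
            fixed = append(fixed, id[:])
        }

        fmt.Println(buggy[0][0], buggy[1][0], buggy[2][0]) // 3 3 3 (pre-1.22)
        fmt.Println(fixed[0][0], fixed[1][0], fixed[2][0]) // 1 2 3
    }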
Change-Id: Iae3f11138d11a176ce360bd5af2244307c74fdad
A few variables were not renamed to the new standard piecesTotal and
piecesContentSize, so it was unclear which value was being used. These
have been updated, and some comments made more thorough.
Change-Id: I363bad4dec2a8e5c54d22c3c4cd85fc3d2b3096c
This change updates the storagenode piecestore apis to expose access to
the full piece size stored on disk. Previously we only had access to
(and only kept a cache of) the content size used for all pieces. This
was inaccurate when reporting the amount of disk space used by nodes.
We now have access to the total content size, as well as the total disk
usage, of all pieces. The pieces cache also keeps a cache of the total
piece size along with the content size.
Change-Id: I4fffe7e1257e04c46021a2e37c5adc6fe69bee55
Currently we risk losing pending bandwidth rollup writes even on a clean
shutdown. This change ensures that all pending writes are actually
written to the db when shutting down the satellite.
Change-Id: Ideab62fa9808937d3dce9585c52405d8c8a0e703
Currently storage tests are tied to the default lookup limit.
Increasing the default limits would make the tests take longer and
sometimes cause a large number of goroutines to be started.
This change adds a configurable lookup limit to all storage backends.
Also removes boltdb.NewShared, since it's not used any more.
Change-Id: I1a052f149da471246fac5745da133c3cfc27582e
Currently Cockroach DB setup takes a significant amount of time.
This flattens the database setup into a single query,
which improves the test time significantly.
The migration tests still test each migration separately.
Change-Id: Iaca16f34a6af3926fa2b5ebf618f939fd59460b3
With this change the RS configuration will be set on the satellite. The
uplink will get the RS values with the BeginObject request and will use
them. For backward compatibility, and to avoid a super large change,
the redundancy scheme stored with the bucket is not touched. This can
be done in the future.
Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff
Since incoming times may be in any time zone, and we want the output
to be in UTC and for them to have 00:00:00 hours, minutes and seconds
we first convert the incoming timestamp to UTC before doing the
truncate to the day and adding a day.
Because the old code always returned a timestamp that was in the
future, this is just for efficiency.
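time.Truncate works on absolute time, so truncating a UTC value by 24h
lands exactly on UTC midnight; for example:

    package main

    import (
        "fmt"
        "time"
    )

    // startOfNextDay converts to UTC first, then truncates to the day
    // and adds a day, mirroring the logic described above.
    func startOfNextDay(t time.Time) time.Time {
        return t.UTC().Truncate(24 * time.Hour).Add(24 * time.Hour)
    }

    func main() {
        in := time.Date(2020, 3, 1, 22, 30, 0, 0, time.FixedZone("UTC-8", -8*60*60))
        fmt.Println(startOfNextDay(in)) // 2020-03-03 00:00:00 +0000 UTC
    }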
Change-Id: Ie692d47bca8691e73852c822d5c56cf8773d99b4
Limits how many times metainfo APIs can be called per second by project ID. If the limit is exceeded, the API will return Unauthorized/Too Many Requests.
The limit per second and the size of the limiter cache per project are configurable, as well as whether the limiter is enabled.
Tests added/updated for the new rate_limit field in the projects table.
Tests added for exceeding limits and disabling the limiter.
Change-Id: Ic8ad102de3b690a475809d4f684156d5715f20fa
This change is a special case for batch processing. If, in a batch
request, CommitSegment and CommitObject come one after another, we can
execute these requests as one. This avoids the current logic where we
save the pointer for CommitSegment and later delete this pointer and
save it once again under the last segment path for CommitObject.
Change-Id: If170e78c8410f5ba5916cbff6a29b9221db9ce2e
Replace all the remaining uses of sql.DB with tagsql.DB to
fix issues with context cancellation.
Introduce tagsql.Open which helps to get rid of all tagsql.Wrap-s.
Use tagsql in cockroachkv and postgreskv.
Change-Id: I8946d203341cb85a25976896fc7881e1f704e779
Not having a skew caused an issue where:
1. Uplink calls "begin segment", where segment isn't committed to the
database.
2. Uplink stores piece X to the storage node A with timestamp 1.
3. Satellite runs garbage collection with timestamp 2.
4. Satellite sends retain request to storage node A with timestamp 2.
5. Storage node A deletes piece X, because 1 < 2.
6. Uplink calls "commit segment" with storage node A in it.
7. Download of segment fails, because A doesn't have piece X.
In production this is not an issue since the MaxTimeSkew is 72h by
default.
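A sketch of the retain check with the skew applied (names are
illustrative, not the exact storagenode code):

    package sketch

    import "time"

    // shouldDelete reports whether garbage collection may delete a
    // piece: pieces newer than (createdBefore - MaxTimeSkew) are kept,
    // so pieces of a not-yet-committed segment survive the retain
    // request.
    func shouldDelete(pieceCreated, createdBefore time.Time, maxTimeSkew time.Duration) bool {
        return pieceCreated.Before(createdBefore.Add(-maxTimeSkew))
    }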
Change-Id: Id87ca3ddc44103dcd85d031b1367168c014b8e7b
Also added temporary types withRebind and withTagTx,
which will be later removed. Currently they help to avoid
changing the whole codebase at the same time.
Change-Id: I7f07ba8f4709a23a463bfa67464628665a05808f
warning: databases migrated to version 77 before this commit
is merged must be manually re-migrated. this should not be a
problem for anything but staging databases.
Change-Id: Ie1631c48379472352014183ee43f1465e22200f7
live accounting used to be a cache to store writes before they are picked up during
the tally iteration, after which the cache is cleared. This created a window in which
users could potentially exceed the storage limit. This PR refactors live accounting to
hold current estimations of space used per project. This should also reduce DB load
since we no longer need to query the satellite DB when checking space used for limiting.
The mechanism by which the new live accounting system works is as follows:
During the upload of any segment, the size of that segment is added to its respective
project total in live accounting. At the beginning of the tally iteration we record
the current values in live accounting as `initialLiveTotals`. At the end of the tally
iteration we again record the current totals in live accounting as `latestLiveTotals`.
The metainfo loop observer in tally allows us to get the project totals from what it
observed in metainfo DB which are stored in `tallyProjectTotals`. However, for any
particular segment uploaded during the metainfo loop, the observer may or may not
have seen it. Thus, we take half of the difference between `latestLiveTotals` and
`initialLiveTotals`, and add that to the total that was found during tally and set that
as the new live accounting total.
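In other words, per project (a sketch; map keys are project IDs):

    package sketch

    // newLiveTotals computes the totals written back to live accounting
    // at the end of tally. Half of the growth seen during the iteration
    // approximates how much of it the metainfo observer already counted.
    func newLiveTotals(tally, initial, latest map[string]int64) map[string]int64 {
        out := make(map[string]int64, len(tally))
        for id, total := range tally {
            out[id] = total + (latest[id]-initial[id])/2
        }
        return out
    }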
Initially, live accounting was storing the total stored amount across all nodes rather than
the segment size, which is inconsistent with how we record amounts stored in the project
accounting DB, so we have refactored live accounting to record segment size.
Change-Id: Ie48bfdef453428fcdc180b2d781a69d58fd927fb
this commit introduces the reported_serials table. its purpose is
to allow for blind writes into it as nodes report in so that we have
minimal contention. in order to continue to accurately account for
used bandwidth, though, we cannot immediately add the settled amount.
if we did, we would have to give up on blind writes.
the table's primary key is structured precisely so that we can quickly
find expired orders and so that we maximally benefit from rocksdb
path prefix compression. we do this by rounding the expires at time
forward to the next day, effectively giving us storagenode petnames
for free. and since there's no secondary index or foreign key
constraints, this design should use significantly less space than
the current used_serials table while also reducing contention.
after inserting the orders into the table, we have a chore that
periodically consumes all of the expired orders in it and inserts
them into the existing rollups tables. this is as if we changed
the nodes to report as the order expired rather than as soon as
possible, so the belief in correctness of the refactor is higher.
since we are able to process large batches of orders (typically
a day's worth), we can use the code to maximally batch inserts into
the rollup tables to make inserts as friendly as possible to
cockroach.
Change-Id: I25d609ca2679b8331979184f16c6d46d4f74c1a6
everyone was importing it as dbx anyway. why should it be
named satellitedb? so yeah just pass the "-p dbx" flag.
Change-Id: I5efa669f4f00f196b38a9acd0d402009475a936f
Create a service for deleting pieces from storage nodes.
Currently the DeletePieces method returns after a success threshold,
completion, or a timeout.
The end goal is to return once the success threshold is reached, leave
the remaining goroutines running after the DeletePieces method returns,
and add a life cycle to the service so that it waits for them when it
closes.
This is the first commit for ticket:
https://storjlabs.atlassian.net/browse/V3-3476
Change-Id: If740bbf57c741f880449980b8176b036dd956c7b
This reverts commit 8e242cd012.
Revert because lib/pq has known issues with context cancellation.
These issues need to be resolved before these changes can be merged.
Change-Id: I160af51dbc2d67c5449aafa406a403e5367bb555
this will allow for some nice runtime analysis down the road.
also, this allows for wrapping database handles in a way that
can interact with these contexts
requires https://review.dev.storj.io/c/storj/dbx/+/514
Change-Id: Ib087b7cd73296dd2c1e0331314da34d861f61d2b
this allows for setting $STORJ_METAINFO_POSTGRESQL_USE_ALT=yes if you
want to use the cockroachkv implementation for metainfo against postgres
Change-Id: I0c9458c83fd67ee63ef4a78351e64a80a0647408
the hope is that it is mostly interfering with itself, so this
will make it not do that (well, N api servers, but hopefully
that's not enough to cause it to have issues).
Change-Id: Ifd0c9e6617457785ab25fe5b714d8556cdc8e2d3
When an uplink requests an upload or download from the satellite we are tracking the
allocated bandwidth twice. The value in bucket_bandwidth_rollups is used
for project limits but the value in storagenode_bandwidth_rollups is not
used at all. We can increase the performance by removing it. Uplinks
will get a faster response from the satellite.
Change-Id: Icccd41f94107ef34668f30f99bf5f728c384b07e
any database error doesn't mean the order wasn't found. for example
in cockroach it may say that the transaction is aborted. then what?
maybe we get big old row level deadlocks like we've observed? so
instead explicitly check for ErrNoRows to reject the order and bail
out otherwise. the surrounding logic will give it a retry.
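roughly, the classification goes like this (a sketch, not the exact
code):

    package sketch

    import (
        "database/sql"
        "errors"
    )

    // classify treats only sql.ErrNoRows as "order not found"; any
    // other error (e.g. an aborted cockroach transaction) bubbles up
    // so the surrounding logic can retry the order.
    func classify(err error) (reject bool, retry error) {
        switch {
        case err == nil:
            return false, nil
        case errors.Is(err, sql.ErrNoRows):
            return true, nil
        default:
            return false, err
        }
    }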
Change-Id: I6e1f8f6e6a6def3e45b44f5088cbdc158e1098e4
Add a back-pressure mechanism to the satellite metainfo
DeleteObjectPieces method to return once 75% of the pieces have been
successfully deleted.
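A hypothetical sketch of such back-pressure; the real method differs,
but the shape is: return at the threshold, let the stragglers finish in
the background:

    package sketch

    import (
        "context"
        "sync"
        "sync/atomic"
    )

    func deletePieces(ctx context.Context, deletes []func(context.Context) error) {
        threshold := int32((len(deletes)*75 + 99) / 100) // ceil(75%)
        var successes int32
        var once sync.Once
        var pending sync.WaitGroup
        done := make(chan struct{})

        pending.Add(len(deletes))
        for _, del := range deletes {
            go func(del func(context.Context) error) {
                defer pending.Done()
                if del(ctx) == nil && atomic.AddInt32(&successes, 1) >= threshold {
                    once.Do(func() { close(done) }) // threshold reached: unblock caller
                }
            }(del)
        }
        // also unblock if everything finishes without reaching the threshold
        go func() { pending.Wait(); once.Do(func() { close(done) }) }()
        <-done // remaining goroutines keep deleting in the background
    }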
Change-Id: Ia38df49fba5838f0605c40a77cfff8e3442cb5b0
With the new storage node downtime tracking feature, we need to remove
the current uptime reputation configs: UptimeReputationAlpha,
UptimeReputationBeta, and UptimeReputationDQ. This is the first step of
removing the uptime reputation columns from satellitedb.
Change-Id: Ie8fab13295dbf545e33aeda0c4306cda4ba54e36
DeleteObjectPieces should print the warning about closing the
connections only if there was an error.
Change-Id: If3d7ab256d8508c08388c1f22c7dd1eb819d2509
The DeleteObjectPieces must close the storage node client once it has
finished deleting its pieces.
Change-Id: I08eb8af8e4215d77d59b52f5055211b918374ab4
turns out portable sed is hard: it has to work with both
linux and bsd sed, etc. instead, use a really really basic
bash script and a temporary file. this should be much less
likely to cause issues on a wide range of machines.
Change-Id: Ia759789fb52aa1ee3361426bb6c02ed4eac3d23a
Transactions in our code that might need to work against CockroachDB
need to be retried in the event of a retryable error. The transaction
helper functions in dbutil do that automatically. I am changing this
code to use those helpers instead.
I also fleshed out consoledb_test.go to do actual inserts and gets to
make sure things were working correctly.
Change-Id: I089bf4c776d15dc8578080e26760bd6dff4beec9
Transactions in our code that might need to work against CockroachDB
need to be retried in the event of a retryable error. The transaction
helper functions in dbutil do that automatically. I am changing this
code to use those helpers instead.
Change-Id: I22b850ce5859fa07d13bf475be5140e6bde95b8a
Transactions in our code that might need to work against CockroachDB
need to be retried in the event of a retryable error. The WithTx
helper functions in dbutil and dbx do that automatically. I am changing
this code to use those helpers instead.
Change-Id: Iaf492af35471931125f2b7365aa4338f44154881
DeleteObjectPieces must not call the overlay cache KnownReliable method
with an empty list of node IDs, to avoid logging a useless noisy warning.
Change-Id: Ibe2a34f2913f003d3ba020f9764c1369fa63123b
Move tests for the old Metainfo API to a separate file. The metainfo
tests file is large enough, and in the future it will be easier to
remove the old tests.
Change-Id: I9421907ef015a6dfa65f4de6ef01b2d2c8baa7df
Use the helper function IsRPC of the err2 package rather than checking
if an error is of a specific RPC status code with an 'if' conditional.
Change-Id: Ibe89d6c2d836307c3112a6d7cc6bf95f0f985fd2
Disqualifies a node when the node fails to complete a graceful
exit.
Adds a new DisqualifyNode method to the overlay cache, since there
wasn't an existing method to disqualify a node but do nothing else
to its stats.
Adds checks to existing tests to make sure that a storage node that
fails a graceful exit is marked as disqualified in the overlay
cache.
https://storjlabs.atlassian.net/browse/V3-3342
Change-Id: I4d554a519ab59db31ad3b8e28764c8683a6e3888
crdb.ExecuteTx is great, but I don't think it will work right with
PostgreSQL. It works by way of cockroach savepoints, which allows
it to react to retryable errors, whereas tx.Commit() doesn't. But
I don't think PostgreSQL savepoints work exactly the same way. I'm not
100% sure, but it doesn't seem worth the risk.
So, I'm switching one case here to use the new dbutil.WithTx instead,
which will use crdb.ExecuteTx if appropriate. The other case doesn't
need a transaction at all.
Change-Id: I39283f3b5d8d47596db7aff5048bb74597e5918f
Transactions in our code that might need to work against CockroachDB
need to be retried in the event of a retryable error. The transaction
helper functions in dbutil do that automatically. I am changing this
code to use those helpers instead.
Change-Id: I660540885a0784fae844cf99376d1537e208fa69
overlay.GetOfflineNodesLimited
We only care about node ID, address, and last contact success/failure
from the downtime service, so the overlay should only return these
values for the downtime-specific queries.
Change-Id: I08a6ecfdd2a12b82cae62e87d6adeab53975bfce
Transactions in our code that might need to work against CockroachDB
need to be retried in the event of a retryable error. The transaction
helper functions in dbutil do that automatically. I am changing this
code to use those helpers instead.
Change-Id: Icd3da71448a84c582c6afdc6b52d1f345fe9469f
Transactions in our code that might need to work against CockroachDB
need to be retried in the event of a retryable error. The transaction
helper functions in dbutil do that automatically. I am changing this
code to use those helpers instead.
Change-Id: Ibaadd2c8540ba5c8cccd6ecbf529017ab98b78ca
Transactions in our code that might need to work against CockroachDB
need to be retried in the event of a retryable error. The transaction
helper functions in dbutil do that automatically. I am changing this
code to use those helpers instead.
Change-Id: Id24906f5f3ae83245dabb218e1f70e0bcb3b417a
Remove starting up messages from peers. We expect all of them to start,
if they don't, then they should return an error why they don't start.
The only informative message is when a service is disabled.
When doing initial database setup, each migration step isn't
informative, hence we print only a single line with the final version.
Also use shorter log scopes.
Change-Id: Ic8b61411df2eeae2a36d600a0c2fbc97a84a5b93
When the context was being cancelled, the error was being discarded within the rate limiting error handling, which caused tests to fail.
Change-Id: I5c6458c16da09a11531233ea0ee80d914969cb3f
deletePointer must return ErrObjectNotFound rather than an rpc status
error NotFound because the callers must distinguish whether such an
error comes from getPointer or from UnsynchronizedDelete.
Change-Id: I68b4e45a2765e63b73bf85c2c39a5fc0198373f6
We don't want slowloris nodes to be able to indefinitely block
up the satellite, so add a timeout. Some monitoring inspection
showed the largest success times being on the order of 30s, so
a 1min timeout should be sufficient to kill the misbehaving nodes.
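The timeout itself is just a bounded context around the per-node delete
(a sketch, names are illustrative):

    package sketch

    import (
        "context"
        "time"
    )

    // deleteFromNode caps a single node's delete at one minute so a
    // slowloris node cannot hold the satellite open indefinitely.
    func deleteFromNode(ctx context.Context, del func(context.Context) error) error {
        ctx, cancel := context.WithTimeout(ctx, time.Minute)
        defer cancel() // release the timer if del returns early
        return del(ctx)
    }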
Change-Id: I5e2c3480a15f6304e37262d0a4d30d07eae99bb3
As discussed, we decided to rate limit how fast we iterate through
the metainfo database in the metainfo loop. This puts in place a
mechanism for rate limiting and burst limiting if need be in the future.
The default for this rate limiting is still no limits so it stays the
same as our previous functionality.
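A sketch using golang.org/x/time/rate, which gives both knobs; an
infinite rate preserves the old unlimited behavior:

    package sketch

    import (
        "context"

        "golang.org/x/time/rate"
    )

    // iterate walks items while honoring an optional rate limit.
    // rateLimit <= 0 means unlimited, matching the previous behavior.
    func iterate(ctx context.Context, rateLimit float64, burst int, next func() bool) error {
        limit := rate.Inf
        if rateLimit > 0 {
            limit = rate.Limit(rateLimit)
            if burst < 1 {
                burst = 1 // Wait(ctx) needs burst >= 1 for finite limits
            }
        }
        limiter := rate.NewLimiter(limit, burst)
        for next() {
            if err := limiter.Wait(ctx); err != nil {
                return err
            }
        }
        return nil
    }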
Change-Id: I950f7192962b0e49f082d2c4284e2d52b0a925c7
We are missing some tests for the new Metainfo API that exist for the
old API. This is the first change to adjust the old tests to the new API.
Change-Id: Ie2b16bf85de8633662f952e863dbf3d409d801d9
For improving the deletion performance we are shifting the
responsibility for deleting the pieces of the object from the Uplink to
the Satellite.
BeginDeleteObject was the first call and returned the stream ID, which
was afterwards used for retrieving the list of segments and then
getting addressed order limits for deleting the pieces (of each
segment) from the storage nodes.
Now we want the Satellite to delete the pieces of all the object
segments from the storage nodes, hence we no longer need several
network round trips between the Uplink and the Satellite: the Satellite
can delete all of them in the initial BeginDeleteObject request.
satellite/metainfo.ListSegments has been changed to return 0 items if
the pointer of the last segment of an object is not found, because we
need to preserve backward compatibility with Uplinks that won't be
updated to the latest release and still rely on listing the segments
after calling BeginDeleteObject for retrieving the addressed order
limits to contact the storage nodes to delete the pieces.
Change-Id: I5f99ecf27d62d65b0a062936b9b17581ef692af0
Remove direct dependency on uplink.RSConfig, this simplifies
moving the config file without introducing weird dependencies.
Change-Id: I7fd2a145401e0205d7047631df9d2810241efeec
Adds a check to see if storage nodes are eligible to initiate
graceful exit, by checking their CreatedAt date and seeing if
their "age" is greater than the new config value:
NodeMinAgeInMonths.
The default for this value is 6 months for now.
https://storjlabs.atlassian.net/browse/V3-3357
Change-Id: Ib807ab8987ddb5a38a27a83886490f73fe8c5816
The endpoint listSegmentsManually method lacked a check for the limit
parameter, so it could return inconsistent results when the limit is 0
or negative.
When 0 or negative, without the check, it returns no segments but also
reports that there are no more segments, which isn't correct.
The function is only called from the Endpoint.ListSegments method,
which takes care to ensure that the limit is always greater than 0, but
without the check a future caller could misuse it and provoke a bug.
Additionally:
* Documentation for the modified function has been written.
* The part of the function that repeated the logic of the
Endpoint.getPointer method has been removed in favor of calling that
method.
* Added logging before returning an internal error in
Endpoint.getPointer.
Change-Id: I5c4f0db2292da0162db6b7d63553895808d0925a
Do some cleanup for adding new identified TODOs (associated with ticket
https://storjlabs.atlassian.net/browse/V3-3406) and remove an old one.
Change-Id: I5d20dbe1c4dee0a8279e08b05b907f4cc9dba278
In satellite/accounting/rollup Service.RollupStorage we have a few
potential error scenarios that return time.Now(). Especially in the case
where we exit early because we have received 0 tallies since the *last*
rollup, this creates a potential race condition.
Between the time we call GetTalliesSince and realize it is empty, it's
possible a tally was inserted in that interval. As currently written we
are returning a latestTally time that excludes that tally.
We are currently protected because in Service.Rollup we don't save the
rollup unless we have populated the rollupStats. However, this change is
more correct and future-proof, because Service.RollupStorage should
always return a correct latestTally time, which in case of errors and
empty tallies, is the last successful tally.
Change-Id: I2521a2cc9802c8f06e512dde4422803a272e2a0a
Adds the KnownReliable method to Overlay Service that filters all nodes
from the given list to be only reliable nodes (online and qualified).
The method returns []*pb.Node of reliable nodes. The pb.Node values are
ready for dialing.
The first use case is when deleting an object to efficiently dial all
reliable nodes holding a piece of that object and send them a delete
request.
Change-Id: I13e0a8666f3807c5c31ef1a1087476018a5d3acb
Fixes a data race caused by not waiting for workers to finish
before shutting down. This ended up making logging fail, because the
log was closed when the test tried to write to it.
Change-Id: I074045cd83bbf49e658f51353aa7901e9a5d074b
this will allow us to inspect the type of `db.Driver()` on *sql.DB
connections to correctly differentiate between pg and crdb conns.
as a bonus, this moves all concerns about when to replace "cockroach://"
with "postgres://" out of view, letting the thin shim "driver" take care
of that.
Change-Id: Ib24103ab7c508231e681f89a7321b623e4e125e9
Backstory: I needed a better way to pass around information about the
underlying driver and implementation to all the various db-using things
in satellitedb (at least until some new "cockroach driver" support makes
it to DBX). After hitting a few dead ends, I decided I wanted to have a
type that could act like a *dbx.DB but which would also carry
information about the implementation, etc. Then I could pass around that
type to all the things in satellitedb that previously wanted *dbx.DB.
But then I realized that *satellitedb.DB was, essentially, exactly that
already.
One thing that might have kept *satellitedb.DB from being directly
usable was that embedding a *dbx.DB inside it would make a lot of dbx
methods publicly available on a *satellitedb.DB instance that previously
were nicely encapsulated and hidden. But after a quick look, I realized
that _nothing_ outside of satellite/satellitedb even needs to use
satellitedb.DB at all. It didn't even need to be exported, except for
some trivially-replaceable code in migrate_postgres_test.go. And once
I made it unexported, any concerns about exposing new methods on it were
entirely moot.
So I have here changed the exported *satellitedb.DB type into the
unexported *satellitedb.satelliteDB type, and I have changed all the
places here that wanted raw dbx.DB handles to use this new type instead.
Now they can just take a gander at the implementation member on it and
know all they need to know about the underlying database.
This will make it possible for some other pending code here to
differentiate between postgres and cockroach backends.
Change-Id: I27af99f8ae23b50782333da5277b553b34634edc
* Use an existing unexported method in logic that was duplicated in
some exported methods.
* Log a forgotten internal error.
* Improve the documentation, adding more and fixing some to fit our
code style conventions.
Change-Id: Ie6f8bc59f9089f92b8b0d1b4c09c2142c3f273f5
The Endpoint.getPointer method lacked tracing.
Also add a dot at the end of documentation comment for following our
code style conventions.
Change-Id: I9b63ad297f04e31825648aae43aa8f9ebba2b4e2
Return an error when misusing the endpoint method
'listSegmentsFromNumberOfSegments', because the method
'listSegmentsManually' exists for the case when the number of segments
is less than or equal to 0.
If we didn't return an error from 'listSegmentsFromNumberOfSegments',
we would discover a bug much later than we would with an error, because
the clients wouldn't receive an error; they would receive an empty
list and wonder what they are doing wrong before realizing that they
could be facing a bug.
This commit also renames the function to be plural, matching its
"numberOfSegments" parameter, and renames the test function, which was
also missing the final 's'.
Change-Id: I02318685bf36aa3af26545731a1711621a5e2e39
planet.Start starts a testplanet system, whereas planet.Run starts a testplanet
and runs a test against it with each DB backend (cockroach compat).
Change-Id: I39c9da26d9619ee69a2b718d24ab00271f9e9bc2
for storj-sim to work, we need to avoid schemas in cockroach urls
so we have storj-sim create namespaced databases instead of schemas
and we have the migrate command create the database in the same way
that it would create a schema for postgres. then it works!
a follow up commit will move the creation of the database/schemas
into storj-sim's setup step so that we can avoid doing these icky
creations during normal migration calls. it will also make the
pointerdb have an explicit call to migrate instead of just doing
it every time it's opened.
Change-Id: If69ef5cb96b6866b0438c761bd445afb3597ae5f
satellitedb migration tests ran against multiple base versions; however, after merging all the steps, the base versions didn't exist anymore, which meant none of the migration tests were actually running.
Fix a documentation comment for one method and apply our code
conventions to some that I stumbled upon.
Change-Id: I3baf5d004a128dcd561c3e27c080aab345c64461
first, so that they all work the same way, because it's getting
complicated, and second, so that we can do the appropriate thing
instead of CREATE SCHEMA for cockroachdb.
Change-Id: I27fbaeeb6223a3e06d97bcf692a2d014b31465f7
it doesn't necessarily _have_ to be UTC; the time is correct as returned
either way, but this will make it a little less prone to variance.
also, there is a test that depends on the time being returned in UTC.
Change-Id: Ia71e24ecd9973ba70a1cfb5621a3030a5c82d004
Improve the piece hash validation by filtering out a piece when an
order limit is not found for it.
The commit also improves the documentation of an internal metainfo
method and renames the parameters of 2 methods to clarify what they
are.
This will make it so we don't need to comment out those lines every time
we want to enable the cockroachdb tests during development.
Once it's ready this flag can go away.
* update migration steps, add crdb support to testplanet
* add crdb support
* have jenkins run a bare bones crdb compat test
* skip crdb tests
* skip crdb tests
* fix root_piece_id column
* write crdb store to tmp dir
* escape
* satellite/console: Add X-Frame-Options and Referrer-Policy security headers
* Update to use CSP instead of XFO and include tardigrade.io
* Make FrameAncestors a config option
* Update satellite-config lock
* Make help text for FrameAncestors better
* satellite/metainfo: Rollback path parts check in loop
We have to rollback the changes applied in checking the rawPath parts
from 4 to 3 because the production pointerDB is still storing buckets.
* satellite/metainfo: Don't return paths with fewer than 4 parts
Don't return an error in the metainfo loop iterator when a path doesn't
have 4 parts because it belongs to bucket metadata, not an actual
object.
* merge migration
* rm migration versions
* rm unneeded migration test data
* create index w/postgres + crdb compatible syntax
* add default to offers.invitee_credit_duration_days
* changes so that schema matches from master to branch
* change to be crdb compatible
* add check to confirm db version
* mv version check to migration
* update tests
* add minversion to sadb migration, update tests
* confirm min version for all dbs in a migration
* add validate migration to sadb
* fix lint err
* rm min version check from migrate
* change sadb check
* hard code min db version
* fix comment
* skip unknown errors (wip)
* add tests to make sure nodes that time out are added to containment
* add bad blobs store
* call "Skipped" "Unknown"
* add tests to ensure unknown errors do not trigger containment
* add monkit stats to lockfile
* typo
* add periods to end of bad blobs comments
* satellite/nodeselection: don't select nodes that haven't checked in for a while
* change testplanet online window to one minute
* remove satellite reconfigure online window = 0 in repair tests
* pass timestamp into UpdateCheckIn
* change timestamp to timestamptz
* edit tests to set last_contact_success to 4 hours ago
* fix syntax error
* remove check for last_contact_success > last_contact_failure in IsOnline
Large conditional blocks are hard to read.
When the conditional block only has one branch, it's easier to
understand the logic of the function by switching the condition and
returning early.
We don't use reverse listing in any of our code, outside of tests, and
it is only exposed through libuplink in the
lib/uplink.(*Project).ListBuckets() API. We also don't know of any users
who might have a need for reverse listing through ListBuckets().
Since one of our prospective pointerdb backends can not support
backwards iteration, and because of the above considerations, we are
going to remove the reverse listing feature.
Change-Id: I8d2a1f33d01ee70b79918d584b8c671f57eef2a0
* during audit Verify, return error and delete segment if segment is expired
* delete "main" reverify segment and return error if expired
* delete contained nodes and pointers when pointers to audit are expired
* update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now
* Revert "update testplanet.Upload and testplanet.UploadWithConfig to use an expiration time of an hour from now"
This reverts commit e9066151cf84afbff0929a6007e641711a56b6e5.
* do not count ExpirationDate=time.Time{} as expired
* If a node claims to fail a transfer due to piece not found, remove that node from the pointer, delete the transfer queue item.
* If the pointer is piece hash verified, penalize the node. Otherwise, do not penalize the node.
* change satellite.Peer name to Core
* change to Core in testplanet
* missed a few places
* keep shared stuff in peer.go to stay consistent with storj/docs
* separate sadb migration, add version check
* update checkversion to do same validation as migration
* changes per CR
* add sa migration to storj-sim
* add different debug port in storj-sim for migration
* add wait for exit for storj-sim migration
* update sa docker entrypoint to support migration
* storj-sim satellite parts all wait for migration
* upgrade golang-migrate/migrate to v4 because of a bug
* fix go mod tidy
* rm dup api code from sa peer, update storj-sim
* fix for backwards compat tests
* use env var instead of localhost
* changes per CR
* fix env var name
* skip peer for setup
* improve errors in satellite contact endpoints
* add changes per CR comments
* update pingback method so it still updates node table
* fix err and returns
* fix zap logging to be better
* set up satellite repair run command
* add separated repair process to storj-sim
* add repairer peer to satellite in testplanet
* move api run cmd into api.go
* add satellite run repair to entrypoint
* check duplicate node id before update pointer
* add test for transfer failure when pointer already contain the receiving node id
* check exiting and receiving node are still in the pointer
* check node id only exists once in a pointer
* return error if the existing node doesn't match with the piece info in the pointer
* try to recreate the issue on jenkins
* should not remove exiting node piece in test
* Update satellite/gracefulexit/endpoint.go
Co-Authored-By: Maximillian von Briesen <mobyvb@gmail.com>
* Update satellite/gracefulexit/endpoint.go
Co-Authored-By: Maximillian von Briesen <mobyvb@gmail.com>
* add signatures, fix process loop bug, move delete to on success
* added tests for signatures
* PR comment updates
* fixed setting reason by default.
* updates for PR comments
* added signed failure when verification fails
* moved to sign_test
* fix panic
* removed testplanet from test
* Make the exiting node check piece hashes, piece IDs, and piece hash signatures before relaying successful transfer data to the satellite.
* Enable immediate graceful exit failure for "successful" transfers that fail satellite-side validation.
* Move transfer piece logic in storagenode worker to separate function (to make the worker easier to understand)