Apply the coin payments when CoinPayments.net recieves the funds
Instead of the when STORJ gets them from CoinPayments.net
Based on 7/1/20 User Growth standup guidance by JG
Relates to: https://storjlabs.atlassian.net/browse/USR-801
Change-Id: I174ca23a585010f39464c45525e1dfe0179b7c1a
Allow more than 25 coin payment statuses to be queries from coinpayments.net
Add some slight logging, a coinpayments duration metric, and a disabled test
These changes are small changes in support of https://storjlabs.atlassian.net/browse/USR-801
Change-Id: I5b176cdd5e513d8bd88b92f9b22a8bd2456bbdd5
Since we increased the number of concurrent audit workers to two, there are going
to be instances of a single node being audited simultaneously for different segments.
If the node times out for both, we will try to write them both to the pending audits
table, and the second will return an error since the path is not the same as what
already exists. Since with concurrent workers this is expected, we will log the
occurrence rather than return an error.
Since the release default audit concurrency is 2, update testplanet default to run with
concurrent workers as well.
Change-Id: I4e657693fa3e825713a219af3835ae287bb062cb
This adds EncryptionKey definition that can be used as a flag.
These order.EncryptionKey-s will be used to encrypt data in
order limits.
This helps to avoid storing lots of transient data in the
main database.
This code doesn't yet contain encryption itself.
Change-Id: I2efae102a89b851d33342a0106f8d8b3f35119bb
Use a field to distinguish migration steps that need to use a
different transaction from previous steps. This is clearer than
using a func.
Change-Id: I2147369d05413f3e8ddb50c71a46ab1ba3ab5114
Jira: https://storjlabs.atlassian.net/browse/USR-821
The `migrate-credits` billing command checks the available credits
balance for all users and moves it to the Stripe balance by creating a
new credit balance transaction.
Change-Id: Iafc7b95a4edad47f7c145a22e210f8c821ac183d
WHAT:
GTM added for partnered satellites sign up pages
csp values were extended to make GTM work at all:
1. googletagmanager.com for GTM script
2. google-analytics.com for GA script
3. hash was added to avoid using 'unsafe-inline' value in 'script-src' directive
Also config flag for GTM id was added
WHY:
Marketing team needs GTM and GA for their campaigns
Change-Id: Ibb2ace737feb971dda6c191599d479fe4a7af332
When a request comes in on the satellite api and we validate the
macaroon, we now also check if any of the macaroon's tails have been
revoked.
Change-Id: I80ce4312602baf431cfa1b1285f79bed88bb4497
Make some minimal improvements in the README about the API documentation
regarding the required requests and successful response bodies.
Change-Id: If7832f3c40166a55d9baefdb2395211ff9e8dc04
add new columns `offline_suspended` and `under_review` to nodes table.
`unknown_audit_suspended` is a new column which will replace `suspended`
Change-Id: I22ddeb338ea0ff63f14332a7ebd0f3e9e4c06cdc
We should not be sending any type of orders to nodes that have completed
graceful exit with the current satellite. In particular, we should not
be trying to audit them, because that would be silly.
Change-Id: Ie2153e5739914ab696feefcdef28545ed70f84e4
WHAT:
credit history page implemented.
can be visited by clicking specific button in a free credits dropdown.
WHY:
UI didn't display remaining coupon value.
coupons and referral items (in future) are displayed in the same place.
Change-Id: I495fd7a99f2ea5117152aaf8f495bd5322f02588
As the tables that get cleaned up by this job get a lot of inserts and deletes over the course of a day, the autovacuum process on PostgreSQL struggles fairly easily/quickly.
Due to its limitation, it can only delete 180,000,000 tuples in one go, before it has to rescan the entire table/index.
With the current load, the most busy satellites accumulate about 1,000,000,000 tuples per day (consumed_serials). With our current 24h interval that results in ~6-7 scans, slowing the entire database down for a quite long time.
This PR reduces the interval to 4 hours, which under a constant load, results in less than 180,000,000 entries per run.
That way, we do not scan twice for only a small gain over said amount. Reducing the interval further would also increase the DB load unnecessary, as each run scans the entire tables at least once.
For future reference, we might need to adjust the interval, if the load is significantly changing.
Change-Id: I18fdd45d93d468cff126e719c8380c29a49f43dd
This test failed due to a timeout on a download which is supposed to
succeed. The testplanet default for the value is 5 seconds, but here
it is 500 milliseconds.
It looks like this is due to the fact that later in the test we need to
wait for a slow node to timeout, so we cut the timeout shorter to reduce
test time.
This PR increases the timeout to 1 second. Still not too long to wait, but
gives us twice as much time to download, decreasing the likelihood that we
see the timeout error.
Change-Id: I504db39ab5dc4d3c505520337b258265d6da7020
Since we increased the number of audit workers from 1 to 2, we need to make sure
concurrent updates do not trample each other. We can do this by serializing the
transactions.
Change-Id: If1b2f71cabe3c779c12ffa33c0c3271778ac3ae0
WHAT:
1. Deposit & Billing history view was divided to be shown separately as Deposit History and Billing History
2. Datepicker was removed from billing page
WHY:
billing UX enhancements
Change-Id: Ie183849ef0965169997674ce37b71db38a562fc2
This ensures that rows are closed to avoid leaks.
Also verifies that Err() is called, to ensure that no
error is left behind.
Change-Id: Idd1bec9bf479f40021da67b2c80ce83033149469
by project members
This is a fix for listing the same project twice because project has
more than one member.
Change-Id: I3f6fe3456a6753d6d091a64436c22027dcbe2520
This change adjusts the label for STORJ deposit bonuses in billing
history to be more consistent with other labels.
Change-Id: I5e7179ae3ac52dafb0dcef084e9a7c4742491f9e
The DB query in GetAllocatedBandwidthTotal uses an exclusive range:
'WHERE interval_start > ?'
The value that is used for this condition is the first day of current the month,
00:00:00 UTC.
By using the exclusive '>', we exclude the entire first hour of the month from the
result set.
Change-Id: I3ed300f5230c7514dc9495a85e8166213cd0842e
this way we don't have to do 2 steps, and by using the ctid, postgres
is going to do two very efficient prefix scans.
Change-Id: Ia9d0546cdf0a1af67ceec9cd508d336a5fdcbdb9
also remove the continuation support from the queue, otherwise
we may end up sequential scanning the entire table to get
a few rows at the end.
then, in the core, instead of looping both to get a big enough
batch inside of the queue, as well as outside of it to ensure
we consume the whole queue, just get a single batch at a time.
also, make the queue size configurable because we'll need to
do some tuning in production.
Change-Id: If1a997c6012898056ace89366a847c4cb141a025
In jaeger, it shows that this function gets called repetitively in
a single request. Most of the time, it's less than 1ms. Therefore, it
doesn't add much value in our trace but create noises.
Change-Id: I20234f36bbcf0fc22f91e5e1a5634c0cad577ed0
Jira issue: https://storjlabs.atlassian.net/browse/USR-820
The bonus for depositing STORJ tokens is now added as to the Stripe
balance instead of the to `credits` DB table on the satellite.
Existing unspent bonuses in the `credits` DB table are still processed
as usual when generating invoices. They will be migrated to the Stripe
balance with a separate change.
The bonus is added to the Stripe balance with a separate Credit
transaction. The balance transactions for the deposit and the bonus can
be differentiate by their different description.
The billing history is modified to list the bonus from the Stripe
transactions list.
The workflow for depositing STORJ tokens to the Stripe balance is
improved to survive failures in the middle of the process.
Change-Id: I6a1017984eae34e97c580f9093f7e51ca417b962
Jira issue: https://storjlabs.atlassian.net/browse/USR-110
When generating invoices, if a coupon is not fully spent, but this is
its last valid month, it will be marked as expired in the database.
Change-Id: Ib1b2375f6703f9514d1ae089a387aa029b62d1fb
struct
This change is removing ProjectID from code. Next change will be about
dropping this colum from DB table.
Change-Id: Idb949e2829e2c304a2b6b011259c7cc7667082e1
the test as written fails if it's the last day of the month because
the two tallies that are added span two different periods.
Change-Id: I294397278508035ce640500fc3b1b0ffd82b9b71
projects
Currently, to generate invoices we are listing all projects. This is
problematic because we decided that elements like e.g. coupons should be
calculated per user. We want to have
more user oriented processing. With this change to prepare invoice
elements we are:
* listing Stripe customers
* listing projects for each customer
* processing projects to create project records
* calculating coupons per customer
* calculating credit spending per customer
Change-Id: I48b0063e44e4d301f97c87ecffb342649359968a
WHAT:
1. updated verification page URL in config
2. added list of partnered satellites to config
3. added logic for satellites dropdown on new signup/login pages
WHY:
1. signup/login flow was reworked in tardigrade.io repo (iframe removed, new pages etc.)
2. new config flag was added to check if satellite name matches at least one member of partnered satellites list to redirect user to verification page
3. new pages will have dropdown with partnered satellites list. Appropriate logic was added.
Change-Id: I33399ab66ca31f07b297a433f6b1f41da4cb6e66
Adds monkit tracing for ecrepairer.downloadAndVerifyPiece and
ecrepairer.putPiece so we can get more accurate estimates of node
performance during repair.
Change-Id: Ic05025bf3c493bb3d6f5d325d090c5b7c9e5465d
This will speed up the Put step of repair by not waiting to time out for
a handful of slow nodes, at the expense of a slightly less durable
pointer. It will still repair to the optimal threshold, but not every
node that is selected will end up in the pointer.
Change-Id: I02a0658e3fe6fc0383f26af0f50a065b8b11a651
the initial calculations for the historical values of comp_at_rest
were wrong. because our historical data only included total amounts
as well as compensation for bandwidth, the at rest value was
calculated as
at_rest = total - bandwidth
unfortunately, that calculation did not take surge pricing into
account correctly. the at rest and bandwidth values do not
include surge pricing, but the total that was used did. so what
we actually calculated was
no_surge_at_rest = surge_total - no_surge_bandwidth
which will create a value that is too large. this migration
fixes the calculation for imports that are old enough and
of a non-negligable difference.
Change-Id: I61eb0b670510f6d7fb8fc3de39ba79150fac10eb
..before they are transferred to another node and submitted to the
satellite as successful piece transfers, because if we submit an invalid
signature, the node will be marked as a cheater and disqualified
immediately.
These signatures should have been validated when the piece was
originally stored, but bitrot does happen and needn't be cause for an
immediate DQ.
Change-Id: I8b0ebd5812ea8a2e60766005b7251fbb74ef7857
Some satellites do not have payment configured (ex. Salt Lake, Europe North). In this case the StripeMock is used, which returns nil on invoice and charges List methods. This causes a panic.
https://storjlabs.atlassian.net/browse/SM-978
Change-Id: Iec1b0bfd9b383e6f793d03dd63a3dec60e0fd9f3
Jira issue: https://storjlabs.atlassian.net/browse/USR-719
Invoices now show the amount of used resources and their unit price.
This change also makes proper rounding to the nearest cent in few places
to resolve the "off-by-one-cent" issue observed in few invoices.
Change-Id: I2d70d6916b5cf4a9a9138c99c422f5f4d2deb35b
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration
Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
Every now and then we see the repair error, "piece to add already exists".
With these new logs we should be able to verify if it is due to a change in
the pointer. These logs are only temporary
Change-Id: I029390cc4816668707546df14ed2cfe7ca192b0b
This attempts to add a README.md to help create consistent migrations
that maximize our test coverage and do not include unnecessary
statements.
It also adds a feature to have an `-- OLD DATA --` section as well
as a `-- NEW DATA --` section so that we can fix mistakes made in
previous snapshots (like a row that was forgotten to be added when a
table was created) without editing them going forward.
Change-Id: I28a786f8ef163cae1de1bb08f61af1e1104b0a88
What: As soon as a node passes the vetting criteria (total_audit_count and total_uptime_count
are greater than the configured thresholds), we set vetted_at to the current timestamp.
Why: We may want to use this timestamp in future development to select new vs vetted nodes.
It also allows flexibility in node vetting experiments and allows for better metrics around
vetting times.
Please describe the tests: satellitedb_test: TestUpdateStats and TestBatchUpdateStats make sure vetted_at is set appropriately
Please describe the performance impact: This change does add extra logic to BatchUpdateStats and UpdateStats and
commits another variable to the db (vetted_at), but this should be negligible.
Change-Id: I3de804549b5f1bc359da4935bc859758ceac261d
Most places now need the NodeURL rather than the ID and Address
separately. This simplifies code in multiple places.
Change-Id: I52621d8ca52296a8b5bf7afbc1001cf8bfb44239
Currently node selection cache is biased towards the same subnet. This
implements static node selection for distinct such that it selects with
equal probability subnets rather than id-s.
This is mostly a copy paste + modifications from previous node selection
state.
Change-Id: Ia5c0aaf68e7feca78fbbd7352ad369fcb77c3a05
This allows to seeing logs in the output of the invoice commands.
Existing ensure-stripe-customer commands is moved from the 'reports' to
the new 'billing' root command.
Change-Id: I752c7ab6ca59bfac8e0f174a45d2ab45fc18e467
InvoiceApplyProjectRecords
ListUnapplied method in listing loop was using next offset as starting
point but applyProjectRecords method was changing project record state
to applied and we were missing some records to apply.
Test will be added as a separate change.
Change-Id: Id1ca33eeb66ec7f6ff1f05b45615a8935185568e
By ensuring that they have less randomness it means that they can be
compressed better. Using a timestamp should be a good improvement here.
Change-Id: Ic4dabb53335a744ff1c332dd279f37ae2cd79357
able to cover more testing scenarios
Currently, its hard to implement test suite for payments because
mockpayments is on to high level and we cannot emulate many things e.g.
adding credit card. This change is first to be able to add mock for
Stripe client and do more granular tests.
Change-Id: Ied85d4bd0642debdffe1161657c1e475202e9d23
To avoid including multiple months in a single invoice, we need all
inspector's invoice commands to run in for specific period.
See https://storjlabs.atlassian.net/browse/USR-725
Change-Id: I3637dc189234f02350daca8d897c21765762ea55
There is a subtle problem when one does a cast with `::date`. Observe:
teststorj=# set timezone = 'US/Eastern';
SET
teststorj=# select (timestamp with time zone '2020-02-01 00:00:00+00')::date;
date
------------
2020-01-31
(1 row)
teststorj=# set timezone = 'UTC';
SET
teststorj=# select (timestamp with time zone '2020-02-01 00:00:00+00')::date;
date
------------
2020-02-01
(1 row)
In order to correctly determine the date a timestamp is in, one has to
explicitly pick the time zone that the date truncation should use
otherwise postgres will use whatever setting the client has. These
tests were failing for me locally, because I run my postgres in
the US/Eastern time zone to try to tickle these bugs out. So it
should be `(x at time zone 'UTC')::date` instead of just `x::date`.
Change-Id: I4e9e32d4b53abc6165a4d0474f4702f8b9f801c7
The current satellite config lock code relies on bash scripts and
gnu diff, it must be run as root and hence it typically requires
docker. The old version will be removed at a later date..
I tried for several hours to run directly against cmdSetup() in
cmd/satellite/main.go, to avoid the ctx.Compile() call. I had no
luck.
Change-Id: I0a4888421e743b436d32b6af69d04759d7816751
See https://storjlabs.atlassian.net/browse/SM-752
These changes allow us to change the log level at runtime through a handler off of the debug endpoint.
Examples of changing the log level on storj-sim
To get the current level for the satellite api process:
curl -XGET 'http://127.0.0.1:10009/logging' --header 'Content-Type: text/plain'
To change the log level:
curl -XPUT 'http://127.0.0.1:10009/logging' --header 'Content-Type: text/plain' --data-raw '{"level":"error"}'
Change-Id: I05d164b290929fa06b6d78c01075ee41f8238044
We have 3 types of discounts:
1) Promotional credits/coupons
2) Bonus from depositing STORJ tokens
3) Stripe discounts (e.g. 100% off for Storjlings, 30% off for Early
Adopters, etc.)
So far the discounts were applied in the above order. But because the
Stripe discount is applied on all of the project usage fees, this could
sometimes lead to negative total in the invoice. Especially, if the
Stripe discount is 100% or all of the project fees are covered by
coupons and bonus.
To resolve this issue, before applying promotional coupons and deposit
bonuses, the Stripe discount will be applied first to the project fees.
Change-Id: I5dcbec04ec3a04e7f76b11e0a228ccb3195f2db0
Before:
- Discount from coupon: Promotional credits (limited time -
2 billing periods)
- Discount from credits
After:
- Promotional credits (limited time - 2 billing periods)
- Credits from STORJ deposit bonus
This way we don't mix the terms coupon and credit. And it is clearer
when the credit comes from a deposit bonus.
Change-Id: I4bba76a5501147f9de399eac41c4f157d6bda032
The billing history currently shows the Total amount from the Stripe
invoice. In fact, this value is just the amount deducted from the Stripe
balance. It does not reflect any deduction from promotional coupon or
bonus credits.
This patch adds the deducted amount form the promotional coupons and
bonus credits to the displayed amount in billing history. This way
customers have better understading of the total amount deducted from
their account balance on the satellite.
Change-Id: Ibd7f611a6cea0143a3059f39dd1d9ef21c264d8c
GetNodes returned references to nodes in the immutable state, however
some parts of code expect them to be modified.
Change-Id: I5be1866f95e0dbe062a6b6be60e29f2365c35faa
Also distinguish the purpose for selecting nodes to avoid potential
confusion, what should allow caching and what shouldn't.
Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61
Add a flag that allows us to easily switch disqualification from
suspension mode on or off. A node will only be disqualified from
suspension mode if it has been suspended for longer than the grace
period AND the SuspensionDQEnabled flag is true.
Change-Id: I9e67caa727183cd52ab2042b0a370a1bcaebe792
Currently uploads can cause a lot of IOPS, reduce this by introducing a
in-memory buffer on-top of the file.
Change-Id: I5f4e3e01c0a36258271d180b922107de447bcb59
TestVerifierSlowDownload would sometimes not have enough nodes finish in
the allotted deadline period. This increases the deadline and also does
not assert that exactly 3 have finished. Instead, in keeping with the
purpose of the test, it asserts that the slow download is never counted
as a success and is always counted as a pending audit in the final
report.
Change-Id: I180734fcc4a499420c75164bad6253ed155d87de
CreateTables hasn't been quite true for a while now, rename to
MigrateToLatest to be clearer in it's behavior.
Change-Id: Ida48e95122a5d9b7a814e922d3698e00024a2ba7
The UpdateAddress method use to be used when storage node's checked in with the Satellite, but once the contact service was created this method was no longer used. This PR finally removes it.
Change-Id: Ib3f83c8003269671d97d54f21ee69665fa663f24
time.Now() in a short amount of time can return the exact same value.
Ensure that the test uses times that are distinct.
Change-Id: Ia653ce0af4bfcf7b5da133a9cf98b823033d9592
Before the deleter would close its done channel once, so if additional
tests shared a storagenode, even if not in parallel, the later waits
would not work properly. This fixes that problem.
Change-Id: I7dcacf6699cef7c2c2948ba0f4369ef520601bf5
CustomerRepositoryList relied on created at time for the output sorting,
however the granularity of the timer may cause multiple customers
having the same "created at" time.
Add a sleep so that every customer does get a unique time.
Change-Id: If2923174f304fa5f41260d500f6139e3fa7c3ba5
Delay of 100ms could happen due to other things happening on the test
server. Increase the time to 1s.
Change-Id: I2c7c21f966101771633d73e84cf9850d28089e71
Sometimes nodes who have gracefully exited will still be holding pieces
according to the satellite. This has some unintended side effects
currently, such as nodes getting disqualified after having successfully
exited.
* When the audit reporter attempts to update node stats, do not update
stats (alpha, beta, suspension, disqualification) if the node has
finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated)
* Treat gracefully exited nodes as "not reputable" so that the repairer
and checker do not count them as healthy (overlay/statdb_test.go
TestKnownUnreliableOrOffline, repair/repair_test.go
TestRepairGracefullyExited)
Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272
When running testplanet tests, mark storagenode peer PieceDeleter as in
testing mode so that you don't have to do it on each test.
Change-Id: I2592e02c63f8bcc9152ecf436bac4e798b08bccf
Currently Cockroach isn't performant for concurrent database setup and
tear-down. Instead of a single instance allow setting multiple potential
connection strings and let the tests pick one connection string
randomly.
This improves test duration by ~10 minutes.
While we are at significantly changing how pgtest works, introduce
helper PickPostgres and PickCockroach for selecting the database to
reduce code duplications in multiple places.
Change-Id: I8ad171d5c4c8a4fc081ec2ae9bdd0cc948a80619
In cases like the segment reaper script connecting to the metainfodb,
we don't want a db migration to happen automatically when we call
metainfo.NewStore. This adds MigrateToLatest method for postgreskv
and cockroackv, and calls MigrateToLatest in places where NewStore used
to create tables.
Change-Id: I682d0f26d609af0601dfdb32a24866cdf5d32a7e
There was a race in the test code for piece deleter, which made it
possible to broadcast on the condition variable before anyone was
waiting. This change fixes that and has Wait take a context so it times
out with the context.
Change-Id: Ia4f77a7b7d2287d5ab1d7ba541caeb1ba036dba3
A/B indicates that B is a subtest of A, however in this case they
represent a configuration of the test, not a subtest.
Change-Id: I64eed5d5bcb12759e54fe4b5373f8e88488e50f7
Added a per IP rate limiter to the console web.
Cleaned up password check to leak less bcyrpt info.
Change-Id: I3c882978bd8de3ee9428cb6434a41ab2fc405fb2
To improve delete performance, we want to process deletes asynchronously
once the message has been received from the satellite. This change makes
it so that storagenodes will send the delete request to a piece Deleter,
which will process a "best-effort" delete asynchronously and return a
success message to the satellite.
There is a configurable number of max delete workers and a max delete
queue size.
Change-Id: I016b68031f9065a9b09224f161b6783e18cf21e5
Update unknown_audit_reputation_alpha and unknown_audit_reputation_beta.
Add test to verify that BatchUpdateStats properly modifies unknown audit
alpha/beta
Change-Id: I0d5f9cac96a99f64905cf575b772402db0756a9d
If a node is suspended and receives an unknown or failing audit,
disqualify them if the grace period (default 1w in production) has
passed.
Migrate the nodes table so any node that is currently suspended gets
unsuspended when the satellite starts up.
Change-Id: I7b81c68026f823417faa0bf5e5cb5e67c7156b82
Currently it was possible that PopAll returns 1010 items, then
makes one RPC call with 1000 items, then RPC call 10 items. Meanwhile,
there have been added 500 new items added to the queue.
This change ensures that we pull items from the queue early and
try to make rpc batches as large as possible.
Change-Id: I1a30dde9164c2ff7b90c906a9544593c4f1cf0e9
This reverts commit 105dc7acc6.
Reason for revert: Recent changes to the Postgres query plan seems to want to use this index now. Reverting until we have time to analyze what's happening.
Change-Id: I74b4b5a8f15c3850d8a958a29f51dbc80e7c282c
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue
Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
Replace most of old libuplink usages in testplanet. 100% migration will
be possible when we will be able to implement UploadWithClientConfig
with new libuplink.
Change-Id: I432d7d4917c7b67d46a058abd0a2a6a13f565ac4
Automatically attach attribution information to bucket during
BeginObject or CreateBucket when the UserAgent is set.
Change-Id: I405cb26c5a2f7394b30e3f2cf5d2214c8781eb8b
We want to avoid net/http dependency in errs2 package, hence we removed
http.ErrServerClosed from IgnoreCanceled and IsCanceled check. Now we
need to add that check explicitly to every http endpoint.
Change-Id: I62b1cc0a0a2d3b43301d713a7951e5022145f88f
This adds support for observers to join the loop together. This allows
to ensure that when multiple observers join, they will be part of the
same loop iteration.
Change-Id: Ie887d4cedfb074b65c782690a2c09c1704f56dfe
During testing it's possible to get into a scenario where all nodes are
offline and list of requests is empty.
Change-Id: I271c0ca2c72009244df13e8bc1441fcd5f3da9e0
* satellite: update log levels
Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f
* log version on start up for every service
Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de
* Use monkit for tracking failed ip resolutions
Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2
* fix compile errors
Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf
* add request limit value to storage node rpc err
Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5
* we cant track storage node ids in monkit metrics so lets use logging to track that for expired orders
Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887
* fix build errs
Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47
Currently storj-sim relies on the log lines to be exactly the same,
when they change it cannot find the necessary information from log.
Change-Id: Ia039915ef3375a7cf60f107b2c05c958de15b6d5
Currently ListV2 loaded the whole data into memory, even when all the
data wasn't being used, using up more memory than needed.
Change-Id: I5846d979344729b447c108a6cc9f4227229ec981
Alpha=1 and beta=0 are the expected first values for any alpha/beta
reputation system we are using in the codebase. So we are removing the
configurability of these values.
Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3
We want to increase our throughput for downtime estimation. This commit
adds the ability to reach out to multiple nodes concurrently for downtime
estimation. The number of concurrent routines is determined by a new config
flag, EstimationConcurrencyLimit. It also increases the default
EstimationBatchSize to 1000.
Change-Id: I800ce7ec1035885afa194c3c3f64eedd4f6f61eb
until we do a good job of cleaning them up, we should at least
not charge or pay people for them. nodes already locally delete
expired segments.
subsumes the tests in 1112.
Change-Id: I5961185764e02f6136b3231b44ecc75a9a8832c9
Whenever the node's reputation is updated, if its unknown audit
reputation is below the suspension threshold, its suspension field
is set to the current time. This could overwrite the previous
"suspendedAt" value resulting a node that never reaches the end of
its suspension.
Also log whenever a node is disqualified or its suspension status
changes
Change-Id: I5e8c8f1c46f66d79cb279b5b16a84fe03f533deb
Reduce the number of non-methods to reduce funcs in the namespace also
combine a func to slightly condense the code more.
Change-Id: Ifbe728eb8c8ca4c981df648decd259c2097b6b40
* Add migration to storagenode reputation table to add suspended
timestamp
* Send suspended info to storagenode from satellite nodestats endpoint
* Add suspended status to storagenode api
* Add an indicator on the storagenode dashboard informing operator of
the satellites the node is suspended on
Change-Id: Ie3669f6069cc0258ba76ec99d17006e1b5fd9c8a
potential encryption overhead.
This is the same approach we have for validating remote segment size.
https://storjlabs.atlassian.net/browse/USR-619
Change-Id: I2597ee734313a3068fd986001680bbedbf1bed2a
Adds a test to make sure that the correct amount of total nodes,
reputable nodes, and new nodes are returned by SelectStorageNodes
in different cases.
Change-Id: I0939159600afde8a46c35735f1edf0576fcdb4cd
This adds new endpoint /api/user/{user-email} which allows to get the
projects where the user is a member.
It also moves existing endpoint:
/project/{projectid}/limit -> /api/project/{projectid}/limit
To avoid future conflicts for displaying pages.
Change-Id: I5efe3e1c8f79894c136f92ed815f635a34ba6f98
size with BeginObject
Such solution will add one round trip to satellite during upload so for
now we are reverting this until we will have solution for this.
Change-Id: Ic2d826448ab7b0318cd6922df05deee9167cf2f0
We have been using the SQL expression `name='(*Verifier).Verify' AND
error_name='not enough shares for successful audit'` thus far to detect
cases of this problem and alert on them. Unfortunately, since this
rarely (hopefully never) happens, influxdb has no data for most of the
auditor instances, and when it has no data for a time series, it returns
no columns either. This makes Redash upset when it tries to perform a
query for an alert and can't find the column whose value it expects to
check.
This change should make it so zero values are reported when the problem
has not happened, and higher values when it has.
Change-Id: I79e5e000f879678b661dac88caae1e2915b39ab1
there are a subset of storagenodes hammering the satellite with
expired orders. if we check for expiration first, we don't have
to do a bunch of pointless signature verification. since a && b
is equal to b && a, we can order these checks in any way we want
and have it still be correct.
Change-Id: I6ffc8025c8b0d54949a1daf5f5ea1fed9e213372
BeginObject response
We want to control inline segment size and segment size on satellite
side. We need to return such information to uplink like with redundancy
scheme.
Change-Id: If04b0a45a2757a01c0cc046432c115f475e9323c
all permissions
Without read and list permissions BeginObjectDelete won't return error
if occurs. This was breaking Batch processing because there was
assumption that without error response will be always not nil.
https://storjlabs.atlassian.net/browse/SM-590
Change-Id: I0fc9539e429110a660eb28725b266d5e4771d198
uuid.UUID implements driver.Value so it can be directly used as a
scannable result.
Replace uses of dbutil.BytesToUUID with uuid.FromBytes.
Change-Id: I51a670185ceb3cc2199d5aa2b76bc3fc191ca8fe
Instead of providing the database from outside to testplanet create it
inside and then allow wrapping and modifying it. This is more convenient
to use.
Change-Id: I9b8f69e6e0a19ff984b4e2bfe927c9100c77bc6c
Add flag to satellite repairer, "InMemoryRepair" that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read from in the
upload phase of repair.
This should help with handling high repair traffic on satellites that
cannot afford to spend 64mb of memory per repair worker.
Updates tests to test repair for both in memory and to disk.
Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194
we still need to come up with a better plan to get storage nodes
to stop doing this, but in the meantime, we know this is happening,
just stop logging it and keep some stats instead.
Change-Id: Icb6bcba275e0e955c54b1a90da2b37219fff2349
storagenodes have like 10 or more databases. without this
tag they all get sent as the same value, stomping on each
other.
Change-Id: Ib12019684d6ea8f2a5b83df584056dfa79e3c4b3
This will help to determine how many grpc calls are made to the
satellite.
Also remove the grpc funcs that have been added to upstream.
Change-Id: I91878f4fd10f9bfe601c94222c102eaaf4d35963
* debug
* traces
* cfgstruct
* process
Package `storj/private/version` will be removed as a separate change.
Change-Id: Iadc40faa782e6225513b28218952f02d9c240a9f
The goal of this change is to improve the storagenode_storage_tallies table by removing the unneeded id column that is not being used but only taking up space, and also to add an index on a different column that needs it. Removing and adding a column seems simple, but ended up being more complicated because of some cockroachdb limitations.
The cockroachdb limitation when trying to remove a column from a table and create a new primary key are:
1. only allows primary key creation at table creation time (docs: https://www.cockroachlabs.com/docs/stable/primary-key.html)
2. table drop or rename is performed async and cannot be done in a transaction (issue: https://github.com/cockroachdb/cockroach/issues/12123, https://github.com/cockroachdb/cockroach/issues/22868)
To address these differences between cockroachdb and Postgres, this PR performs different migrations for the two database. The Postgres migration is straight forward and what you would expect, but the cockroach migration has two main changes:
1. To change a primary key, use the recommended process from the cockroachdb docs to create a new table with the new primary key you want and then migrate the data.
2. In order to do 1, we needed to do the new table renaming in a separate transaction from the data migration.
Ref: SM-65
Change-Id: Idc9aee3ab57aa4d5570e3d2980afea853cd966bf
This implements a service for pointer verification. This makes the
slightly clearer, because it's not part of metainfo.
It also adds a peer identity cache which reduces database calls and peer
identity decoding.
Change-Id: I45da40460d579c6f5fd74c69bccea215157aafda
step 1 in https://review.dev.storj.io/c/storj/uplink/+/1236
Now the old libuplink uses the temporary DeleteBucketReturnDeleted and
DeleteObjectReturnDeleted methods. This way, in the next step, we will
be able to change the DeleteBucket and DeleteObject methods to return
the deleted bucket/object.
Change-Id: I2e638be1960bca6ce1456c92849fcdd6d93e5252
by doing an indexed anti-join we're able to reduce the time to
select the pending orders by over 10x on postgres. this should
help us process pending orders much more quickly.
it probably won't do as good a job on cockroach because it does
not do an indexed anti-join and instead does a hash join after
scanning the entire consumed serials table. we should either
remove orders entirely or try to make that more efficient
when necessary.
Change-Id: I8ca0535acd21c51e74955b24c9b86d20e4f2ff9c
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair
This commit also removes unused overlay functionality.
Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).
Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states
Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
Initial change for checking bucket existence on satellite side for
requests like BeginObject and ListObjects. This is simple implementation
that is just checking bucket in DB but should be improved in future to
avoid DB calls as much as possible.
Part of https://storjlabs.atlassian.net/browse/USR-365
Change-Id: I9076acddc44d7dbfa7612a1c24a007de01621583
This adds a piece deletion handler that has debounce for failed dialing
and batching multiple jobs into a single request.
Change-Id: If64021bebb2faae7f3e6bdcceef705aed41e7d7b
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.
Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
To handle concurrent deletion requests we need to combine them into a
single request.
To implement this we introduces few concurrency ideas:
* Combiner, which takes a node id and a Job and handles combining
multiple requests to a single batch.
* Job, which represents deleting of multiple piece ids with a
notification mechanism to the caller.
* Queue, which provides communication from Combiner to Handler.
It can limit the number of requests per work queue.
* Handler, which takes an active Queue and processes it until it has
consumed all the jobs.
It can provide limits to handling concurrency.
Change-Id: I3299325534abad4bae66969ffa16c6ed95d5574f
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address
This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base
Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
We missed this in the migration that added the num_healthy_pieces
column. It exists in dbx, but not on the actual satellite table.
Change-Id: If16b5ec2325d56406250298531b3285215188bf3
Metainfo method validateAuth checks things like API key, user permission
and rate limit but at the end all errors were returned as
rpcstatus.Unauthenticated.
Old Metainfo is not touched to avoid backward compatibility issues.
Change-Id: I78eb276210fc50151da58a5c84e13ecd0961da29
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).
Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.
This change also better differentiates some error cases from Repair()
for monitoring purposes.
Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
New API has limited number of options to configure at the moment. We
should remove unused flags from Uplink CLI and add if needed in the
future.
Change-Id: Icf3f3dadd43cb61a3b408b02d0762aef34425dbf
In production, the satellite is overriding the default repair threshold
(35) to a higher value (52). In some places in the checker and
irreparable processes, the repair threshold on the redundancy scheme is
used in place of the override value. This fixes those cases.
Change-Id: Ie7387217d9fb3886f050b5e5b67be51f276196de
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.
Of the 14 fields that lacked time zones, only the 3 named 'interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE')
Field with good data are migrated by adding AT TIME ZONE 'UTC'
Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.
https://storjlabs.atlassian.net/browse/SM-104
Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce
- Previously, checkSegmentAltered only checked for segments that were replaced
but we want to detect all changes to a segment that occurred while an audit was being conducted.
- Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified
segments were not being removed from containment mode.
- Filled in gaps in reverify testing to ensure nodes are properly removed from containment.
Change-Id: Icd96d369278987200fd28581395725438972b292
The billing tests were flaky because some assertions ran before the
storage nodes finish their work.
A new helper function in testplanet has been added to allow to wait for
storage nodes endpoints to finish their work. This function now it's
used in the billing tests for avoiding their flakiness.
This commit closes the ticket:
https://storjlabs.atlassian.net/browse/SM-403
A part of fixing other billing tests flakiness.
Change-Id: Iacb750af435f515c04b1e1d3510a218d184c9abc
uplinks
New libuplink is not storing encryption values in with bucket but old
uplinks are using those values for configuration. If bucket was created
with new libuplink we will send back satellite defaults.
Change-Id: Ie1bf3682847e07b302270b4c4bf1a7219f4bf011
Submit an order limit with a high amount but the order has a low amount of traffic.
Make sure the order amount is used for billing.
Change-Id: I6b6ae26e9b8896f4a3acf530b2f48510b6df89cc
On satellite, remove all references to free_bandwidth column in nodes table.
On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated.
Protobuf message, NodeCapacity, is left intact for backwards compatibility.
Once this is released to all satellites, we can drop the column from the DB.
Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838
Add a test for checking that the billing:
* it doesn't include upload traffic
* it includes download traffic
Change-Id: I1655c15c1fad642f77dd210f2014b2586ae10104
This change is a special case for batch processing. If in batch request
CommitSegment and CommitObject are one after another we can execute
these requests as one. This will avoid current logic where we are saving
pointer for CommitSegment and later we are deleting this pointer and
saving it once again as under last segment path for CommitObject.
This change should handle issue we have in older uplinks with incorrect
order of storing pointers.
Change-Id: I86514c95df169e6fbc91b52e5117472cae70cb8b
these tables are used in future commits with respect to the new
storagenode payments code. if we create them now, it will make
backfilling them with historical data easier.
Change-Id: I3c08c9770ec5b2baa38b4f2fd18c2f07746a61c2
Remove the check around consuming an expired serial so that we
have more time to run the migration. It does open a small race
of double spends for entries already counted and then added to
the pending queue right around when they're going to expire and
the consumed serials have already been removed, but that should
be rare if we keep the pending queue empty.
Change-Id: I000b15979b09c67751281ff675ea6c81fc9d22dc
common/pb moved grpc to a separate package common/pb/pbgrpc.
This updates this repository to use it.
Change-Id: I2de2a190688871cf9cb61f7ea511f8a01e264e4e
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.
The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.
We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from on hour to allow us time to
try a wider variety of segments when the repair queue is very large.
Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
This new repair timeout (configured as TotalTimeout) will include both
the time to download pieces and the time to upload pieces, as well as
the time to pop the segment from the repair queue.
This is a move from Github PR #3645.
Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8
This change adds two new tables to process orders as fast as we used
to but in an asynchronous manner and with hopefully less storage
usage. This should help scale on cockroach, but limits us to one
worker. It lays the groundwork for the order processing pipeline to
be queue rather than database driven.
For more details, see the added fast billing changes blueprint.
It also fixes the orders db so that all the timestamps that are
passed to columns that do not contain a time zone are converted to
UTC at the last possible opportunity, making it less likely to use
the APIs incorrectly. We really should migrate to include timezones
on all of our timestamp columns.
Change-Id: Ibfda8e7a3d5972b7798fb61b31ff56419c64ea35
we want to return back to the user as quick as possible but also keep
deleting remaining pieces on the storagenodes
Change-Id: I04e9e7a80b17a8c474c841cceae02bb21d2e796f
Curently, storage nodes only report their capacity to satellites
once per hour. If a node fills up, it will fail all uploads until
the next contact cycle begins. With these changes, at the end of an
upload we check whether the MinimumDiskSpace threshold has been
passed. If so, trigger the monitor chore to update the node's
capacity, then trigger the contact chore to report the new
capacity to the satellites
Change-Id: Ie6aadaade1e2c12c87e03f8ff9059a50121380a0
Enhance the documentation of the UseSerialNumber method (interface and
implementation) and add several missing dots in doc comments of the
methods of the same interface and implementation.
Change-Id: I792cd344f0d2542e060fa2ec288b71231cae69de
at the end of tally iteration, in order to set the new live
accounting totals, we were iterating over all live accounting
projects. We found a bug with this when running storj-sim. If
we restarted the satellite live accounting would be cleared
because storj-sim was running the live accounting redis instance.
Since live accounting was cleared, at the end of tally, even if
it found data in projects, we would not update the live accounting
totals because we were iterating over the projects from live
accounting to do so. We now iterate over projects found from tally
in order to update live accounting
We also found that if a user deleted everything from their project,
tally would not find it and the live accounting would not be updated.
For this reason, we merge live accounting projects into tally projects
Change-Id: If0726ba0c7b692d69f42c5806e6c0f47eecccb73
rationale: if GC kills the satellite, it would be nice to make
it through a repair checker sweep first
Change-Id: Id56171dc8e13940cfb6481e36a910bad077a01ed
Trace the calls to DeletePiecesService.DeletePieces method and add
metrics for having statistics about the rate that specific storage node
is dialed and duration time spent on dialing storage nodes.
These statistics will help us to find out if we should implement
connections queues to storage node for reducing the deletion time in cae
that we see that we're spending too much time dialing frequent storage
nodes.
Ticket: https://storjlabs.atlassian.net/browse/SM-85
Change-Id: I9601676c3a8ad96c73c93833145929e4817755e2