add new columns `offline_suspended` and `under_review` to nodes table.
`unknown_audit_suspended` is a new column which will replace `suspended`
Change-Id: I22ddeb338ea0ff63f14332a7ebd0f3e9e4c06cdc
This ensures that rows are closed to avoid leaks.
Also verifies that Err() is called, to ensure that no
error is left behind.
Change-Id: Idd1bec9bf479f40021da67b2c80ce83033149469
To avoid including multiple months in a single invoice, we need all
inspector's invoice commands to run in for specific period.
See https://storjlabs.atlassian.net/browse/USR-725
Change-Id: I3637dc189234f02350daca8d897c21765762ea55
This reverts commit 105dc7acc6.
Reason for revert: Recent changes to the Postgres query plan seems to want to use this index now. Reverting until we have time to analyze what's happening.
Change-Id: I74b4b5a8f15c3850d8a958a29f51dbc80e7c282c
The goal of this change is to improve the storagenode_storage_tallies table by removing the unneeded id column that is not being used but only taking up space, and also to add an index on a different column that needs it. Removing and adding a column seems simple, but ended up being more complicated because of some cockroachdb limitations.
The cockroachdb limitation when trying to remove a column from a table and create a new primary key are:
1. only allows primary key creation at table creation time (docs: https://www.cockroachlabs.com/docs/stable/primary-key.html)
2. table drop or rename is performed async and cannot be done in a transaction (issue: https://github.com/cockroachdb/cockroach/issues/12123, https://github.com/cockroachdb/cockroach/issues/22868)
To address these differences between cockroachdb and Postgres, this PR performs different migrations for the two database. The Postgres migration is straight forward and what you would expect, but the cockroach migration has two main changes:
1. To change a primary key, use the recommended process from the cockroachdb docs to create a new table with the new primary key you want and then migrate the data.
2. In order to do 1, we needed to do the new table renaming in a separate transaction from the data migration.
Ref: SM-65
Change-Id: Idc9aee3ab57aa4d5570e3d2980afea853cd966bf
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair
This commit also removes unused overlay functionality.
Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).
Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states
Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address
This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base
Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.
Of the 14 fields that lacked time zones, only the 3 named 'interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE')
Field with good data are migrated by adding AT TIME ZONE 'UTC'
Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.
https://storjlabs.atlassian.net/browse/SM-104
Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce
these tables are used in future commits with respect to the new
storagenode payments code. if we create them now, it will make
backfilling them with historical data easier.
Change-Id: I3c08c9770ec5b2baa38b4f2fd18c2f07746a61c2
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.
The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.
We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from on hour to allow us time to
try a wider variety of segments when the repair queue is very large.
Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
This change adds two new tables to process orders as fast as we used
to but in an asynchronous manner and with hopefully less storage
usage. This should help scale on cockroach, but limits us to one
worker. It lays the groundwork for the order processing pipeline to
be queue rather than database driven.
For more details, see the added fast billing changes blueprint.
It also fixes the orders db so that all the timestamps that are
passed to columns that do not contain a time zone are converted to
UTC at the last possible opportunity, making it less likely to use
the APIs incorrectly. We really should migrate to include timezones
on all of our timestamp columns.
Change-Id: Ibfda8e7a3d5972b7798fb61b31ff56419c64ea35
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
it was noticed that if you had a long lived transaction A that
was blocking some other transaction B and A was being aborted
due to retriable errors, then transaction B was never given
priority. this was due to using savepoints to do lightweight
retries.
this behavior was problematic becaue we had some queries blocked
for over 16 hours, so this commit addresses the issue with two
prongs:
1. bound the amount of time we will retry a transaction
2. create new transactions when a retry is needed
the first ensures that we never wait for 16 hours, and the value
chosen is 10 minutes. that should be long enough for an ample
amount of retries for small queries, and huge queries probably
shouldn't be retried, even if possible: it's more preferrable to
find a way to make them smaller.
the second ensures that even in the case of retries, queries that
are blocked on the aborted transaction gain priority to run.
between those two changes, the maximum stall time due to retries
should be bounded to around 10 minutes.
Change-Id: Icf898501ef505a89738820a3fae2580988f9f5f4
before dbx would generate a compilcated blob of conditions that
encoded a row comparison, which only optimized to an index seek
on cockroachdb. this means that sqlite and postgres both had
quadratic behavior on paged queries of this form. instead, use
the implicit row construction feature supported in all of the
databases to do paged support so that they all optimize well.
Change-Id: Iac8703929ba2a59ee3ffa619b916d12663422887
Limits how many times metainfo APIs can be called per second by project ID. If limit is exceeded, the API will return Unauthorized/Too Many requests.
Limit per second and the size of the limiter cache per project are configurable, as well as whether the limiter is enabled.
Tests added/updated for the new rate_limit field in projects table.
Tests added for exceeding limits and disableing limiter.
Change-Id: Ic8ad102de3b690a475809d4f684156d5715f20fa
warning: databases migrated to version 77 before this commit
is merged must be manually re-migrated. this should not be a
problem for anything but staging databases.
Change-Id: Ie1631c48379472352014183ee43f1465e22200f7
this commit introduces the reported_serials table. its purpose is
to allow for blind writes into it as nodes report in so that we have
minimal contention. in order to continue to accurately account for
used bandwidth, though, we cannot immediately add the settled amount.
if we did, we would have to give up on blind writes.
the table's primary key is structured precisely so that we can quickly
find expired orders and so that we maximally benefit from rocksdb
path prefix compression. we do this by rounding the expires at time
forward to the next day, effectively giving us storagenode petnames
for free. and since there's no secondary index or foreign key
constraints, this design should use significantly less space than
the current used_serials table while also reducing contention.
after inserting the orders into the table, we have a chore that
periodically consumes all of the expired orders in it and inserts
them into the existing rollups tables. this is as if we changed
the nodes to report as the order expired rather than as soon as
possible, so the belief in correctness of the refactor is higher.
since we are able to process large batches of orders (typically
a day's worth), we can use the code to maximally batch inserts into
the rollup tables to make inserts as friendly as possible to
cockroach.
Change-Id: I25d609ca2679b8331979184f16c6d46d4f74c1a6
everyone was importing it as dbx anyway. why should it be
named satellitedb? so yeah just pass the "-p dbx" flag.
Change-Id: I5efa669f4f00f196b38a9acd0d402009475a936f
This reverts commit 8e242cd012.
Revert because lib/pq has known issues with context cancellation.
These issues need to be resolved before these changes can be merged.
Change-Id: I160af51dbc2d67c5449aafa406a403e5367bb555
this will allow for some nice runtime analysis down the road.
also, this allows for wrapping database handles in a way that
can interact with these contexts
requires https://review.dev.storj.io/c/storj/dbx/+/514
Change-Id: Ib087b7cd73296dd2c1e0331314da34d861f61d2b
When an uplink requests an upload or download from the satellite we are trackig the
allocated bandwidth twice. The value in bucket_bandwidth_rollups is used
for project limits but the value in storagenode_bandwidth_rollups is not
used at all. We can increase the performance by removing it. Uplinks
will get a faster response from the satellite.
Change-Id: Icccd41f94107ef34668f30f99bf5f728c384b07e
turns out portable sed is hard: it has to work with both
linux and bsd sed, etc. instead, use a really really basic
bash script and a temporary file. this should be much less
likely to cause issues on a wide range of machines.
Change-Id: Ia759789fb52aa1ee3361426bb6c02ed4eac3d23a
overlay.GetOfflineNodesLimited
We only care about node ID, address, and last contact success/failure
from the downtime service, so the overlay should only return these
values for the downtime-specific queries.
Change-Id: I08a6ecfdd2a12b82cae62e87d6adeab53975bfce
* satellitedb/certDB: refactors of the node certificate storage DB table
The existing implementation doesnt allow to store the complete certificate chain of uplinkIDs or storagenodeIDs, so the current table is dropped and new table will be added which addresses the storage and retrieval of certificates
pkg/identity: fixes spelling mistakes that I missed on PR#2754
Fixes V3-1992/V3-2388
* add default offer for offers table
* fix migration test
* Trigger Jenkins
* set the default value to be correct type
* skip soon will deleted test
* fix test data
* add orderby for ListAll
* change durations, redeemable cap to be a nullable field
* remove unecessary code
* organize offers
* revert changes to go.mod and go.sum
* change OfferStatus enums back to original
* revert modified auto-gen files
* don't render empty row if offers is empty
* change return val of ListAll to Offers
* fix build
* add method to check for empty offer when rendering template
* fix typo
* fix lint and typos
* lean out IsEmpty
* dont use named return vals
* better clarify offer statuses
* change back order of setting offer.Status
* lint
* satellite/marketingweb: allow disabling rewards (#2392)
* implement handler for stop offer endpoint
* use proper text and fix data-target for free-credit stop modal
* add db interface and methods, add sa metainfo endpoints and svc
* add bucket metainfo svc funcs
* add sadb bucekts
* bucket list gets all buckets
* filter buckets list on macaroon restrictions
* update pb cipher suite to be enum
* add conversion funcs
* updates per comments
* bucket settings should say default
* add direction to list buckets, add tests
* fix test bucket names
* lint err
* only support forward direction
* add comments
* minor refactoring
* make sure list up to limit
* update test
* update protolock file
* fix lint
* change per PR
* add bucket metadata table in SA masterDB
* fix indentation
* update db model per CR comments
* update testdata
* add missing field on sql testdata
* fix args to testdata
* unique bucket name
* fix fkey constraint for test
* fix one too many commas
* update timestamp type
* Trigger Jenkins
* Trigger Jenkins yet again
* satellite/satellitedb: Alter nodes disqualification column
Change the type of the 'disqualification' column of the nodes table from
boolean to timestamp.
* overlay/cache: Change Disqualified field type
Change the Disqualified field type the NodeDossier struct type from bool
to time.Time to match with the disqualified type used by the DB layer.
* satellite/satellitedb: Update queries uses disqualified
Update the queries which uses the disqualified column due to the column
type has been changed from boolean to nullable timestamp.
* docs/design: Update disqualification due impl changes
Update the disqualification design document to contain the architectural
change required to be able to restore unfair disqualified nodes in case
of an unexpected cause (bug, mistake, hard network disconnection, etc.).
* add user credits table
* change primary key, change type for credit_type, and change relation kind of foreign keys from cascade to restrict
* modify table and query methods
* modify schema
* add dbx queries
* add migration file
* add orderby to read available credit entries
* adds model to satellite dbx
* cleans up model spacing
* generated golang from dbx
* added migration steps
* Added testdata
* changed node_id -> bucket_id
* adds -- NEW DATA -- to testdata
* more testdata changes
* adds -- NEW DATA -- line
* dbx makes the table plural
* missed a singular value_attribution
* restart jenkins
* Update satellitedb.dbx
* adjust to PR comments
* autogenerated dbx models
* restart jenkins
* init marketing service
Fix linting error
Create offerdb implementation
Create offers service
Add update method
Create offer table and migration
Fix linting error
fix conflicts
Insert new data
Change duration to have clear indication to be based on days
add error wrapper
Change from using uuid to int for id field
* Create Marketing service
* make error virable name more readable
* add condition in update service method to check offer status
* generate lock file
Change get to listAllOffers
* Add method for getting current offer
wip
* add check for expires_at in update method
* Fix conflicts
* add copyright header
* Fix linting error
* only allow update to active offers
* add isDefault argument to GetCurrent
* Update lock file
* add migration file
* finish migrate for adding credit_in_cents for both award and invitee
* save 100 years as expiration date for default offers
* create crud test for offers
* add GetCurrent test
* modify doc
* Fix GetCurrent to work with default offer
* fix linting issue
* add more tests and address feedbacks
* fix migration file
* add type column back to match with mockup design
* add type column back to match with mockup design
* move doc changes to new pr
* add comments
* change GetCurrent to GetCurrentByType
* fix typo
What: Changes to support custom usage limit for the project. With this implementation by default project usage limit is taken from configuration flag. If project DB field usage_limit will be set to value larger than 0 it will become custom usage limit and we will be used to verify is limit was exceeded.
Whats changed:
usage_limit (bigint) field added to projects table (with migration)
things related to project usage moved from metainfo endpoint to project usage type
accounting.ProjectAccounting extended with GetProjectUsageLimits() method
Why: We need to have different usage limits per project. https://storjlabs.atlassian.net/browse/V3-1814
* add last_ip field to dbx model node, generate dbx
* add last_ip to node proto, generate pb
* migrate
* resolve address in transport.DialNode, update lastIp in cache.UpdateAddress
* use net.SplitHostPort to isolate host address from port
* define DistinctIPs flag
* add test for GetIP
* select last_ip when querying for nodes
* if distinctIPs flag == true, query for nodes with distinct IPs
* some basic tests
* change last_ip to field 14 in proto
* remove comments
* check err
* change distinctIPs to distinctIP
* exclude IPs from newNodes in query for reputable nodes
* add index on last_ip
* only add to excludedIPs if flag is true
* test half new nodes returns distinct IPs
* fix alignment
* add test
* rework ip filter query, add retry logic, add switch for database driver
* add retry to SelectNewNodes
* change discovery intervals so IPs don't get overwritten
* remove TestGetIP
* edit updating node stats in test
* split exclude into nodeIDs and IPs
* separate non-distinct IP query into other function
* trigger checks
* remove else block
* Initial Webserver Draft for Version Controlling
* Rename type to avoid confusion
* Move Function Calls into Version Package
* Fix Linting and Language Typos
* Fix Linting and Spelling Mistakes
* Include Copyright
* Include Copyright
* Adjust Version-Control Server to return list of Versions
* Linting
* Improve Request Handling and Readability
* Add Configuration File Option
Add Systemd Service file
* Add Logging to File
* Smaller Changes
* Add Semantic Versioning and refuses outdated Software from Startup (#1612)
* implements internal Semantic Version library
* adds version logging + reporting to process
* Advance SemVer struct for easier handling
* Add Accepted Version Store
* Fix Function
* Restructure
* Type Conversion
* Handle Version String properly
* Add Note about array index
* Set temporary Default Version
* Add Copyright
* Adding Version to Dashboard
* Adding Version Info Log
* Renaming and adding CheckerProcess
* Iteration Sync
* Iteration V2
* linting
* made LogAndReportVersion a go routine
* Refactor to Go Routine
* Add Context to Go Routine and allow Operation if Lookup to Control Server fails
* Handle Unmarshal properly
* Linting
* Relocate Version Checks
* Relocating Version Check and specified default Version for now
* Linting Error Prevention
* Refuse Startup on outdated Version
* Add Startup Check Function
* Straighten Logging
* Dont force Shutdown if --dev flag is set
* Create full Service/Peer Structure for ControlServer
* Linting
* Straighting Naming
* Finish VersionControl Service Layout
* Improve Error Handling
* Change Listening Address
* Move Checker Function
* Remove VersionControl Peer
* Linting
* Linting
* Create VersionClient Service
* Renaming
* Add Version Client to Peer Definitions
* Linting and Renaming
* Linting
* Remove Transport Checks for now
* Move to Client Side Flag
* Remove check
* Linting
* Transport Client Version Intro
* Adding Version Client to Transport Client
* Add missing parameter
* Adding Version Check, to set Allowed = true
* Set Default to true, testing
* Restructuring Code
* Uplink Changes
* Add more proper Defaults
* Renaming of Version struct
* Dont pass Service use Pointer
* Set Defaults for Versioning Checks
* Put HTTP Server in go routine
* Add Versioncontrol to Storj-Sim
* Testplanet Fixes
* Linting
* Add Error Handling and new Server Struct
* Move Lock slightly
* Reduce Race Potentials
* Remove unnecessary files
* Linting
* Add Proper Transport Handling
* small fixes
* add fence for allowed check
* Add Startup Version Check and Service Naming
* make errormessage private
* Add Comments about VersionedClient
* Linting
* Remove Checks that refuse outgoing connections
* Remove release cmd
* Add Release Script
* Linting
* Update to use correct Values
* Change Timestamp handling
* Adding Protobuf changes back in
* Adding SatelliteDB Changes and adding Storj Node Version to PB
* Add Migration Table
* Add Default Stats for Creation
* Move to BigInt
* Proper SQL Migration
* Ensure minimum Version is passed to the node selection
* Linting...
* Remove VersionedClient and adjust smaller changes from prior merge
* Linting
* Fix PB Message Handling and Query for Node Selection
* some future-proofing type changes
Change-Id: I3cb5018dcccdbc9739fe004d859065992720caaf
* fix a compiler error
Change-Id: If66bb92d8b98e31cd618ecec9c6448ab9b037fa5
* Comment on Constant for Overlay
* Remove NOT NULL and add epoch call as function
* add versions to bootstrap and satellites
Change-Id: I436944589ea5f21600cdd997742a84fe0b16e47b
* Change Update Migration
* Fix DB Migration
* Increase Timeout temporarily, to see whats going on
* Remove unnecessary const and vars
Cleanup Function calls from deprecated NodeVersion struct
* Updated Protopuf, removed depcreated Code from Inspector
* Implement NodeVersion into InfoResponse
* Regenerated locked.go
* Linting
* Fix Tests
* Remove unnecessary constant
* Update Function and Flag Description
* Remove Empty Stat Creation
* return properly with error
* Remove unnecessary struct
* simplify migration step
* Update Inspector to return Version Info
* Update local Endpoint Version Handling
* Reset Travis Timeout
* Add Default for CommitHash
* single quotes
* reorg uplink cmd files for consistency
* init implementation of usage limiting
* Revert "reorg uplink cmd files for consistency"
This reverts commit 91ced7639bf36fc8af1db237b01e233ca92f1890.
* add changes per CR comments
* fix custom query to use rebind
* updates per convo about what to limit on
* changes per comments
* fix syntax and comments
* add integration test, add db methods for test
* update migration, add rebind to query
* update testdata for psql
* remove unneeded drop index statement
* fix migrations, fix calculate usage limit
* fix comment
* add audit test back
* change methods to use bucketName/projectID, fix tests
* add changes per CR comments
* add test for uplink upload and err ssg
* changes per CR comments
* check get/put limit separately
* add MetadataSize to stats
* add logic for accumulating bucket stats in calculateAtRestData
* rename stats to BucketTally, move to accounting package
* define method on accountingDB for inserting bucketTallies
* insert bucketTallies into bucket_storage_tally table
* go gen
* undo changes to bucket usage
* update locked
* spacing
* moves changes to migration v11
* minor changes to fix lint and test err
* change sql to fix errs
This change adds satellite endpoint for receiving OrderLimits sent by storage node.
Change includes:
* wire up orders sender in storage node (also in testplanet)
* saving serial number for OrderLimit in serial_numbers table
* satellite endpoint for receiving, verifying and storing OrderLimit and Order serial number
* initial implementation for Orders DB
* basic test for sending orders to satellite
* Move from Unique to Index
* Remove Index
* Make some more Indexes Unique and adjust migration
* Fix Migration Statements
* Fix Typo
* Fix Migration of older Table
* Exchange DROP statement
* Remove "if not exists"
* Revert Change in old Migration
* define irreparable inspector protobuf
* add IrreparableDB method GetLimited
* fill out irreparable inspector API
* add IrreparableInspector server to satellite, fix small error
* refactor IrreparableDB to use pb.IrreparableSegment instead of irreparable.RemoteSegmentInfo
* Wiring up DumpNodes response for Inspector
* Finalize everything and test that it works
* Get Count and DumpNodes working for Overlay Cache
* WIP updating payment rollup to check statDB instead of overlay
* FIrst pass at updating statDB to take wallet and email
* Passing tests
* use pb.NodeOperator instead of Meta struct
* remove TODO
* revert go.mod
* Get SQL migration working correctly
* Changes Meta to Operator in NodeStats struct
* Adds update operator logic for statDB
* Fix db migrate tests - added v5 snapshot
* User friendly msg for missing snapshot version
* Passing tests
* Change node update to happen in discovery instead of in overlay
* Fix logic and update function calls
* Update comment on UpdateOperator interface method
* Update name of parameter
* Change type of argument to UpdateOperator
* Updates statDB tests
* Divide uplink and gateway params set
* attempt to fix docker
* attempt to fix all in one
* test
* more reorganization
* fix compilation error
* fix imports order
* fix dependency
* rename structs
* keep minio params for now
* review comments
* remove manual flag check
* add 45 day expiration to PBAs
* add expiration field to relevant areas, DeleteExpired placeholder
* reject expired BWAs
* test for expired BWAs
* add BwExpiration config value
* fix - Satellite crashing on receiving a manipulated bandwidthagreement
* provider.PeerIdentityFromContext called twice. Remove one
* add storage node ID to serial number
* remove serialNum query and transaction
* add uuid to GeneratePayerBandwidthAllocation for testing
* enable expected failure on duplicate serialnum cases
* Revert "enable expected failure on duplicate serialnum cases"
This reverts commit 5948f43ed1741c280f0bb34a86c1c490365417bc.
* enable expected failure on duplicate serialnum cases
* intial changes to migrate statdb to masterdb framework
* statdb refactor compiles
* added TestCreateDoesNotExist testcase
* Initial port of statdb to masterdb framework working
* refactored statdb proto def to pkg/statdb
* removed statdb/proto folder
* moved pb.Node to storj.NodeID
* CreateEntryIfNotExistsRequest moved pd.Node to storj.NodeID
* moved the fields from pb.Node to statdb.UpdateRequest
ported TestUpdateExists, TestUpdateUptimeExists, TestUpdateAuditSuccessExists TestUpdateBatchExists
* Merge bwagreement db into satellite master db
* adjust to recent tally changes
* linter problems
* linter problems
* returning db structs in more optimal way
* added pointer for assignment
* error message changed
* better param message