* debug
* traces
* cfgstruct
* process
Package `storj/private/version` will be removed as a separate change.
Change-Id: Iadc40faa782e6225513b28218952f02d9c240a9f
The goal of this change is to improve the storagenode_storage_tallies table by removing the unneeded id column that is not being used but only taking up space, and also to add an index on a different column that needs it. Removing and adding a column seems simple, but ended up being more complicated because of some cockroachdb limitations.
The cockroachdb limitation when trying to remove a column from a table and create a new primary key are:
1. only allows primary key creation at table creation time (docs: https://www.cockroachlabs.com/docs/stable/primary-key.html)
2. table drop or rename is performed async and cannot be done in a transaction (issue: https://github.com/cockroachdb/cockroach/issues/12123, https://github.com/cockroachdb/cockroach/issues/22868)
To address these differences between cockroachdb and Postgres, this PR performs different migrations for the two database. The Postgres migration is straight forward and what you would expect, but the cockroach migration has two main changes:
1. To change a primary key, use the recommended process from the cockroachdb docs to create a new table with the new primary key you want and then migrate the data.
2. In order to do 1, we needed to do the new table renaming in a separate transaction from the data migration.
Ref: SM-65
Change-Id: Idc9aee3ab57aa4d5570e3d2980afea853cd966bf
This implements a service for pointer verification. This makes the
slightly clearer, because it's not part of metainfo.
It also adds a peer identity cache which reduces database calls and peer
identity decoding.
Change-Id: I45da40460d579c6f5fd74c69bccea215157aafda
step 1 in https://review.dev.storj.io/c/storj/uplink/+/1236
Now the old libuplink uses the temporary DeleteBucketReturnDeleted and
DeleteObjectReturnDeleted methods. This way, in the next step, we will
be able to change the DeleteBucket and DeleteObject methods to return
the deleted bucket/object.
Change-Id: I2e638be1960bca6ce1456c92849fcdd6d93e5252
by doing an indexed anti-join we're able to reduce the time to
select the pending orders by over 10x on postgres. this should
help us process pending orders much more quickly.
it probably won't do as good a job on cockroach because it does
not do an indexed anti-join and instead does a hash join after
scanning the entire consumed serials table. we should either
remove orders entirely or try to make that more efficient
when necessary.
Change-Id: I8ca0535acd21c51e74955b24c9b86d20e4f2ff9c
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair
This commit also removes unused overlay functionality.
Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).
Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states
Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
Initial change for checking bucket existence on satellite side for
requests like BeginObject and ListObjects. This is simple implementation
that is just checking bucket in DB but should be improved in future to
avoid DB calls as much as possible.
Part of https://storjlabs.atlassian.net/browse/USR-365
Change-Id: I9076acddc44d7dbfa7612a1c24a007de01621583
This adds a piece deletion handler that has debounce for failed dialing
and batching multiple jobs into a single request.
Change-Id: If64021bebb2faae7f3e6bdcceef705aed41e7d7b
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.
Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
To handle concurrent deletion requests we need to combine them into a
single request.
To implement this we introduces few concurrency ideas:
* Combiner, which takes a node id and a Job and handles combining
multiple requests to a single batch.
* Job, which represents deleting of multiple piece ids with a
notification mechanism to the caller.
* Queue, which provides communication from Combiner to Handler.
It can limit the number of requests per work queue.
* Handler, which takes an active Queue and processes it until it has
consumed all the jobs.
It can provide limits to handling concurrency.
Change-Id: I3299325534abad4bae66969ffa16c6ed95d5574f
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address
This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base
Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
We missed this in the migration that added the num_healthy_pieces
column. It exists in dbx, but not on the actual satellite table.
Change-Id: If16b5ec2325d56406250298531b3285215188bf3
Metainfo method validateAuth checks things like API key, user permission
and rate limit but at the end all errors were returned as
rpcstatus.Unauthenticated.
Old Metainfo is not touched to avoid backward compatibility issues.
Change-Id: I78eb276210fc50151da58a5c84e13ecd0961da29
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).
Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.
This change also better differentiates some error cases from Repair()
for monitoring purposes.
Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
New API has limited number of options to configure at the moment. We
should remove unused flags from Uplink CLI and add if needed in the
future.
Change-Id: Icf3f3dadd43cb61a3b408b02d0762aef34425dbf
In production, the satellite is overriding the default repair threshold
(35) to a higher value (52). In some places in the checker and
irreparable processes, the repair threshold on the redundancy scheme is
used in place of the override value. This fixes those cases.
Change-Id: Ie7387217d9fb3886f050b5e5b67be51f276196de
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.
Of the 14 fields that lacked time zones, only the 3 named 'interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding AT TIME ZONE current_setting('TIMEZONE')
Field with good data are migrated by adding AT TIME ZONE 'UTC'
Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.
https://storjlabs.atlassian.net/browse/SM-104
Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce
- Previously, checkSegmentAltered only checked for segments that were replaced
but we want to detect all changes to a segment that occurred while an audit was being conducted.
- Fixed a bug where nodes failing audits during reverify for non-piece-hash-verified
segments were not being removed from containment mode.
- Filled in gaps in reverify testing to ensure nodes are properly removed from containment.
Change-Id: Icd96d369278987200fd28581395725438972b292
The billing tests were flaky because some assertions ran before the
storage nodes finish their work.
A new helper function in testplanet has been added to allow to wait for
storage nodes endpoints to finish their work. This function now it's
used in the billing tests for avoiding their flakiness.
This commit closes the ticket:
https://storjlabs.atlassian.net/browse/SM-403
A part of fixing other billing tests flakiness.
Change-Id: Iacb750af435f515c04b1e1d3510a218d184c9abc
uplinks
New libuplink is not storing encryption values in with bucket but old
uplinks are using those values for configuration. If bucket was created
with new libuplink we will send back satellite defaults.
Change-Id: Ie1bf3682847e07b302270b4c4bf1a7219f4bf011