size with BeginObject
Such solution will add one round trip to satellite during upload so for
now we are reverting this until we will have solution for this.
Change-Id: Ic2d826448ab7b0318cd6922df05deee9167cf2f0
BeginObject response
We want to control inline segment size and segment size on satellite
side. We need to return such information to uplink like with redundancy
scheme.
Change-Id: If04b0a45a2757a01c0cc046432c115f475e9323c
uuid.UUID implements driver.Value so it can be directly used as a
scannable result.
Replace uses of dbutil.BytesToUUID with uuid.FromBytes.
Change-Id: I51a670185ceb3cc2199d5aa2b76bc3fc191ca8fe
This will help to determine how many grpc calls are made to the
satellite.
Also remove the grpc funcs that have been added to upstream.
Change-Id: I91878f4fd10f9bfe601c94222c102eaaf4d35963
* debug
* traces
* cfgstruct
* process
Package `storj/private/version` will be removed as a separate change.
Change-Id: Iadc40faa782e6225513b28218952f02d9c240a9f
Switch back to the original DeleteBucket and DeleteObject methods.
Next step: remove the DeleteBucketReturnDeleted and
DeleteObjectReturnDeleted from storj.io/uplink.
Change-Id: I273a305326d411e51ce354ce72fcc6ecadf4dd5f
step 1 in https://review.dev.storj.io/c/storj/uplink/+/1236
Now the old libuplink uses the temporary DeleteBucketReturnDeleted and
DeleteObjectReturnDeleted methods. This way, in the next step, we will
be able to change the DeleteBucket and DeleteObject methods to return
the deleted bucket/object.
Change-Id: I2e638be1960bca6ce1456c92849fcdd6d93e5252
The admission/v3 protocol now supports arbitrary key/value headers to be
included in each packet of metrics. This commit creates support for
this, so the lua config file can declare a filter taking into account
the key/value headers.
Change-Id: I41de8c018d33304ccf46ec221ae689d55c5fb1ee
When a storagenode begins to run low on capacity, we want to notify
the satellite before completely running out of space. To achieve this,
at the end of an upload request, the SN checks if its available space has
fallen below a certain threshold. If so, trigger a notification to the
satellites.
The new NotifyLowDisk method on the monitor chore is implemented using the
common/syn2.Cooldown type, which allows us to execute contact only once
within a given timeframe; avoiding hammering the satellites with requests.
This PR contains changes to the storagenode/contact package, namely moving
methods involving the actual satellite communication out of Chore and into
Service. This allows us to ping satellites from the monitor chore
Change-Id: I668455748cdc6741291b61130d8ef9feece86458
common/pb moved grpc to a separate package common/pb/pbgrpc.
This updates this repository to use it.
Change-Id: I2de2a190688871cf9cb61f7ea511f8a01e264e4e
Now that we are trying to identify the root cause of the satellite load limitations (i.e. currently the satellite has a max ability of 400 rps for uploads and we need this to be higher), we are using the golang diagnostic tools to collect insight into what the bottlenecks are. We currently have a debug endpoint to gather some cpu and mem data, but it could be useful to have continuous profiling. GCP stackdriver has support for continuous profiling so lets set that up and see if it is helpful to gather more data.
This PR adds support for [GCP continuous profiler](https://cloud.google.com/profiler) which allows enabling continuous cpu/mem profiling and the stats are sent to stackdriver in google cloud console.
To enable the continuous profiling for a storj component, do the following:
- prereq: the workload must be running in GKE and have Stackdriver Profiling IAM role permissions
- provide the config flag `debug.profilename` in the config.yaml file for the workload (i.e. satellite api process, etc). The profilename should be the workload name, for example "satellite-api".
- once the above config flag is provided, the profiler will be initialized and profiling stats will automatically be sent to GCP project where the workload is running and viewable in the Stackdriver Profile page in the console
The current implementation assumes the workload is running in GKE, however if we find if useful we can add support to enable this from anywhere. But for simplicity, its configured this way assuming the main goal is to enable in production systems.
Change-Id: Ibf8ebe2df7bf06fdd4951ee6a1e48854dd36ad47
With commit: 3331b443e7, satellite will
start calling `DeletePieces`. Therefore, we can remove the old endpoint
once the above commit is deployed with all satellites
Change-Id: I0124bc00a7cb808d119eb59f8fcd7fadf68158bb
we want to return back to the user as quick as possible but also keep
deleting remaining pieces on the storagenodes
Change-Id: I04e9e7a80b17a8c474c841cceae02bb21d2e796f
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
it was noticed that if you had a long lived transaction A that
was blocking some other transaction B and A was being aborted
due to retriable errors, then transaction B was never given
priority. this was due to using savepoints to do lightweight
retries.
this behavior was problematic becaue we had some queries blocked
for over 16 hours, so this commit addresses the issue with two
prongs:
1. bound the amount of time we will retry a transaction
2. create new transactions when a retry is needed
the first ensures that we never wait for 16 hours, and the value
chosen is 10 minutes. that should be long enough for an ample
amount of retries for small queries, and huge queries probably
shouldn't be retried, even if possible: it's more preferrable to
find a way to make them smaller.
the second ensures that even in the case of retries, queries that
are blocked on the aborted transaction gain priority to run.
between those two changes, the maximum stall time due to retries
should be bounded to around 10 minutes.
Change-Id: Icf898501ef505a89738820a3fae2580988f9f5f4
We move PathCipher to encryption.Store and we need to adjust
storj/uplink for those changes. Uplink repo is also using libuplink to
run tests so we need first adjust storj/storj libuplink and later
storj/uplink.
Change-Id: I84f23e6bad18ac139f72c19939dc526f9f46d88b
this is to help protect against intentional or unintentional
slowloris style problems where a client keeps a tcp connection
alive but never sends any data. because grpc is great, we have
to spawn a separate goroutine for every read/write to the stream
so that we can return from the server handler to cancel it if
necessary. yep. really.
additionally, we update the rpcstatus package to do some stack
trace capture and add a Wrap method for the times where we want
to just use the existing error.
also fixes a number of TODOs where we attach status codes to the
returned errors in the endpoints.
Change-Id: Id8bb8ff84aa34e0f711b0cf9bce3908b36a1d3c1
This reverts commit 8e242cd012.
Revert because lib/pq has known issues with context cancellation.
These issues need to be resolved before these changes can be merged.
Change-Id: I160af51dbc2d67c5449aafa406a403e5367bb555
this will allow for some nice runtime analysis down the road.
also, this allows for wrapping database handles in a way that
can interact with these contexts
requires https://review.dev.storj.io/c/storj/dbx/+/514
Change-Id: Ib087b7cd73296dd2c1e0331314da34d861f61d2b
Adds check to see if storage nodes are eligible to initiate
graceful exit, by checking their CreatedAt date and seeing if
their "age" is greater than the new config value:
NodeMinAgeInMonths
The default for this value is 6 months for now.
https://storjlabs.atlassian.net/browse/V3-3357
Change-Id: Ib807ab8987ddb5a38a27a83886490f73fe8c5816
these may not be optimal but they're probably better based on
our previous testing. we can tune better in the future now that
the groundwork is there.
Change-Id: Iafaee86d3181287c33eadf6b7eceb307dda566a6
* separate sadb migration, add version check
* update checkversion to do same validation as migration
* changes per CR
* add sa migration to storj-sim
* add different debug port in storj-sim for migration
* add wait for exit for storj-sim migration
* update sa docker entrypoint to support migration
* storj-sim satellite parts all wait for migration
* upgrade golang-migrate/migrate to v4 because bug
* fix go mod tidy
keep a pool of connections open when dialing for drpc. this
makes it so that long lived clients (like lib/uplink's Project)
don't continue to use a bad connection forever. it also allows
for concurrent rpcs.
Change-Id: If649b286050e4f09c413fadc3e1ce88f5fc6e600
all of the packages and tests work with both grpc and
drpc. we'll probably need to do some jenkins pipelines
to run the tests with drpc as well.
most of the changes are really due to a bit of cleanup
of the pkg/transport.Client api into an rpc.Dialer in
the spirit of a net.Dialer. now that we don't need
observers, we can pass around stateless configuration
to everything rather than stateful things that issue
observations. it also adds a DialAddressID for the
case where we don't have a pb.Node, but we do have an
address and want to assert some ID. this happened
pretty frequently, and now there's no more weird
contortions creating custom tls options, etc.
a lot of the other changes are being consistent/using
the abstractions in the rpc package to do rpc style
things like finding peer information, or checking
status codes.
Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412
The fundamental problem is that both drpc and grpc servers
want to close the listener and they both want to ignore the
error from Accept after the listener is closed. There's no
way to do this in a race free way. Fortunately, the mux
hands out listeners that can be independently closed. That
means they can both do their own shutdown logic where they
ignore the error, and then after they're closed, the code
orchestrating the servers can close the listeners.
The final weird bit is that the server's Close method is
required to wait until the Run method has exited (or at
least enough for the listeners to definitely be closed)
because tests depend on that behavior, so we have to add
some channels/mutexes/onces to ensure that Run has exited
and that a new call can't start after Close is called.
Change-Id: I7c4ef293f7963f83138815f51824fd5b8d09ce15