this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
it was noticed that if you had a long lived transaction A that
was blocking some other transaction B and A was being aborted
due to retriable errors, then transaction B was never given
priority. this was due to using savepoints to do lightweight
retries.
this behavior was problematic becaue we had some queries blocked
for over 16 hours, so this commit addresses the issue with two
prongs:
1. bound the amount of time we will retry a transaction
2. create new transactions when a retry is needed
the first ensures that we never wait for 16 hours, and the value
chosen is 10 minutes. that should be long enough for an ample
amount of retries for small queries, and huge queries probably
shouldn't be retried, even if possible: it's more preferrable to
find a way to make them smaller.
the second ensures that even in the case of retries, queries that
are blocked on the aborted transaction gain priority to run.
between those two changes, the maximum stall time due to retries
should be bounded to around 10 minutes.
Change-Id: Icf898501ef505a89738820a3fae2580988f9f5f4
We move PathCipher to encryption.Store and we need to adjust
storj/uplink for those changes. Uplink repo is also using libuplink to
run tests so we need first adjust storj/storj libuplink and later
storj/uplink.
Change-Id: I84f23e6bad18ac139f72c19939dc526f9f46d88b
this is to help protect against intentional or unintentional
slowloris style problems where a client keeps a tcp connection
alive but never sends any data. because grpc is great, we have
to spawn a separate goroutine for every read/write to the stream
so that we can return from the server handler to cancel it if
necessary. yep. really.
additionally, we update the rpcstatus package to do some stack
trace capture and add a Wrap method for the times where we want
to just use the existing error.
also fixes a number of TODOs where we attach status codes to the
returned errors in the endpoints.
Change-Id: Id8bb8ff84aa34e0f711b0cf9bce3908b36a1d3c1
This reverts commit 8e242cd012.
Revert because lib/pq has known issues with context cancellation.
These issues need to be resolved before these changes can be merged.
Change-Id: I160af51dbc2d67c5449aafa406a403e5367bb555
this will allow for some nice runtime analysis down the road.
also, this allows for wrapping database handles in a way that
can interact with these contexts
requires https://review.dev.storj.io/c/storj/dbx/+/514
Change-Id: Ib087b7cd73296dd2c1e0331314da34d861f61d2b
Adds check to see if storage nodes are eligible to initiate
graceful exit, by checking their CreatedAt date and seeing if
their "age" is greater than the new config value:
NodeMinAgeInMonths
The default for this value is 6 months for now.
https://storjlabs.atlassian.net/browse/V3-3357
Change-Id: Ib807ab8987ddb5a38a27a83886490f73fe8c5816
these may not be optimal but they're probably better based on
our previous testing. we can tune better in the future now that
the groundwork is there.
Change-Id: Iafaee86d3181287c33eadf6b7eceb307dda566a6
* separate sadb migration, add version check
* update checkversion to do same validation as migration
* changes per CR
* add sa migration to storj-sim
* add different debug port in storj-sim for migration
* add wait for exit for storj-sim migration
* update sa docker entrypoint to support migration
* storj-sim satellite parts all wait for migration
* upgrade golang-migrate/migrate to v4 because bug
* fix go mod tidy
keep a pool of connections open when dialing for drpc. this
makes it so that long lived clients (like lib/uplink's Project)
don't continue to use a bad connection forever. it also allows
for concurrent rpcs.
Change-Id: If649b286050e4f09c413fadc3e1ce88f5fc6e600
all of the packages and tests work with both grpc and
drpc. we'll probably need to do some jenkins pipelines
to run the tests with drpc as well.
most of the changes are really due to a bit of cleanup
of the pkg/transport.Client api into an rpc.Dialer in
the spirit of a net.Dialer. now that we don't need
observers, we can pass around stateless configuration
to everything rather than stateful things that issue
observations. it also adds a DialAddressID for the
case where we don't have a pb.Node, but we do have an
address and want to assert some ID. this happened
pretty frequently, and now there's no more weird
contortions creating custom tls options, etc.
a lot of the other changes are being consistent/using
the abstractions in the rpc package to do rpc style
things like finding peer information, or checking
status codes.
Change-Id: Ief62875e21d80a21b3c56a5a37f45887679f9412
The fundamental problem is that both drpc and grpc servers
want to close the listener and they both want to ignore the
error from Accept after the listener is closed. There's no
way to do this in a race free way. Fortunately, the mux
hands out listeners that can be independently closed. That
means they can both do their own shutdown logic where they
ignore the error, and then after they're closed, the code
orchestrating the servers can close the listeners.
The final weird bit is that the server's Close method is
required to wait until the Run method has exited (or at
least enough for the listeners to definitely be closed)
because tests depend on that behavior, so we have to add
some channels/mutexes/onces to ensure that Run has exited
and that a new call can't start after Close is called.
Change-Id: I7c4ef293f7963f83138815f51824fd5b8d09ce15