This new repair timeout (configured as TotalTimeout) will include both
the time to download pieces and the time to upload pieces, as well as
the time to pop the segment from the repair queue.
This is a move from Github PR #3645.
Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8
With commit: 3331b443e7, satellite will
start calling `DeletePieces`. Therefore, we can remove the old endpoint
once the above commit is deployed with all satellites
Change-Id: I0124bc00a7cb808d119eb59f8fcd7fadf68158bb
Curently, storage nodes only report their capacity to satellites
once per hour. If a node fills up, it will fail all uploads until
the next contact cycle begins. With these changes, at the end of an
upload we check whether the MinimumDiskSpace threshold has been
passed. If so, trigger the monitor chore to update the node's
capacity, then trigger the contact chore to report the new
capacity to the satellites
Change-Id: Ie6aadaade1e2c12c87e03f8ff9059a50121380a0
Currently SNs report their free disk space once per hour. If a node
becomes full, it has to wait until the next contact cycle begins to
report; all the while receiving and failing upload requests. By increasing
the minimum required disk space, we can give the storage nodes more time
to report their space before the completely fill up. This change goes
hand-in-hand with another change we want to implement: trigger capacity
report on SN immediately upon falling below threshold.
Change-Id: I12f778286c6c3f582438b0e2949765ac43325e27
This peer will contain our administrative panels.
It's completely separated from our other satellite
processes because it allows better control for restricting
access to it.
Change-Id: Ifca473bee82ff6c680b346918ba32b835a7a6847
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
it was noticed that if you had a long lived transaction A that
was blocking some other transaction B and A was being aborted
due to retriable errors, then transaction B was never given
priority. this was due to using savepoints to do lightweight
retries.
this behavior was problematic becaue we had some queries blocked
for over 16 hours, so this commit addresses the issue with two
prongs:
1. bound the amount of time we will retry a transaction
2. create new transactions when a retry is needed
the first ensures that we never wait for 16 hours, and the value
chosen is 10 minutes. that should be long enough for an ample
amount of retries for small queries, and huge queries probably
shouldn't be retried, even if possible: it's more preferrable to
find a way to make them smaller.
the second ensures that even in the case of retries, queries that
are blocked on the aborted transaction gain priority to run.
between those two changes, the maximum stall time due to retries
should be bounded to around 10 minutes.
Change-Id: Icf898501ef505a89738820a3fae2580988f9f5f4
We move PathCipher to encryption.Store and we need to adjust
storj/uplink for those changes. Uplink repo is also using libuplink to
run tests so we need first adjust storj/storj libuplink and later
storj/uplink.
Change-Id: I84f23e6bad18ac139f72c19939dc526f9f46d88b
Allow rate limit project cache to expire so we can make project level rate limit changes without restarting the satellite process.
Change-Id: I159ea22edff5de7cbfcd13bfe70898dcef770e42
This ought to make it so that all single statements (Exec- or Query-) on
a CockroachDB backend will get retried as necessary. As there is no need
for savepoints to be allocated or released in this case, there is no
round-trip overhead except when statements actually do need to be
retried.
Change-Id: Ibd7f1725ff727477c456cb309120d080f3cd7099
We now close databases in testplanet in reverse order, knowing that some
caches and other objects need to close prior to the underlying db. Some
dbs were not being added near the list of closeable databases near their
creation, causing an issue with shutdown order.
Change-Id: I23391f4d77649030493e47bd7169002a72b3bf7a
this is to help protect against intentional or unintentional
slowloris style problems where a client keeps a tcp connection
alive but never sends any data. because grpc is great, we have
to spawn a separate goroutine for every read/write to the stream
so that we can return from the server handler to cancel it if
necessary. yep. really.
additionally, we update the rpcstatus package to do some stack
trace capture and add a Wrap method for the times where we want
to just use the existing error.
also fixes a number of TODOs where we attach status codes to the
returned errors in the endpoints.
Change-Id: Id8bb8ff84aa34e0f711b0cf9bce3908b36a1d3c1
Since we have caches on top of databases and they are included in the
databases list, we need to shut them down in-reverse order to avoid
issues with flushing to a closed database.
Change-Id: I3f23a527a2a5425638b1a7e2cab84741f019d493
We don't do a lot of panicking in our main code, so hopefully this won't
matter much, but we /do/ call panic a lot in our tests (t.Fatal,
require.NoError, etc). And when that happens, we need pending
transactions to be aborted or we can get into a deadlock situation when
something else tries to /Close/ that connection.
Change-Id: Idaf0d543ac95afea34f9b2393d1187f5322e9f0f
Currently we risk losing pending bandwidth rollup writes even on a clean
shutdown. This change ensures that all pending writes are actually
written to the db when shutting down the satellite.
Change-Id: Ideab62fa9808937d3dce9585c52405d8c8a0e703
Currently Cockroach DB setup takes a significant amount of time.
This flattens the database setup into a single query,
which improves the test time significantly.
The migration tests still test each migration separately.
Change-Id: Iaca16f34a6af3926fa2b5ebf618f939fd59460b3
With this change RS configuration will be set on satellite. Uplink with
get RS values with BeginObject request and will use it. For backward
compatibility and to avoid super large change redundancy scheme stored
with bucket is not touched. This can be done in future.
Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff
Limits how many times metainfo APIs can be called per second by project ID. If limit is exceeded, the API will return Unauthorized/Too Many requests.
Limit per second and the size of the limiter cache per project are configurable, as well as whether the limiter is enabled.
Tests added/updated for the new rate_limit field in projects table.
Tests added for exceeding limits and disableing limiter.
Change-Id: Ic8ad102de3b690a475809d4f684156d5715f20fa
Replace all the remaining uses of sql.DB with tagsql.DB to
fix issues with context cancellation.
Introduce tagsql.Open which helps to get rid of all tagsql.Wrap-s.
Use tagsql in cockroachkv and postgreskv.
Change-Id: I8946d203341cb85a25976896fc7881e1f704e779
Not having a skew caused an issue where:
1. Uplink calls "begin segment", where segment isn't committed to the
database.
2. Uplink stores piece X to the storage node A with timestamp 1.
3. Satellite runs garbage collection with timestamp 2.
4. Satellite sends retain request to storage node A with timestamp 2.
5. Storage node A deletes piece X, because 1 < 2.
6. Uplink calls "commit segment" with storage node A in it.
7. Download of segment fails, because A doesn't have piece X.
In production this is not an issue since the MaxTimeSkew is 72h by
default.
Change-Id: Id87ca3ddc44103dcd85d031b1367168c014b8e7b
Also added temporary types withRebind and withTagTx,
which will be later removed. Currently they help to avoid
changing the whole codebase at the same time.
Change-Id: I7f07ba8f4709a23a463bfa67464628665a05808f
dbschema.Query is used only for testing and sqlite,
so this won't cause us problems in production.
Change-Id: Ib296a7daf161a9d3de23a7dfdc4f505d47ac4a37
storagenode database preflight check.
Disable preflight database check by default, and have the option to
enable it. This will allow us to enable it once it is definitely
working.
Also change the name of the config flag for preflight time sync.
Change-Id: Ie2e20f9e25dcb38794eafa7e1505e7c6ff287c99
live accounting used to be a cache to store writes before they are picked up during
the tally iteration, after which the cache is cleared. This created a window in which
users could potentially exceed the storage limit. This PR refactors live accounting to
hold current estimations of space used per project. This should also reduce DB load
since we no longer need to query the satellite DB when checking space used for limiting.
The mechanism by which the new live accounting system works is as follows:
During the upload of any segment, the size of that segment is added to its respective
project total in live accounting. At the beginning of the tally iteration we record
the current values in live accounting as `initialLiveTotals`. At the end of the tally
iteration we again record the current totals in live accounting as `latestLiveTotals`.
The metainfo loop observer in tally allows us to get the project totals from what it
observed in metainfo DB which are stored in `tallyProjectTotals`. However, for any
particular segment uploaded during the metainfo loop, the observer may or may not
have seen it. Thus, we take half of the difference between `latestLiveTotals` and
`initialLiveTotals`, and add that to the total that was found during tally and set that
as the new live accounting total.
Initially, live accounting was storing the total stored amount across all nodes rather than
the segment size, which is inconsistent with how we record amounts stored in the project
accounting DB, so we have refactored live accounting to record segment size
Change-Id: Ie48bfdef453428fcdc180b2d781a69d58fd927fb