In an effort to distribute load on the reputation database, the
reputation write cache scheduled nodes to be written at a time offset by
the local nodeID. The idea was that no two repair workers would have the
same nodeID, so they would not tend to write to the same row at the same
time.
Instead, since all satellite processes share the same satellite ID
(duh), this caused _all_ workers to try and write to the same row at the
same time _always_. This was not ideal.
This change uses a random number instead of the satellite ID. The random
number is sourced from the number of nanoseconds since the Unix epoch.
As long as workers are not started at the exact same nanosecond, they
ought to get well-distributed offsets.
Change-Id: I149bdaa6ca1ee6043cfedcf1489dd9d3e3c7a163
This will apply an appropriate "subsystem" label to goroutines which are
part of the core, api, repairer, admin, or gc subsystems.
It will also label goroutines whose job it is to watch for slow shutdown
of lifecycle groups (there are a lot of these).
Finally, this will also label goroutines whose job it is to wait on the
toplevel errgroup of a subsystem.
Change-Id: I560b5fff4a0101300d6c9a67609c2d80d7424486
The main motivation is to wrap the bucket DB and metainfo DB, so we
could check if a bucket is empty before applying geofencing config.
Change-Id: I8bac21555e01d51a663fb557bc1acfc8106bc2e1
This is part of metaloop refactoring. We plan to remove
irreparable at some point but there was not time for it.
Now instead refatoring it for segmentloop its just easier
to drop it.
Later we still need to drop table with migration step.
Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.
Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.
Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.
The new format for RS override values is
"k/o/n-override,k/o/n-override..."
Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
It feels weird having a repairer configuration part of order services.
Let's have a single source of truth for it.
Change-Id: I24f7c897aec80f3293f8af24876cbb6733d85a0b
See https://storjlabs.atlassian.net/browse/SM-752
These changes allow us to change the log level at runtime through a handler off of the debug endpoint.
Examples of changing the log level on storj-sim
To get the current level for the satellite api process:
curl -XGET 'http://127.0.0.1:10009/logging' --header 'Content-Type: text/plain'
To change the log level:
curl -XPUT 'http://127.0.0.1:10009/logging' --header 'Content-Type: text/plain' --data-raw '{"level":"error"}'
Change-Id: I05d164b290929fa06b6d78c01075ee41f8238044
* satellite: update log levels
Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f
* log version on start up for every service
Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de
* Use monkit for tracking failed ip resolutions
Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2
* fix compile errors
Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf
* add request limit value to storage node rpc err
Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5
* we cant track storage node ids in monkit metrics so lets use logging to track that for expired orders
Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887
* fix build errs
Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47
Add flag to satellite repairer, "InMemoryRepair" that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read from in the
upload phase of repair.
This should help with handling high repair traffic on satellites that
cannot afford to spend 64mb of memory per repair worker.
Updates tests to test repair for both in memory and to disk.
Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194
* debug
* traces
* cfgstruct
* process
Package `storj/private/version` will be removed as a separate change.
Change-Id: Iadc40faa782e6225513b28218952f02d9c240a9f
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).
Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.
This change also better differentiates some error cases from Repair()
for monitoring purposes.
Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
Currently we risk losing pending bandwidth rollup writes even on a clean
shutdown. This change ensures that all pending writes are actually
written to the db when shutting down the satellite.
Change-Id: Ideab62fa9808937d3dce9585c52405d8c8a0e703
Remove starting up messages from peers. We expect all of them to start,
if they don't, then they should return an error why they don't start.
The only informative message is when a service is disabled.
When doing initial database setup then each migration step isn't
informative, hence print only a single line with the final version.
Also use shorter log scopes.
Change-Id: Ic8b61411df2eeae2a36d600a0c2fbc97a84a5b93
* set up satellite repair run command
* add separated repair process to storj-sim
* add repairer peer to satellite in testplanet
* move api run cmd into api.go
* add satellite run repair to entrypoint