We want to eliminate usages of LoopSegmentEntry.Pieces, because it is
costing a lot of cpu time to look up node IDs with every piece of every
segment we read.
In this change, we are eliminating use of LoopSegmentEntry.Pieces in the
node tally observer (both the ranged loop and segments loop variants).
It is not necessary to have a fully resolved nodeID until it is time to
store totals in the database. We can use NodeAliases as the map key
instead, and resolve NodeIDs just before storing totals.
Refs: https://github.com/storj/storj/issues/5622
Change-Id: Iec12aa393072436d7c22cc5a4ae1b63966cbcc18
Adding an interval_end_time column to the accounting_rollups table
to keep the last interval_end_time for each daily storage tallies.
Updates https://github.com/storj/storj/issues/4178
Change-Id: If7a8210c5e9fe2fc9df84b137a8b6e3db2471c58
Currently we have a significant number of tallies that need to be
deleted together. Add a limit (by default 10k) to how many will
be deleted at the same time.
Change-Id: If530383f19b4d3bb83ed5fe956610a2e52f130a1
TestRollupNoDeletes is very flaky (passes locally but fails in the main branch build).
The exact reason is not clear, but stopping the loop seems to be async, the following lines may not stop the loops immediatelly which is a potential problem:
```
satellitePeer.Accounting.Rollup.Loop.Pause()
satellitePeer.Accounting.Tally.Loop.Pause()
```
Fortunatelly these test check only the database interfaces. Instead of testplanet.Run we can run only satellitedbtest.Run which is faster and more predictable (no background loops).
Other potential problem: comment claims that the default of DeleteTallies is false:
```
// In testplanet the setting config.Rollup.DeleteTallies defaults to false.
```
But it seems to be true (rollup.go):
```
DeleteTallies bool `help:"option for deleting tallies after they are rolled up" default:"true"`
```
This is also fixed in the patch (as we need set it explicit), but TBH it can be fixed with testplanet, too.
Change-Id: Id7ec80d5c069bed2c556f4d001c71aa23fc5af23
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.
As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.
This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.
In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.
Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
this change isn't the real fix. it's just ignoring the problem.
i don't know what the real fix is. is the problem with the test, or
is there actually a problem with the rollup code?
Change-Id: I552bdd947deadc212cc56efc5f818942b9827126
Remove a declared variable that's set by never read nor passed to any
function so it's unused code.
Change-Id: I8daf9d1f71d29ab39d7a80011d1b4813ada1c67d
this change tries really hard to never have all of the storage node
rollups in memory at the same time, up until the rollups are actually
getting summed together.
Change-Id: If67f49e7d71106798d996a6850b3e48671bd9e18
With the new phase 3 order submission, orders can be added to the
storage and bandwidth rollup tables at timestamps before the most recent
rollup was run. This change shifts the start time of each new rollup
window to account for any unexpired orders that might have been added
since the previous rollup.
A satellitedb migration is necessary to allow upserts in the
accounting_rollups table when entries with identical node_ids and
start_times are inserted.
Change-Id: Ib3022081f4d6be60cfec8430b45867ad3c01da63
It turns out we need to make 2 more changes in order for the new order submission phase 3 to get deployed.
This PR makes 2 changes:
1) when the rollup service deletes tallies, we now keep tallies around until orders expire (vs 1 day like before).
2) the reported rollup chore will now write the storagenode_bandwidth_rollups to a new table _phase2 as an intermediary step so it doesn't conflict with phase 3 order settlement.
These changes need to be deployed for 2 days before we can turn on phase 3 of the new orders settlement workflow.
Change-Id: Iafbff577ba7d55f8f17b7db857311b2ce799de60
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.
Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.
graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.
it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.
Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
Remove starting up messages from peers. We expect all of them to start,
if they don't, then they should return an error why they don't start.
The only informative message is when a service is disabled.
When doing initial database setup then each migration step isn't
informative, hence print only a single line with the final version.
Also use shorter log scopes.
Change-Id: Ic8b61411df2eeae2a36d600a0c2fbc97a84a5b93
In satellite/accounting/rollup Service.RollupStorage we have a few
potential error scenarios that return time.Now(). Especially in the case
where we exit early because we have received 0 tallies since the *last*
rollup, this creates a potential race condition.
Between the time we call GetTalliesSince and realize it is empty, it's
possible a tally was inserted in that interval. As currently written we
are returning a latestTally time that excludes that tally.
We are currently protected because in Service.Rollup we don't save the
rollup unless we have populated the rollupStats. However, this change is
more correct and future-proof, because Service.RollupStorage should
always return a correct latestTally time, which in case of errors and
empty tallies, is the last successful tally.
Change-Id: I2521a2cc9802c8f06e512dde4422803a272e2a0a
* Added batch update stats for recordAuditSuccessStatus
* Added batch update stats to recordAuditFailStatus
* added configurable batch size
* build individual update/delete statements so the statements can be batched into 1 call to the DB
* notified #config-changes channel and ran make update-satellite-config-lock
* updated tests to use batch update stats
* rename pkg/linksharing to linksharing
* rename pkg/httpserver to linksharing/httpserver
* rename pkg/eestream to uplink/eestream
* rename pkg/stream to uplink/stream
* rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo
* rename pkg/auth/signing to pkg/signing
* rename pkg/storage to uplink/storage
* rename pkg/accounting to satellite/accounting
* rename pkg/audit to satellite/audit
* rename pkg/certdb to satellite/certdb
* rename pkg/discovery to satellite/discovery
* rename pkg/overlay to satellite/overlay
* rename pkg/datarepair to satellite/repair