inconsistency
The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.
Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
Since we are sharing the reporting logic between repair and audit. We
need to remove metric reporting logic in reporter.
Change-Id: Ib87295ab19079329e7438327d785a7f5c21d3b21
When a node gets enough timeouts, it is supposed to be removed
from pending_audits and get an audit failure. We would give them
a failure, but we missed the removal. This change fixes it.
Change-Id: I2f7014e28d7d9b01a9d051f5bbb4f67c86c7b36b
package in audit
This PR implements reputation store and replace overlay in audit service
to use such store for storing node's audit stats.
In order to keep the changeset smaller, most of the changes in this PR is for copying audit logic in overlay to
reputation package. In a following PR, the duplicating code will be
removed from overlay.
Change-Id: I16c12494a0970f44c422b26cf603c1dc489e5bc1
Currently, pending audit is finding segment by segment location
(path) because we want to move audit to segmentloop and we will
have only StreamID and Position we need to add columns for those
fields. Altering existing table can cause issues while
migration and deployment. Cleaner choise is to make new table.
This change contains migration with new segment_pending_audit
table that will replace pending_audits table and adjustments
to use new table in the code.
Table pending_audits will be dropped with next release.
Change-Id: Id507e29c152da594bac1fd812c78d7ecf45ec51f
Among other conditions, nodes fail audits by returning incorrect
data and by reaching the max reverify count, but we weren't logging
these events. This commit adds the missing logs.
Change-Id: I80749a7e95e8cb97bc8dd7dac1e523e223114b7f
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.
Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.
Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
* skip unknown errors (wip)
* add tests to make sure nodes that time out are added to containment
* add bad blobs store
* call "Skipped" "Unknown"
* add tests to ensure unknown errors do not trigger containment
* add monkit stats to lockfile
* typo
* add periods to end of bad blobs comments
* Added batch update stats for recordAuditSuccessStatus
* Added batch update stats to recordAuditFailStatus
* added configurable batch size
* build individual update/delete statements so the statements can be batched into 1 call to the DB
* notified #config-changes channel and ran make update-satellite-config-lock
* updated tests to use batch update stats
* rename pkg/linksharing to linksharing
* rename pkg/httpserver to linksharing/httpserver
* rename pkg/eestream to uplink/eestream
* rename pkg/stream to uplink/stream
* rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo
* rename pkg/auth/signing to pkg/signing
* rename pkg/storage to uplink/storage
* rename pkg/accounting to satellite/accounting
* rename pkg/audit to satellite/audit
* rename pkg/certdb to satellite/certdb
* rename pkg/discovery to satellite/discovery
* rename pkg/overlay to satellite/overlay
* rename pkg/datarepair to satellite/repair