* pkg/datarepair/repairer: Always track time for repair
Make a minor change in the worker function of the repairer so that, on
success, it always tracks the time-for-repair metric, regardless of whether
the time-since-checker-queue metric can be tracked.
* storage/postgreskv: Wrap error in Get func
Wrap the error returned by the Get function, as is already done when the
query doesn't return any row.
* satellite/metainfo: Move debug msg to the right place
The NewStore function wrote a debug log message when the DB was
connected, but it always wrote it, even if an error occurred while
getting the connection.
* pkg/datarepair/repairer: Wrap error before logging it
Wrap the error returned by process which is executed by the Run method
of the repairer service to add context to the error log message.
* pkg/datarepair/repairer: Make errors more specific in worker
Make the error messages of the Service's "worker" method, and the
messages logged for those errors, more specific.
* pkg/storage/repair: Improve error reporting in Repair
To improve the error reporting of the pkg/storage/repair.Repair method,
several errors returned by this method and by the functions/methods it
relies on are now wrapped in their corresponding classes.
* pkg/storage/segments: Track path param of Repair method
Track in monkit the path parameter passed to the Repair method.
* satellite/satellitedb: Wrap Error returned by Delete
Wrap the error returned by the repairQueue.Delete method to enhance it
with a class and stack, so that the
pkg/storage/segments.Repairer.Repair method gets a more contextualized
error from it.
Create a new variable rather than reusing the existing one, because the
name of the existing one is confusing when reading the logic and it
takes more time to verify that the logic doesn't have a bug.
* pkg/datarepair: Add test to check number of uploaded pieces
Add a new test to verify the number of pieces that the repair process
uploads when a segment is injured.
* satellite/orders: Don't create "put order limits" over total
Repair must not create more "put order limits" than the total count.
* pkg/datarepair: Update upload repair pieces test
Update the test that checks the number of pieces uploaded during a
repair so that it uses the same excess over the success threshold value
as the implementation.
* satellite/orders: Limit repair put orders so they are not the total
Limit the number of put orders used by repair so that it only uploads
pieces up to a % excess over the success threshold.
* pkg/datarepair: Change DataRepair test to pass again
Make some changes in the DataRepair test so that it passes again now that
repair uploads repaired pieces only up to a % excess over the success
threshold.
Also update the step descriptions of the DataRepair test to match its
current behavior, and make them more generic to avoid having to update
them on minor future refactorings.
* satellite: Make repair excess optimal threshold configurable
Add a new configuration parameter to the satellite to set the
percentage excess over the optimal threshold, which is used to
determine how many pieces should be repaired/uploaded, rather than
having the value hard-coded.
* repairer: Add configurable param to segments/repairer
Add a new parameter to segments/repairer to calculate the maximum
number of excess nodes, based on the optimal threshold, to which repaired
pieces can be uploaded.
This new parameter has been added to avoid returning more nodes than the
number of upload orders that the data repair satellite service calculates
for repairing pieces.
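The upload cap described in the repair entries above can be sketched as follows. This is an illustration under stated assumptions, not the satellite's actual code: the function name repairUploadCount, its parameters, and the choice of an integer percentage with ceiling rounding are all hypothetical.

```go
package main

import "fmt"

// repairUploadCount returns how many pieces a repair should upload.
// Assumptions (not taken from the source): excessPercent is a whole
// percentage applied to the optimal threshold and rounded up, the target
// is capped at the total piece count, and healthy pieces are subtracted.
func repairUploadCount(healthy, optimal, total, excessPercent int) int {
	// optimal + ceil(optimal * excessPercent / 100), in integer arithmetic.
	target := optimal + (optimal*excessPercent+99)/100
	if target > total {
		target = total // never create more put order limits than the total
	}
	n := target - healthy
	if n < 0 {
		n = 0 // segment already has enough healthy pieces
	}
	return n
}

func main() {
	// 30 healthy pieces, optimal threshold 80, 95 total, 5% excess.
	fmt.Println(repairUploadCount(30, 80, 95, 5))
}
```

Integer arithmetic is used here to avoid floating-point rounding surprises when the ceiling is taken; a float-based configuration value would need the same care.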
* pkg/storage/ec: Update log message in client.Repair
* satellite: Update configuration lock file
* Disabled discovery service by changing from Stop() to Pause()
Paused to solve a race condition. If discovery is running, it may mark a node "up" after it has been manually marked "down" in this test.
* Extend the repair timeout
Fixes intermittent test failures when repairs were taking more than 2 seconds.
* Re-enabled test. Disabled discovery service by changing from Stop() to Pause()
* Changed back to Stop.
* Revert "Changed back to Stop."
This reverts commit 46d410e72dfae63e0c44915be42784cc9a7b5abf.
* re-enabling TestIdentifyInjuredSegments
* Changed Pause to Stop. Commented on timeout change
* testing...
* temporarily skipping audit tests
* changing back to discover Stop for testing via jenkins
* Revert "changing back to discover Stop for testing via jenkins"
This reverts commit 6aa8558b11a0053c30e0c8b2dbf0d6c0cb34ee6c.
* Changing back to Stop(). Depends on PR 2137
* Revert "temporarily skipping audit tests"
This reverts commit 1940ed9b315d663a0eb6c95521780cbcb48cb121.
* Removed reference to Graveyard since it's been removed
* added scopelint and corrected issues found
* corrected scopelint issue
* made updates based on Ivan's suggestions
Most were around naming conventions
Some were false positives, but I kept them since the test.Run could eventually be changed to run in parallel, which could cause a bug
Others were false positives. Added // nolint: scopelint
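For context on the scopelint findings mentioned above, here is a small sketch of the kind of loop-variable capture the linter guards against, together with the usual fix. The function collect is a hypothetical example, not code from this change. Note that Go 1.22 made loop variables per-iteration, so the explicit copy matters only on older toolchains.

```go
package main

import "fmt"

// collect returns one closure per name. The explicit copy gives each
// closure its own variable, which is what scopelint asks for when a
// range variable is used inside a function literal.
func collect(names []string) []func() string {
	var fns []func() string
	for _, name := range names {
		name := name // per-iteration copy; redundant from Go 1.22 onward
		fns = append(fns, func() string { return name })
	}
	return fns
}

func main() {
	for _, fn := range collect([]string{"first", "second"}) {
		fmt.Println(fn())
	}
}
```

The same copy is what keeps subtests safe if they are later switched to run in parallel.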
* first round cleanup based on go-critic
* more issues resolved for ifelsechain and unlambda checks
* updated from master and gocritic found a new ifElseChain issue
* disable appendAssign. it reports false positives
* re-enabled go-critic appendAssign and disabled lint check at code line level
* fixed go-critic lint error
* fixed // nolint add gocritic specifically
* add repair monkit stats
* rename values, use meter instead of counter, use success threshold instead of repair threshold
* Counter -> Meter
* add repair segment size
* update names and use ratios for healthy before/after repair
* restart jenkins
* repair no cutoff longtail
* commit repair pieces even if not hitting success threshold
* commit repair pieces even if not hitting success threshold
* remove useless condition
* better error message
We want to use those fields in the bucket-level Pointer objects as
bucket defaults, but we need to be able to get at them first.
I don't see any strong reason not to make these available, except
that it was kind of a pain.
* psclient receives storage node hash and compares it to own hash for verification
* uplink sends delete request when hashes don't match
* valid hashes are propagated up to segments.Store for future sending to satellite
Removes most instances of pb.SignedMessage (there's more to take out but they shouldn't hurt anyone as is).
There used to be places in psserver where a PieceID was hmac'd with the SatelliteID, which was gotten from a SignedMessage. This PR makes it so some functions access the SatelliteID from the Payer Bandwidth Allocation instead.
This requires passing a SatelliteID into psserver functions where they weren't before, so the following proto messages have been changed:
* PieceId - satellite_id field added
This is so the psserver.Piece function has access to the SatelliteID when it needs to get the namespaced pieceID.
This proto message should probably be renamed to PieceRequest, or a new PieceRequest message should be created so this isn't misnamed.
* PieceDelete - satellite_id field added
This is so the psserver.Delete function has access to the SatelliteID when receiving a request to Delete.
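As an illustration of why these functions need the SatelliteID, the sketch below derives a namespaced piece ID by HMAC-ing the PieceID with the SatelliteID. The function name namespacedPieceID, the choice of SHA-512, the hex encoding, and the use of the SatelliteID as the HMAC key are all assumptions, not details confirmed by this change.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// namespacedPieceID derives a per-satellite identifier for a piece.
// Assumption: the SatelliteID is the HMAC key and the PieceID the message.
func namespacedPieceID(pieceID, satelliteID []byte) string {
	mac := hmac.New(sha512.New, satelliteID)
	mac.Write(pieceID) // hash.Hash writes never return an error
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	id := namespacedPieceID([]byte("piece-1"), []byte("satellite-A"))
	fmt.Println(len(id), id[:16])
}
```

With this shape, the same PieceID maps to different stored identifiers for different satellites, which is why a request that omits the SatelliteID cannot locate the piece.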