Firstly, this changes the repair functionality to return Canceled errors
when a repair is canceled during the Get phase. Previously, because we
do not track individual errors per piece, this would just show up as a
failure to download enough pieces to repair the segment, which would
cause the segment to be added to the IrreparableDB, which is entirely
unhelpful.
Then, ignore Canceled errors in the return value of the repair worker.
Apparently, when the worker returns an error, that makes Cobra exit the
program with a nonzero exit code, which causes some piece of our
deployment automation to freak out and page people. And when we ask the
repair worker to shut down, "canceled" errors are what we _expect_, not
an error case.
Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.
Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.
RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).
Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
A change was made to use a metabase.SegmentKey (a byte slice alias)
as the last seen item to iterate through the irreparable DB in a
for loop. However, this SegmentKey was not initialized, thus it was
nil. This caused the DB query to return nothing, and healthy segments
could not be cleaned out of the irreparable DB.
Change-Id: Idb30d6fef6113a30a27158d548f62c7443e65a81
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.
In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.
Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.
Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
As part of the Metainfo Refactoring, we need to make the Metainfo Loop
working with both the current PointerDB and the new Metabase. Thus, the
Metainfo Loop should pass to the Observer interface more specific Object
and Segment types instead of pb.Pointer.
After this change, there are still a couple of use cases that require
access to the pb.Pointer (hence we have it as a field in the
metainfo.Segment type):
1. Expired Deletion Service
2. Repair Service
It would require additional refactoring in these two services before we
are able to clean this.
Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.
solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.
Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
Another change which is a part of refactoring to replace path parameter
(string/[]byte) with key paramter (metabase.SegmentKey)
Change-Id: I617878442442e5d59bbe5c995f913c3c93c16928
* The audit worker wants to get items from the queue and process them.
* The audit chore wants to create new queues and swap them in when the
old queue has been processed.
This change adds a "Queues" struct which handles the concurrency
issues around the worker fetching a queue and the chore swapping a new
queue in. It simplifies the logic of the "Queue" struct to its bare
bones, so that it behaves like a normal queue with no need to understand
the details of swapping and worker/chore interactions.
Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b
This change removes the overlay function FindStorageNodesForRepair,
which skips using the node selection cache and hits the database
directly. Otherwise, it is functionally identical to
FindStorageNodesForUpload, which checks the node selection cache first.
When selecting nodes for PUT_REPAIRs, we now call
FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce
database load.
Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e
* Do not swap the active audit queue with the pending audit queue until
the active audit queue is empty.
* Do not begin creating a new pending audit queue until the existing
pending audit queue has been swapped to the active queue.
Change-Id: I81db5bfa01458edb8cdbe71f5baeebdcb1b94317
It feels weird having a repairer configuration part of order services.
Let's have a single source of truth for it.
Change-Id: I24f7c897aec80f3293f8af24876cbb6733d85a0b
Inside CreateGetRepairOrderLimits we pass in a list of healthy pieces,
but when we query node info from this list we apply the "reliable" filter
again. We sometimes end up with nodes which at first were healthy, but then
became unhealthy, and thus can be repaired, but we do not update the 'unhealthyPieces'
list with these nodes.
This causes an error, 'piece to add already exists', as we fail to remove these
pieces from the pointer before replacing them with repaired pieces.
Change-Id: I6e2445f342ac117ded30351fa7e5e523c9ec26bd
What:
Use the github.com/jackc/pgx postgresql driver in place of
github.com/lib/pq.
Why:
github.com/lib/pq has some problems with error handling and context
cancellations (i.e. it might even issue queries or DML statements more
than once! see https://github.com/lib/pq/issues/939). The
github.com/jackx/pgx library appears not to have these problems, and
also appears to be better engineered and implemented (in particular, it
doesn't use "exceptions by panic"). It should also give us some
performance improvements in some cases, and even more so if we can use
it directly instead of going through the database/sql layer.
Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044
Adds monkit tracing for ecrepairer.downloadAndVerifyPiece and
ecrepairer.putPiece so we can get more accurate estimates of node
performance during repair.
Change-Id: Ic05025bf3c493bb3d6f5d325d090c5b7c9e5465d
This will speed up the Put step of repair by not waiting to time out for
a handful of slow nodes, at the expense of a slightly less durable
pointer. It will still repair to the optimal threshold, but not every
node that is selected will end up in the pointer.
Change-Id: I02a0658e3fe6fc0383f26af0f50a065b8b11a651
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration
Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
Sometimes nodes who have gracefully exited will still be holding pieces
according to the satellite. This has some unintended side effects
currently, such as nodes getting disqualified after having successfully
exited.
* When the audit reporter attempts to update node stats, do not update
stats (alpha, beta, suspension, disqualification) if the node has
finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated)
* Treat gracefully exited nodes as "not reputable" so that the repairer
and checker do not count them as healthy (overlay/statdb_test.go
TestKnownUnreliableOrOffline, repair/repair_test.go
TestRepairGracefullyExited)
Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue
Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
* satellite: update log levels
Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f
* log version on start up for every service
Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de
* Use monkit for tracking failed ip resolutions
Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2
* fix compile errors
Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf
* add request limit value to storage node rpc err
Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5
* we cant track storage node ids in monkit metrics so lets use logging to track that for expired orders
Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887
* fix build errs
Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47
Add flag to satellite repairer, "InMemoryRepair" that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read from in the
upload phase of repair.
This should help with handling high repair traffic on satellites that
cannot afford to spend 64mb of memory per repair worker.
Updates tests to test repair for both in memory and to disk.
Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194