Commit Graph

67 Commits

Author SHA1 Message Date
Erik van Velzen
db1cc8ca95 satellite/repair/checker: buffer repair queue
Integrate previous changes. Speed up the segment loop by batch inserting
into repair queue.

Change-Id: Ib9f4962d91960d21bad298f7771345b0dd270276
2022-05-12 16:28:05 +00:00
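
The batching idea in db1cc8ca95 can be sketched roughly as follows. This is a minimal in-memory illustration, not the checker's actual code: the names InjuredSegment, InsertBuffer and the flush callback are hypothetical, and the flush callback stands in for whatever batch insert the repair queue provides.

```go
package main

import "fmt"

// InjuredSegment is a hypothetical stand-in for a repair queue entry.
type InjuredSegment struct {
	StreamID string
	Position uint64
}

// InsertBuffer collects segments and hands them to flush in batches,
// so the caller issues one insert per batch instead of one per segment.
type InsertBuffer struct {
	batchSize int
	pending   []InjuredSegment
	flush     func(batch []InjuredSegment) error
}

// NewInsertBuffer creates a buffer that flushes every batchSize inserts.
func NewInsertBuffer(batchSize int, flush func([]InjuredSegment) error) *InsertBuffer {
	return &InsertBuffer{batchSize: batchSize, flush: flush}
}

// Insert queues one segment and flushes when the buffer is full.
func (b *InsertBuffer) Insert(seg InjuredSegment) error {
	b.pending = append(b.pending, seg)
	if len(b.pending) >= b.batchSize {
		return b.Flush()
	}
	return nil
}

// Flush writes any pending segments; call it once more when the loop ends.
func (b *InsertBuffer) Flush() error {
	if len(b.pending) == 0 {
		return nil
	}
	batch := b.pending
	b.pending = nil
	return b.flush(batch)
}

func main() {
	batches := 0
	buf := NewInsertBuffer(3, func(batch []InjuredSegment) error {
		batches++
		fmt.Printf("batch %d: %d segments\n", batches, len(batch))
		return nil
	})
	for i := 0; i < 7; i++ {
		if err := buf.Insert(InjuredSegment{StreamID: "s", Position: uint64(i)}); err != nil {
			panic(err)
		}
	}
	// The final partial batch is only written by the explicit Flush.
	if err := buf.Flush(); err != nil {
		panic(err)
	}
}
```

The explicit final Flush matters: without it, segments left in a partially filled buffer would never reach the queue.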
Cameron Ayer
a7cda642a5 satellite/repair: add logging for irreparable segments in checker
If the checker sees an irreparable segment, log out some info
so we can see what the problem is

Change-Id: I76eda5270214205f4fefc646d6c391cc13ddcafd
2021-09-02 12:35:29 -04:00
Clement Sam
1f353f3231 segment/{metabase,repair}: change segment created_at column to not accept nulls
This change adds a NOT NULL constraint to the created_at column in the segment table.
All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc)

Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52
2021-08-06 08:16:28 +00:00
Michał Niewrzał
0d8e7905c1 satellite/repair/checker: don't return error when joining loop
An error from joining the loop should not restart the satellite. It will
be the same error as the one returned by the loop itself. We handle join
errors the same way in other services that use the segment loop.

Change-Id: Idf1035ef7f78462927bd23989ed8a4ee5826c49e
2021-08-03 12:56:42 +00:00
Michał Niewrzał
a883d7f582 satellite/repair/checker: fix remote_files_checked metric
During the metaloop refactoring we missed the metric for all
objects processed by the repair checker.

Change-Id: I100f10a36c52e2651923ecaa377261752877d673
2021-07-22 14:48:08 +00:00
Michał Niewrzał
b900f6b4f9 satellite/repair/checker: move checker to segment loop
Change-Id: I04b25e4fa14c822c9524586c25bde89db2a6cad9
2021-07-01 13:51:56 +00:00
Michał Niewrzał
d53aacc058 satellite/repair: migrate to new repair_queue table
We want to use StreamID/Position to identify injured
segment. As it is hard to alter existing injuredsegments
table we are adding a new table that will replace existing
one. Old table will be dropped later.

Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
2021-06-30 17:12:24 +02:00
Michał Niewrzał
a93e47514a satellite: remove irreparabledb
This is part of the metaloop refactoring. We planned to remove
irreparable at some point but there was no time for it.
Now, instead of refactoring it for segmentloop, it's just easier
to drop it.

We still need to drop the table later with a migration step.

Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
2021-06-17 07:20:15 +00:00
Egon Elbre
910eec8eee satellite/metainfo: remove MetabaseDB interface
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.

Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.

Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
2021-05-13 13:22:14 +00:00
Egon Elbre
961e841bd7 all: fix error naming
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:

    ERROR node stats service error: satellitedbs error: node stats database error: no rows

Whereas something like:

    ERROR nodestats service: satellitedbs: nodestatsdb: no rows

Would contain all the necessary information without the stutter.

Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
2021-04-29 15:38:21 +03:00
Egon Elbre
4c9ed64f75 satellite/metabase/metaloop: move loop under metabase
Currently the loop handling is heavily related to the metabase rather
than metainfo.

metainfo over time has become related to the "public API" for accessing
the metabase data.

Currently updates monkit.lock, because monkit monitoring does not handle
ScopeNamed correctly. Needs a followup change to monitoring check.

Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584
2021-04-22 12:58:09 +03:00
Egon Elbre
267506bb20 satellite/metabase: move package one level higher
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.

metainfo is going to be the "endpoint" logic for handling requests.

Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
2021-04-21 15:54:22 +03:00
Fadila Khadar
bde367ae73 satellite/gc: check on bloom filter creation date
Check that the bloom filter creation date is earlier than the
metainfo loop system time used for db scanning.

Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4
2021-04-14 16:40:37 +00:00
Michał Niewrzał
a5224e7a6c satellite/metainfo/metaloop: use segment CreatedAt and RepairedAt
Repair checker expects to have information about CreatedAt and RepairedAt fields to calculate segment age metric.

Change-Id: I6b41df880d77133be541e14d10d91cc75759b339
2021-04-02 08:46:54 +00:00
Egon Elbre
f19ef4afe5 satellite/metainfo/metaloop: move loop to a separate package
Change-Id: I94c931a27c1af6062185ec62688624ec02050f11
2021-03-23 15:37:34 +00:00
Egon Elbre
4c0ea717eb satellite/metainfo: remove unneeded dependencies from Loop
metainfo.Loop doesn't require buckets nor pointerdb anymore.

Also:
* fix comments
* update full iterator limit to 2500

Change-Id: I6604402868f5c34079197c407f969ac8015e63c5
2021-02-19 15:11:16 +02:00
Egon Elbre
c860b74a37 satellite/repair/checker: allow for multipart objects
We have multipart objects, so an object may have multiple inline segment
sequences or no segments at all.

Change-Id: Ie46ee777a2db8f18f7154e3443bb9e07ecb170f7
2021-02-18 20:31:49 +02:00
Michał Niewrzał
ec88d21a3c Merge 'main' branch.
Change-Id: I6e8162d1a6caf75e89c9f9c9f9522730aebf83ae
2021-01-11 10:26:58 +01:00
Moby von Briesen
a90d6fcad8 satellite/repair/checker: Use segment health on checker insert
Do not insert the number of healthy pieces for segment health anymore.
Rather, insert the segment health calculated by our new priority
function.

Change-Id: Ieee7fb2deee89f4d79ae85bac7f577befa2a0c7f
2021-01-04 11:48:17 -05:00
Kaloyan Raev
4d37d14929 satellite/{metrics,repair}: adjust monitoring to new metainfo loop
Change-Id: I87a2145daa5ed49bb2c08d6967baa09c0b14b4c6
2020-12-21 09:05:17 +02:00
Michal Niewrzal
2111740236 Merge 'master' branch
Change-Id: Ib73af0ff3ce0e9a1547b0b9fc55bf88704f6f394
2020-12-18 09:13:24 +01:00
paul cannon
d3604a5e90 satellite/repair: use survivability model for segment health
The chief segment health models we've come up with are the "immediate
danger" model and the "survivability" model. The former calculates the
chance of a segment becoming lost in the next time period (using
the CDF of the binomial distribution to estimate the chance of x nodes
failing in that period), while the latter estimates the number of
iterations for which a segment can be expected to survive (using the
mean of the negative binomial distribution). The immediate danger model
was a promising one for comparing segment health across segments with
different RS parameters, as it is more precisely what we want to
prevent, but it turns out that practically all segments in production
have infinite health, as the chance of losing segments with any
reasonable estimate of node failure rate is smaller than DBL_EPSILON,
the smallest possible difference from 1.0 representable in a float64
(about 1e-16).

Leaving aside the wisdom of worrying about the repair of segments that
have less than a 1e-16 chance of being lost, we want to be extremely
conservative and proactive in our repair efforts, and the health of the
segments we have been repairing thus far also evaluates to infinity
under the immediate danger model. Thus, we find ourselves reaching for
an alternative.

Dr. Ben saves the day: the survivability model is a reasonably close
approximation of the immediate danger model, and even better, it is
far simpler to calculate and yields manageable values for real-world
segments. The downside to it is that it requires as input an estimate
of the total number of active nodes.

This change replaces the segment health calculation to use the
survivability model, and reinstates the call to SegmentHealth() where it
was reverted. It gets estimates for the total number of active nodes by
leveraging the reliability cache.

Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
2020-12-17 21:30:17 +00:00
Michal Niewrzal
70ba4deea9 satellite/repair/checker: adjust irreparable part of repair checker
Change-Id: I0732104a97ba18a5359de3966cd692677a0ff790
2020-12-17 14:11:22 +00:00
Michal Niewrzal
2381ca2810 Merge 'master' branch
Change-Id: I4a3e45a2a2cdacfd87d16b148cfb4c6671c20b15
2020-12-17 13:17:17 +01:00
Michal Niewrzal
8d3ea9c251 satellite/repair/repairer: implement SegmentRepairer with metabase
Change-Id: I647c625e00a626c44e812602ad9bc3e85a7b602c
2020-12-17 10:47:21 +00:00
Cameron Ayer
8c52bb3a18 satellite/checker: use numHealthy as segment health in repair queue
A few weeks ago it was discovered that the segment health function
was not working as expected with production values. As a bandaid,
we decided to insert the number of healthy pieces into the segment
health column. This should have effectively reverted our means of
prioritizing repair to the previous implementation.

However, it turns out that the bandaid was placed into the code which
removes items from the irreparable db and inserts them into the repair
queue.

This change inserts the number of healthy pieces into the repair queue in
the RemoteSegment method.

Change-Id: Iabfc7984df0a928066b69e9aecb6f615253f1ad2
2020-12-15 17:16:59 -05:00
Cameron Ayer
2ac72eaf16 satellite/repair/checker: add new monkit stats tagged with rs scheme
There is a new checker field called statsCollector. This contains
a map of stats pointers where the key is a stringified redundancy
scheme. stats contains all tagged monkit metrics. These metrics exist
under the key name, "tagged_repair_stats", which is tagged with the
name of each metric and a corresponding rs scheme.

As the metainfo observer works on a segment, it checks statsCollector
for a stats corresponding to the segment's redundancy scheme. If one
doesn't exist, it is created and chained to the monkit scope. Now we can call
Observe, Inc, etc on the fields just like before, and they have tags!

durabilityStats has also been renamed to aggregateStats.

At the end of the metainfo loop, we insert the aggregateStats totals into the
corresponding stats fields for metric reporting.

Change-Id: I8aa1918351d246a8ef818b9712ed4cb39d1ea9c6
2020-12-15 14:08:01 +00:00
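
The lazy per-scheme lookup described in 2ac72eaf16 might look roughly like this. The sketch leaves out monkit entirely and uses plain counters. The field names, the getStatsByRS helper and the example scheme strings are assumptions made for illustration.

```go
package main

import "fmt"

// stats holds the per-scheme counters. In the real checker these would be
// tagged monkit metrics; plain integers are used here as a stand-in.
type stats struct {
	segmentsChecked       int64
	segmentsNeedingRepair int64
}

// statsCollector maps a stringified redundancy scheme to its stats.
type statsCollector struct {
	stats map[string]*stats
}

func newStatsCollector() *statsCollector {
	return &statsCollector{stats: make(map[string]*stats)}
}

// getStatsByRS returns the stats for a scheme, creating them on first use.
func (c *statsCollector) getStatsByRS(rs string) *stats {
	s, ok := c.stats[rs]
	if !ok {
		s = &stats{}
		c.stats[rs] = s
	}
	return s
}

func main() {
	collector := newStatsCollector()

	// Hypothetical stringified schemes observed during one loop iteration.
	for _, rs := range []string{"29/35/80/110", "29/35/80/110", "16/20/30/40"} {
		collector.getStatsByRS(rs).segmentsChecked++
	}
	collector.getStatsByRS("16/20/30/40").segmentsNeedingRepair++

	fmt.Println(collector.getStatsByRS("29/35/80/110").segmentsChecked)
	fmt.Println(collector.getStatsByRS("16/20/30/40").segmentsNeedingRepair)
}
```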
Stefan Benten
494bd5db81 all: golangci-lint v1.33.0 fixes (#3985)
2020-12-05 17:01:42 +01:00
Moby von Briesen
75f0f713a3 satellite/repair/checker/checker.go: Use number of healthy pieces instead of SegmentHealth for injured segments queue.
We did not test the SegmentHealth function with actual production
values, and it turns out that values such as 52 healthy, 35 minimum
result in +Inf segment health - so pretty much all segments put into the
repair queue have the same health, which means we effectively aren't
sorting by health.

This change inserts numHealthy as segment health into the database so
the segments are ordered as they were before. We need to refine the
SegmentHealth function before we can support multi RS.

Change-Id: Ief19bbfee3594c5dfe94ca606bc930f05f85ff74
2020-11-28 12:16:32 -05:00
Moby von Briesen
575f50df84 satellite/repair: Update repair override config to support multiple RS schemes.
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.

The new format for RS override values is
"k/o/n-override,k/o/n-override..."

Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
2020-11-23 18:01:15 +00:00
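
A parser for the "k/o/n-override,k/o/n-override..." format from 575f50df84 could be sketched as below. This is not the satellite's config code. The type and function names are hypothetical, the fields are named after the letters in the format string without interpreting them further, and the sample values are invented.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// RepairOverride is one "k/o/n-override" entry.
type RepairOverride struct {
	K, O, N  int
	Override int
}

// parseOverride parses a single "k/o/n-override" item.
func parseOverride(item string) (RepairOverride, error) {
	parts := strings.SplitN(item, "-", 2)
	if len(parts) != 2 {
		return RepairOverride{}, fmt.Errorf("invalid override %q: missing '-'", item)
	}
	scheme := strings.Split(parts[0], "/")
	if len(scheme) != 3 {
		return RepairOverride{}, fmt.Errorf("invalid override %q: want k/o/n", item)
	}
	nums := make([]int, 0, 4)
	for _, field := range append(scheme, parts[1]) {
		v, err := strconv.Atoi(strings.TrimSpace(field))
		if err != nil {
			return RepairOverride{}, fmt.Errorf("invalid override %q: %w", item, err)
		}
		nums = append(nums, v)
	}
	return RepairOverride{K: nums[0], O: nums[1], N: nums[2], Override: nums[3]}, nil
}

// ParseOverrides parses a comma-separated list of overrides.
func ParseOverrides(s string) ([]RepairOverride, error) {
	if strings.TrimSpace(s) == "" {
		return nil, nil
	}
	var out []RepairOverride
	for _, item := range strings.Split(s, ",") {
		ro, err := parseOverride(strings.TrimSpace(item))
		if err != nil {
			return nil, err
		}
		out = append(out, ro)
	}
	return out, nil
}

func main() {
	overrides, err := ParseOverrides("29/80/110-52,16/30/40-20")
	if err != nil {
		panic(err)
	}
	for _, ro := range overrides {
		fmt.Printf("%d/%d/%d -> %d\n", ro.K, ro.O, ro.N, ro.Override)
	}
}
```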
Moby von Briesen
0ec685b173 satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.

This change just adds the new column in a migration. A separate change
will add the new health function.

Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.

Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.

Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
2020-11-16 21:18:09 +00:00
Cameron Ayer
da9f1f0611 satellite/repair: add monkit counter for segments below minimum required
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.

Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
2020-11-11 12:48:23 +00:00
Cameron Ayer
d63b7658e8 satellite/repair: fix lastSeenSegmentKey bug in IrreparableProcess
A change was made to use a metabase.SegmentKey (a byte slice alias)
as the last seen item to iterate through the irreparable DB in a
for loop. However, this SegmentKey was not initialized, thus it was
nil. This caused the DB query to return nothing, and healthy segments
could not be cleaned out of the irreparable DB.

Change-Id: Idb30d6fef6113a30a27158d548f62c7443e65a81
2020-11-09 14:48:15 +00:00
Egon Elbre
7ce372c686 satellite/internalpb: add inspectors
Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a
2020-10-30 13:28:17 +02:00
Egon Elbre
004e610d0f satellite/internalpb: move datarepair.pb to internal
Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d
2020-10-30 13:28:14 +02:00
littleskunk
ed1f6d7973 satellite/config: move repair override from config to default (#3958)
Co-authored-by: Igor <38665104+ihaid@users.noreply.github.com>
2020-10-28 17:24:39 +02:00
Kaloyan Raev
92a2be2abd satellite/metainfo: get away from using pb.Pointer in Metainfo Loop
As part of the Metainfo Refactoring, we need to make the Metainfo Loop
work with both the current PointerDB and the new Metabase. Thus, the
Metainfo Loop should pass more specific Object and Segment types to the
Observer interface instead of pb.Pointer.

After this change, there are still a couple of use cases that require
access to the pb.Pointer (hence we have it as a field in the
metainfo.Segment type):
1. Expired Deletion Service
2. Repair Service

It would require additional refactoring in these two services before we
are able to clean this.

Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f
2020-10-27 13:06:47 +00:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
Cameron Ayer
c2525ba2b5 satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.

Solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.

Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
2020-09-29 20:38:22 +00:00
Michal Niewrzal
27a9d14e2a satellite/repair: use metabase.SegmentKey type in repair package
Another change that is part of the refactoring to replace the path parameter
(string/[]byte) with a key parameter (metabase.SegmentKey).

Change-Id: I617878442442e5d59bbe5c995f913c3c93c16928
2020-09-08 19:35:20 +00:00
Michal Niewrzal
9202295348 satellite/metainfo: replace ScopedPath with metabase.SegmentLocation
Change-Id: I7e89c9e8eaeae58be828a32ad47ed3028501f4c7
2020-09-04 10:06:52 +00:00
Michal Niewrzal
0604a672c1 satellite/metainfo: use metabase in loop
Change-Id: I1bb0c6fe0a762895fde950690b06f7dd9d77e178
2020-09-01 10:06:16 +00:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
paul cannon
4997fd55d0 satellite/repair: remove healthy from irreparabledb
Change-Id: Ia9d300d0359883f03734d0bdf204d56d6642ce34
2020-06-26 21:26:00 +00:00
Moby von Briesen
290c006a10 satellite/repair/{checker,queue}: add metric for new segments added to repair queue
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration

Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
2020-05-27 06:23:47 +00:00
Moby von Briesen
178aa8b5e0 satellite/{metainfo,repair}: Delete expired segments from metainfo
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue

Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
2020-04-22 13:02:31 +00:00
paul cannon
ba5991dc86 satellite/repair: add monitoring for remote_segments_healthy_percentage
Change-Id: I6ad29fe1a947ac19d15e40ea33164a510eb33d4f
2020-03-17 17:45:59 +00:00
Moby von Briesen
e4da7bd9cd satellite/repair/checker: use repair override if available in checker and irreparable
In production, the satellite is overriding the default repair threshold
(35) to a higher value (52). In some places in the checker and
irreparable processes, the repair threshold on the redundancy scheme is
used in place of the override value. This fixes those cases.

Change-Id: Ie7387217d9fb3886f050b5e5b67be51f276196de
2020-03-06 15:39:53 -05:00
Moby von Briesen
d5540c89a1 satellite/repair/checker: add monkit metrics for segments immediately above repair threshold
Record counts for segments at health=rt+1 through health=rt+5 for every checker
iteration.

Change-Id: I2a00c0bc34d17beb21cacdeab4dac77f755faefe
2020-02-26 20:27:15 +00:00
Moby von Briesen
4e5a7f13c7 satellite/repair/queue: Prioritize selection of items off repair queue by segment health
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.

The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.

We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from one hour to allow us time to
try a wider variety of segments when the repair queue is very large.

Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
2020-02-26 09:54:16 -05:00