Commit Graph

21 Commits

Author SHA1 Message Date
paul cannon
ba5991dc86 satellite/repair: add monitoring for remote_segments_healthy_percentage
Change-Id: I6ad29fe1a947ac19d15e40ea33164a510eb33d4f
2020-03-17 17:45:59 +00:00
Moby von Briesen
e4da7bd9cd satellite/repair/checker: use repair override if available in checker and irreparable
In production, the satellite is overriding the default repair threshold
(35) to a higher value (52). In some places in the checker and
irreparable processes, the repair threshold on the redundancy scheme is
used in place of the override value. This fixes those cases.

Change-Id: Ie7387217d9fb3886f050b5e5b67be51f276196de
2020-03-06 15:39:53 -05:00
Moby von Briesen
d5540c89a1 satellite/repair/checker: add monkit metrics for segments immediately above repair threshold
Record counts for segments at health=rt+1 through health=rt+5 for every checker
iteration.

Change-Id: I2a00c0bc34d17beb21cacdeab4dac77f755faefe
2020-02-26 20:27:15 +00:00
Moby von Briesen
4e5a7f13c7 satellite/repair/queue: Prioritize selection of items off repair queue by segment health
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.

The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.

We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from on hour to allow us time to
try a wider variety of segments when the repair queue is very large.

Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
2020-02-26 09:54:16 -05:00
Jeff Wendling
7999d24f81 all: use monkit v3
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.

graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.

it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.

Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
2020-02-05 23:53:17 +00:00
Egon Elbre
8dea4f52db satellite: add control panel
Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b
2020-01-30 08:06:43 -05:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
Egon Elbre
ee6c1cac8a
private: rename internal to private (#3573) 2019-11-14 21:46:15 +02:00
Natalie Villasana
cf430d2d73
scripts: add check-monitoring script to detect changes to monkit calls (#3114) 2019-10-15 13:00:14 -04:00
littleskunk
b908a09c8e
satellite/repair: remove deprecated error message (#3193) 2019-10-06 20:54:20 +02:00
Stefan Benten
1db4251234 Satellite/repair: Add Repair Threshold Override to allow earlier repair (#3151) 2019-10-02 14:58:37 +02:00
Michal Niewrzal
1c72e80e40 uplink/satellite: fix for case when inline segment is last one (#3062)
* uplink/satellite: fix when inline seg is last one

* review comments
2019-09-19 01:18:14 +02:00
Natalie Villasana
cc70cd2329
satellite/repair: add metric trackers for segment age before repair (#3056) 2019-09-17 15:18:48 -04:00
Egon Elbre
7240e6cbb2
satellite: remove remote/inline file from BucketTally (#3041) 2019-09-13 16:51:41 +03:00
Egon Elbre
8b668ab1f8
satellite/metainfo.Loop: use a parsed path for observers (#3003) 2019-09-12 13:38:49 +03:00
Egon Elbre
a801fab66a
all: add archview annotations (#2964) 2019-09-10 16:24:16 +03:00
Maximillian von Briesen
fb10815229 Repair with hashes (#2925)
* add outline for ECRepairer

* add description of process in TODO comments

* begin download/getting hash for a single piece

* verify piece hash and order limit during download

* fix download piece

* begin filling out ESREpair. Get

* wip move ecclient.Repair to ecrepairer.Repair

* pass satellite signee into repairer

* reconstruct original stripe from pieces

* move rebuildStripe()

* calculate piece size differently, increment successful count

* fix shares slices initialization

* rename stripeData to segment

* do not pad reader in Repair()

* temp debug

* create unsafeRSScheme

* use decode reader

* rename file name to be all lowercase

* make repair downloader async

* declare condition variable inside Get method

* set downloadAndVerifyPiece's in-memory buffer to be share size

* update unusedLimits var

* address comments

* remove unnecessary comments

* move initialization of segmentRepaire to be outside of repairer service

* use ReadAll during download

* remove dots and move hashing to after validating for order limit signature

* wip test

* make sure files exactly at min threshold are repaired

* remove unused code

* use corrput data and write back to storagenode

* only create corrupted node and piece ids once

* add comment

* address nat's comment

* fix linting and checker_test

* update comment

* add comments

* remove "copied from ecclient" comments

* add clarification comments in ec.Repair
2019-09-06 15:20:36 -04:00
Egon Elbre
00b2e1a7d7 all: enable staticcheck (#2849)
* by having megacheck in disable it also disabled staticcheck

* fix closing body

* keep interfacer disabled

* hide bodies

* don't use deprecated func

* fix dead code

* fix potential overrun

* keep stylecheck disabled

* don't pass nil as context

* fix infinite recursion

* remove extraneous return

* fix data race

* use correct func

* ignore unused var

* remove unused consts
2019-08-22 13:40:15 +02:00
Egon Elbre
c8edeb0257
satellite/overlay: rename overlay.Cache to overlay.Service (#2717) 2019-08-06 19:35:59 +03:00
Maximillian von Briesen
4547084f26
Implement checker observer (#2620)
Integrate checker into metainfo loop
2019-08-01 14:44:32 -04:00
Egon Elbre
5d0816430f
rename all the things (#2531)
* rename pkg/linksharing to linksharing
* rename pkg/httpserver to linksharing/httpserver
* rename pkg/eestream to uplink/eestream
* rename pkg/stream to uplink/stream
* rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo
* rename pkg/auth/signing to pkg/signing
* rename pkg/storage to uplink/storage
* rename pkg/accounting to satellite/accounting
* rename pkg/audit to satellite/audit
* rename pkg/certdb to satellite/certdb
* rename pkg/discovery to satellite/discovery
* rename pkg/overlay to satellite/overlay
* rename pkg/datarepair to satellite/repair
2019-07-28 08:55:36 +03:00