Commit Graph

127 Commits

Author SHA1 Message Date
Ivan Fraixedes
f420b29d35
[V3-1927] Repairer uploads to max threshold instead of success… (#2423)
* pkg/datarepair: Add test to check num upload pieces
  Add a new test for ensuring the number of pieces that the repair process
  upload when a segment is injured.
* satellite/orders: Don't create "put order limits" over total
  Repair must not create "put order limits" more than the total count.
* pkg/datarepair: Update upload repair pieces test
  Update the test which checks the number of pieces which are uploaded
  during a repair for using the same excess over the success threshold
  value than the implementation.
* satellites/orders: Limit repair put order for not being total
  Limit the number of put orders to be used by repair for only uploading
  pieces to a % excess over the successful threshold.
* pkg/datarepair: Change DataRepair test to pass again
  Make some changes in the DataRepair test to make pass again after the
  repair upload repaired pieces only until a % excess over success
  threshold.
  Also update the steps description of the DataRepair test after it has been
  changed, to match on what's now, besides to leave it more generic for
  avoiding having to update it on minimal future refactorings.
* satellite: Make repair excess optimal threshold configurable
  Add a new configuration parameter to the satellite for being able to
  configure the percentage excess over the optimal threshold, used for
  determining how many pieces should be repaired/uploaded, rather than
  having the value hard coded.
* repairer: Add configurable param to segments/repairer
  Add a new parameters to the segment/repairer to calculate the maximum
  number of excess nodes, based on the optimal threshold, that repaired
  pieces can be uploaded.
  This new parameter has been added for not returning more nodes than the
  number of upload orders for data repair satellite service calculate for
  repairing pieces.
* pkg/storage/ec: Update log message in clien.Repair
* satellite: Update configuration lock file
2019-07-12 00:44:47 +02:00
Michal Niewrzal
268c629ba8
Replace base64 encoding for path segments (#2345) 2019-07-11 13:26:07 -04:00
Maximillian von Briesen
de85d17069
Add checker metrics (#2487)
checker_segment_total_count - Number of total segments in pointer during checker iteration
checker_segment_healthy_count - Number of healthy segments in pointer during checker iterationn
time_since_checker_queue - Seconds elapsed between checker queue and beginning repair
time_for_repair - Seconds elapsed between beginning repair and ending repair/dequeueing
2019-07-10 17:27:46 -04:00
JT Olio
a79c7d77f3 overlay cache: slight modification of node-is-online rules (#2490) 2019-07-09 22:36:09 -04:00
Alexander Leitner
3587e1a579 Change pointerdb pointer to use time.Time for Creation date (#2483) 2019-07-09 00:16:50 +02:00
Egon Elbre
674742d1a7
satellite/datarepair: use reliability cache (#1976) 2019-07-09 01:04:35 +03:00
Ivan Fraixedes
ea9a8b37bc pkg/datarepair/checker: Fix typo in doc comment (#2461) 2019-07-05 22:20:58 +02:00
Maximillian von Briesen
52e5a4eee3 pass logger into repairer and ecclient (#2365) 2019-07-02 13:08:02 +03:00
Natalie Villasana
4ba212107b
add back testrand (#2408) 2019-07-01 12:39:04 -04:00
Ivan Fraixedes
b6ae6a4e2c
pkg/datarepair/checker: Fix doc comments (#2401) 2019-07-01 18:13:18 +02:00
Natalie Villasana
3ffb42483f
account for disqualified nodes in data repair tests (#2315) 2019-07-01 11:34:42 -04:00
Natalie Villasana
3f643551e7 remove flakiness in TestDataRepair and TestSegmentStoreRepair (#2335)
* stop audit loop in repair tests to prevent possible timeout
2019-07-01 11:15:45 -04:00
Egon Elbre
b6ad3e9c9f
internal/testrand: new package for random data (#2282) 2019-06-26 13:38:51 +03:00
Egon Elbre
c28f800098
Skip TestDataRepair and TestUplinksParallel, because they are flaky (#2337) 2019-06-25 19:30:39 +03:00
Kaloyan Raev
964c87c476 Fix checks around repair threshold (#2246) 2019-06-19 22:13:11 +02:00
JT Olio
e58a06bd0c config: update release values to match prod (#2192) 2019-06-15 18:19:19 +02:00
Kaloyan Raev
ebd9b375fc
Repair should not corrupt files (#2194) 2019-06-14 12:16:31 +03:00
ethanadams
8f2dca8437 Re-enabling and fixing repairer tests (#2099)
* Disabled discovery service by changiing from Stop() to Pause()

Paused to solve race condition.  If discovery is running, it may mark a node "up" after they've been manually marked "down" in this test.

* Extend to the repair timeout

Fixes intermittent test failures when repairs were taking more than 2 seconds.

* Re-enabled test. Disabled discovery service by changiing from Stop() to Pause()

* Changed back to Stop.

* Revert "Changed back to Stop."

This reverts commit 46d410e72dfae63e0c44915be42784cc9a7b5abf.

* re-enabling TestIdentifyInjuredSegments

* Changed Pause to Stop.  Commented on timeout change

* testing...

* temporarily skipping audit tests

* changing back to discover Stop for testing via jenkins

* Revert "changing back to discover Stop for testing via jenkins"

This reverts commit 6aa8558b11a0053c30e0c8b2dbf0d6c0cb34ee6c.

* Changing back to Stop().  Depends on PR 2137

* Revert "temporarily skipping audit tests"

This reverts commit 1940ed9b315d663a0eb6c95521780cbcb48cb121.

* Removed reference to Graveyard since its been removed
2019-06-10 09:06:21 +02:00
JT Olio
43d4f3daf5 discovery: remove graveyard (#2145) 2019-06-07 08:40:51 +03:00
JT Olio
f1641af802 storage: add monkit task to missing places (#2122)
* storage: add monkit task to missing places

Change-Id: I9e17a6b14f7c25bbf698eeecf32785e9add3f26e

* fix tests

Change-Id: Id078276fa3de61a28eb3d01d4e751732ecbb173f

* import order

Change-Id: I814e33755b9f10b5219af37cd828cd75eb3da1a4

* remove part of other commit

Change-Id: Idaa4c95cd65e97567fb466de49718db8203cfbe1
2019-06-05 16:23:10 +02:00
JT Olio
3fe8343b6c repairer: fix config comments (#2105) 2019-06-04 14:13:31 +02:00
JT Olio
9c5708da32 pkg/*: add monkit task to missing places (#2109) 2019-06-04 13:36:27 +02:00
aligeti
4ad5120923
Checker service refactor (v3-1871) (#2082)
*  refactor the checker service

* monkit update
2019-05-31 10:12:49 -04:00
aligeti
934ebf9cbf
Added the irreparable repair functionality (#1955)
* Added the irreparable repair functionality
2019-05-30 11:18:20 -04:00
Maximillian von Briesen
da91d22376 properly check last iteration of checker (#2040) 2019-05-23 18:14:08 +02:00
Maximillian von Briesen
b4f18226db
Send number of files as part of durability stats (#2030) 2019-05-22 18:50:43 -04:00
Maximillian von Briesen
45a2253628 Send durability stats after iterating over all segments (#2028) 2019-05-22 17:17:52 -04:00
Bill Thorp
6522579ecb better repairer logging (#2006)
* logging and delete only repairs with no errors

* removing delete logi~c
2019-05-21 00:05:28 +02:00
Bill Thorp
91721f63ba
Bt/repair no nodes (#1974)
* handle cases where repair is equal to total
2019-05-17 15:02:40 -04:00
aligeti
60cf1dafb0
repair segment reassess it missing pieces just before repair (#1939)
* repair segment reaccess it missing pieces just before repair to see if it actually needs repair
2019-05-16 09:49:10 -04:00
Bill Thorp
4002ed4463 unskip TestIdentifyIrreparableSegments (#1927) 2019-05-09 15:55:34 +03:00
Natalie Villasana
b48f584cea
repair checker resumes iterating where left off (#1879) 2019-05-08 13:59:50 -04:00
Bill Thorp
ea978dd674
hopefully sensible satellite defaults (#1888)
* hopefully sensible satellite defaults
2019-05-07 10:44:47 -04:00
Bill Thorp
6ece4f11ad
moved invalid/offline back into SQL (#1838)
* moved invalid/offline back into SQL, removed GetAll()
2019-05-01 09:45:52 -04:00
Bill Thorp
2c9ef5b107
longer repair window (#1866) 2019-04-30 11:20:18 -04:00
Egon Elbre
db939d37ec
cover all the things (#1818) 2019-04-26 16:39:11 +03:00
Michal Niewrzal
fe3dfc1587
Move pointerdb.Service to satellite (#1826) 2019-04-25 10:46:32 +02:00
Egon Elbre
c284cfde30
ensure TestParallel doesn't deadlock on error (#1808) 2019-04-24 13:15:46 +03:00
Bill Thorp
cd4a3e06d8
wired up IsHealthy to config (#1820)
* wired up IsHealthy to config
2019-04-23 18:45:50 -04:00
Fadila
8ddf481b33 Checker: invalid and offline nodes search update (#1812)
* simplified invalid and offline login into getMissingPieces
2019-04-23 16:54:39 -04:00
Egon Elbre
bdd0d778eb
interface tests belong to the interface, not the implementation (#1794) 2019-04-23 11:47:16 +03:00
Egon Elbre
f49b0acc5b repair until queue is empty (#1716)
* repair until queue is empty
2019-04-22 11:16:21 -04:00
Natalie Villasana
8d1f614662 removes unused queue code, moves queue_test.go to repairqueue_test.go in satellitedb dir (#1783) 2019-04-22 13:35:52 +03:00
Bill Thorp
9dc4e82437
removed commented code, removed unnecessary pointer (#1766)
* removed commented code
2019-04-16 15:55:28 -04:00
Bill Thorp
17a227e6e9
refactor injuredsegments db so that we can't have duplicates (#1717)
made repairqueue not use a true queue, forbid duplicates
2019-04-16 14:14:09 -04:00
JT Olio
ffdb2e7728
actually skip the data repair test (#1728)
Change-Id: I76286fc6cc5129d8be50d45a684a3e0dce9c0cc6
2019-04-09 23:29:05 -06:00
JT Olio
61ec92f2e8
disable datarepair test for now (#1727)
Change-Id: I1854817ea051ab621936f587b198de2da07c9960
2019-04-09 22:31:55 -06:00
Maximillian von Briesen
3fb4813227
Fix data repair checker missing pieces list (#1705) 2019-04-08 15:46:23 -04:00
Maximillian von Briesen
bb3b4e4816 Data repair integration test (#1582) 2019-04-08 13:33:47 -04:00
littleskunk
43ef0eb4c3
Don't crash on audit and repair failures (#1622)
* Fix satellite crash on repair

(cherry picked from commit cabf6c9f97780f900d76e2388ffa54b916f14528)

* Fix satellite crash on audit

(cherry picked from commit 9da67488c4b36a378f346fbb27651316284b0f36)
2019-04-01 11:16:17 +02:00