satellite/repair: fix flaky test TestECRepairerGetOffline

It was possible to get into a situation where successfulPieces ==
es.RequiredCount(), errorCount < minFailures, and inProgress == 0 (when
the succeeding gets all completed before the failures), whereupon the
last goroutine in the limiter would sit and wait forever for another
goroutine to finish.

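This change corrects the handling of that situation: a goroutine now
waits only if the in-progress downloads could also still satisfy
minFailures (errorCount+inProgress >= minFailures); otherwise it goes on
to start another download.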
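
The stuck state described above can be reproduced in isolation. The
following is a minimal sketch, not the repairer code itself: the state
struct, the helper names (done, shouldWaitOld, shouldWaitNew), and the
numeric values are all hypothetical, chosen only to mirror the counters
and conditions that appear in this commit.

```go
package main

import "fmt"

// state is a hypothetical snapshot of the counters named in the commit
// message; in the real code these are variables guarded by cond.L.
type state struct {
	successfulPieces int
	errorCount       int
	inProgress       int
}

// done mirrors the early-return check: all needs are already met.
func done(s state, required, minFailures int) bool {
	return s.successfulPieces >= required && s.errorCount >= minFailures
}

// shouldWaitOld mirrors the wait condition before this change.
func shouldWaitOld(s state, required int) bool {
	return s.successfulPieces+s.inProgress >= required
}

// shouldWaitNew mirrors the wait condition after this change: waiting is
// only worthwhile if in-progress downloads could also satisfy minFailures.
func shouldWaitNew(s state, required, minFailures int) bool {
	return s.successfulPieces+s.inProgress >= required &&
		s.errorCount+s.inProgress >= minFailures
}

func main() {
	// Illustrative values only: every successful get has finished, no
	// failure has been seen yet, and nothing else is running.
	required, minFailures := 2, 1
	stuck := state{successfulPieces: 2, errorCount: 0, inProgress: 0}

	fmt.Println("done:         ", done(stuck, required, minFailures))
	fmt.Println("old says wait:", shouldWaitOld(stuck, required))
	fmt.Println("new says wait:", shouldWaitNew(stuck, required, minFailures))
}
```

With the old condition the last goroutine would call cond.Wait() even
though inProgress is 0, so nothing is left to signal it. With the new
condition it falls through instead of waiting.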

As an aside, this is really pretty confusing code and we should think
about redoing the whole function.

Change-Id: Ifa3d3ad92bc755e563fd06b2aa01ef6147075a69
paul cannon 2023-02-24 08:53:41 -06:00
parent 4a6e34bb2c
commit 20bcdeb8b1

@@ -121,7 +121,14 @@ func (ec *ECRepairer) Get(ctx context.Context, limits []*pb.AddressedOrderLimit,
 		return
 	}
-	if successfulPieces+inProgress >= es.RequiredCount() {
+	if successfulPieces+inProgress >= es.RequiredCount() && errorCount+inProgress >= minFailures {
+		// we know that inProgress > 0 here, since we didn't return on the
+		// "successfulPieces >= es.RequiredCount() && errorCount >= minFailures" check earlier.
+		// There may be enough downloads in progress to meet all of our needs, so we won't
+		// start any more immediately. Instead, wait until all needs are met (in which case
+		// cond.Broadcast() will be called) or until one of the inProgress workers exits
+		// (in which case cond.Signal() will be called, waking up one waiter) so we can
+		// reevaluate the situation.
 		cond.Wait()
 		continue
 	}