At the moment segment repairer is skipping offline nodes in checks like
clumped pieces and off placement pieces. This change is fixing this
problem using new version of KnownReliable method. New method is
returning both online and offline nodes. Provided data can be used to
find clumped and off placement pieces.
We are not using DownloadSelectionCache anymore with segment repairer.
https://github.com/storj/storj/issues/5998
Change-Id: I236a1926e21f13df4cdedc91130352d37ff97e18
When pieces fail an audit (hard fail, meaning the node acknowledged it
did not have the piece or the piece was corrupted), we will now remove
those pieces from the segment.
Previously, we did not do this, and some node operators were seeing the
same missing piece audited over and over again and losing reputation
every time.
This change will include both verification and reverification audits. It
will also apply to pieces found to be bad during repair, if
repair-to-reputation reporting is enabled.
Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7
Additional test case to cover situation where we are trying to
repair segment with specific placement set. We need to be sure
that segment won't be repaired into nodes that are outside
segment placement, even if that means that repair will fail.
Change-Id: I99d238aa9d9b9606eaf89cd1cf587a2585faee91
This change makes dial timeout configurable and change it also from
defatul 20s to 5s. Main motivation is that during repair we often loose
lots of time to dial which eventually will fail. New timeout should be
still enough to dial but we will move forward quicker to next node if
that one will fail.
Timeout is also applied directly as context timeout in case we will
use noise of tcp fast open one day.
Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea
We missed to set placement as a part of selection request. It can case
uploading repaired data out of specified placement.
I will provide test as a separate change.
Change-Id: I4efe67f2d5f545a1d70e831e5d297f0977a4eed1
We were reusing a slice to save on allocations, but it turns out the
function using it was being called in multiple goroutines at the same
time.
This is definitely a problem with repairer/segments.go. I'm not 100%
sure if it also is a problem with checker/observer.go, but I'm making
the change there as well to be on the safe side for now.
Repair workers only ran with this bug on testing satellites, and it
looks like the worst that could have happened was that we repaired
pieces off of well-behaved, non-clumped, in-placement nodes by mistake.
Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d
Checker when qualifying segment for repair is now looking at pieces
location and if they are outisde segment placement puts them into
repair queue.
Fixes https://github.com/storj/storj/issues/5895
Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7
Segment repairer should take into account segment 'placement' field
and remove or repair pieces from nodes that are outside this placement.
In case when after considering pieces out of placement we are still above
repair threshold we are only updating segment pieces to remove
problematic pieces. Otherwise we are doing regular repair.
https://github.com/storj/storj/issues/5896
Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59
rather than only logging the last_nets we see in clumpedPieces, this
will run through all the last_nets and log any that have more than one
node. This should have the same outcome, except the counts will be 1
higher (because FindClumpedPieces won't include the first node found in
a clumped network, and this will).
This should be quite a bit faster.
Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a
We were using the UploadSelectionCache previously, which does _not_ have
all nodes, or even all online nodes, in it. So all nodes with less than
MinimumVersion, or with less than MinimumDiskSpace, or nodes suspended
for unknown audit errors, or nodes that have started graceful exit, were
all missing, and ended up having empty last_nets. Even with all that,
I'm kind of surprised how many nodes this involved, but using the upload
selection cache was definitely wrong.
This change uses the download selection cache instead, which excludes
nodes only when they are disqualified, gracefully exited (completely),
or offline.
Change-Id: Iaa07c988aa29c1eb05796ac48a6f19d69f5826c1
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.
Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.
Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
Clumped segments (segments with multiple pieces on the same subnet) may
need repair, but the clumped pieces are considered retrievable and we
don't need to call such segments irreparable.
We do want to know where they're coming from, though, if we can, because
we are seeing more than expected.
Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9
The query for GetNodesNetworkInOrder is causing far too much load on the
database. Since it is not critical that the repair checker have
perfectly up-to-date node network information, we can use a cache
instead.
Change-Id: I07ad45bfdeb46529da093941a06c2da8a00ce878
We will remove segments loop soon so we need first to move
Segment definition to rangedloop package.
https://github.com/storj/storj/issues/5237
Change-Id: Ibe6aad316ffb7073cc4de166f1f17b87aac07363
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.
This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).
Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
The blobstore implementation is entirely related to storagenode, so the
rightful place is together with the storagenode implementation.
Fixes https://github.com/storj/storj/issues/5754
Change-Id: Ie6637b0262cf37af6c3e558556c7604d9dc3613d
The following tests should be made less flaky by this change:
- TestFailedDataRepair
- TestOfflineNodeDataRepair
- TestUnknownErrorDataRepair
- TestMissingPieceDataRepair_Succeed
- TestMissingPieceDataRepair
- TestCorruptDataRepair_Succeed
- TestCorruptDataRepair_Failed
This follows on to a change in commit 6bb64796. Nearly all tests in the
repair suite used to rely on events happening in a certain order. After
some of our performance work, those things no longer happen in that
expected order every time. This caused much flakiness.
The fix in 6bb64796 was sufficient for the tests operating directly on
an `*ECRepairer` instance, but not for the tests that make use of the
repairer by way of the repair queue and the repair worker. These tests
needed a different way to indicate the number of expected failures. This
change provides that different way.
Refs: https://github.com/storj/storj/issues/5736
Refs: https://github.com/storj/storj/issues/5718
Refs: https://github.com/storj/storj/issues/5715
Refs: https://github.com/storj/storj/issues/5609
Change-Id: Iddcf5be3a3ace7ad35fddb513ab53dd3f2f0eb0e
It was surprising that `satellite auditor` complained about SMTP mail settings, even if it's not supposed to sending any mail.
Looks like we can remove the mail service dependency, as it's not a hard requirement for overlay.Service.
Change-Id: I29a52eeff3f967ddb2d74a09458dc0ee2f051bd7
It was possible to get into a situation where successfulPieces =
es.RequiredCount(), errorCount < minFailures, and inProgress == 0 (when
the succeeding gets all completed before the failures), whereupon the
last goroutine in the limiter would sit and wait forever for another
goroutine to finish.
This change corrects the handling of that situation.
As an aside, this is really pretty confusing code and we should think
about redoing the whole function.
Change-Id: Ifa3d3ad92bc755e563fd06b2aa01ef6147075a69
This code is essentially replacement for eestream.CalcPieceSize. To call
eestream.CalcPieceSize we need eestream.RedundancyStrategy which is not
trivial to get as it requires infectious.FEC. For example infectious.FEC
creation is visible on GE loop observer CPU profile because we were
doing this for each segment in DB.
New method was added to storj.Redundancy and here we are just wiring it
with metabase Segment.
BenchmarkSegmentPieceSize
BenchmarkSegmentPieceSize/eestream.CalcPieceSize
BenchmarkSegmentPieceSize/eestream.CalcPieceSize-8 5822 189189 ns/op 9776 B/op 8 allocs/op
BenchmarkSegmentPieceSize/segment.PieceSize
BenchmarkSegmentPieceSize/segment.PieceSize-8 94721329 11.49 ns/op 0 B/op 0 allocs/op
Change-Id: I5a8b4237aedd1424c54ed0af448061a236b00295
Additional elements added:
* monkit metric for observers methods like Start/Fork/Join/Finish to
be able to check how much time those methods are taking
* few more logs e.g. entries with processed range
* segmentsProcessed metric to be able to check loop progress
Change-Id: I65dd51f7f5c4bdbb4014fbf04e5b6b10bdb035ec
Several tests using `(*ECRepairer).Get()` have begun to exhibit flaky
results. The tests are expecting to see failures in certain cases, but
the failures are not present. It appears that the cause of this is that,
sometimes, the fastest good nodes are able to satisfy the repairer
(providing RequiredCount pieces) before the repairer is able to identify
the problem scenario we have laid out.
In this commit, we add an argument to `(*ECRepairer).Get()` which
specifies how many failure results are expected. In normal/production
conditions, this parameter will be 0, meaning Get need not wait for
any errors and should only report those that arrived while waiting for
RequiredCount pieces (the existing behavior). But in these tests, we can
request that Get() wait for enough results to see the errors we are
expecting.
Refs: https://github.com/storj/storj/issues/5593
Change-Id: I2920edb6b5a344491786aab794d1be6372c07cf8
implemented observer and partial, created new structures to keep mon
metrics remain in same way as in segment loop
Change-Id: I209c126096c84b94d4717332e56238266f6cd004
The Reporter is responsible for processing results from auditing
operations, logging the results, disqualifying nodes that reached
the maximum reverification count, and passing the results on to
the reputation system.
In this commit, we extend the Reporter so that it knows how to process
the results of piecewise reverification audits.
We also change most reporter-related tests so that reverifications
happen as piecewise reverification audits, exercising the new code.
Note that piecewise reverification audits are not yet being done outside
of tests. In a later commit, we will switch from doing segmentwise
reverifications to piecewise reverifications, as part of the
audit-scaling effort.
Refs: https://github.com/storj/storj/issues/5230
Change-Id: I9438164ce1ea4d9a1790d18d0e1046a8eb04d8e9
Reputation updates during repair currently consumes a lot of database
resources. Sometimes increasing the rate of repair is more important
than auditing a node based on whether they have or don't have the
correct piece during repair. This is the job of the audit service.
This commit is to implement an intermediate solution from this issue: https://github.com/storj/storj/issues/5089
This commit does not address the more in-depth fix discussed here: https://github.com/storj/storj/issues/4939
Change-Id: I4163b18d78a96fadf5265789fd73c8aa8def0e9f
As part of the effort of splitting out the auditor workers to their own
process, we are transitioning the communication between the auditor
chore and the verification workers to a queue implemented in the
database, rather than the sequence of in-memory queues we used to use.
This logical database is safely partitionable from the rest of
satelliteDB.
Refs: https://github.com/storj/storj/issues/5251
Change-Id: I6cd31ac5265423271fbafe6127a86172c5cb53dc
Add nodeevents.DB to satellite overlay service so we can insert node
events into the nodeevents DB.
Change-Id: I642c0ccc9941ecdb08cb22d5c8cf701959a55156
We want to send emails to SNOs. Node status changes go through the
overlay service, so it's a good place to add the mail service.
Add the mailservice.Service, satellite address, and satellite name to
overlay service. Also add feature flag --overlay.send-node-emails
Change-Id: I3bd2cb3bf22f9724954ce2374f8b651b902b3a24
It looks that monikt monitoring can give high CPU overhead for
segments loop observer. With this code we are changing how monitoring
is initialized for observer methods. This optimization affects mainly
path where segment is healthy and doesn't require repair. Benchmark
is also added to show difference between old and new approach.
Benchmark against 'main':
name old time/op new time/op delta
RemoteSegment/Cockroach/healthy_segment-8 8.55µs ± 4% 1.37µs ± 6% -84.03% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
RemoteSegment/Cockroach/healthy_segment-8 2.63kB ± 0% 0.17kB ± 0% -93.62% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
RemoteSegment/Cockroach/healthy_segment-8 54.0 ± 0% 8.0 ± 0% -85.19% (p=0.008 n=5+5)
Change-Id: Ie138eab0d59e436395b13f57bdfb11f9871d4c18
We have an alert on `repair_too_many_nodes_failed` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.
Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the repair framework to act in that way.
Instead, in this change, we change the way that
`repair_too_many_nodes_failed` works. When a repair fails, we collect
piece fetch errors by type and determine from them whether it looks like
we are having network problems (most errors are connection failures,
possibly also some successful connections which subsequently time out)
or whether something else has happened.
We will now only emit `repair_too_many_nodes_failed` when the outcome
does not look like a network failure. In the network failure case, we
will instead emit `repair_suspected_network_problem`.
Refs: https://github.com/storj/storj/issues/4669
Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926
This structure is entirely unused within the audit module, and is only
used by repair code. Accordingly, this change moves the structure from
audit code to repair code.
Also, we take the opportunity here to rename the structure to something
less generic.
Refs: https://github.com/storj/storj/issues/4669
Change-Id: If85b37e08620cda1fde2afe98206293e02b5c36e
This is in response to community feedback that our existing reputation
calculation is too likely to disqualify storage nodes unfairly with
extreme swings up and down.
For details and analysis, please see the data_loss_vs_dq_chance_sim.py
tool, the "tuning reputation further.ipynb" Jupyter notebook in the
storj/datascience repository, and the discussion at
https://forum.storj.io/t/tuning-audit-scoring/14084
In brief: changing the lambda and initial-alpha parameters in this way
causes the swings in reputation to be smaller and less likely to put a
node past the disqualification threshold unfairly.
Note: this change will cause a one-time reset of all (non-disqualified)
node reputations, because the new initial alpha value of 1000 is
dramatically different, and the disqualification threshold is going to
be much higher.
Change-Id: Id6dc4ba8fde1be3db4255b72282207bab5491ca3
I don't know why the go people thought this was a good idea, because
this automatic reformatting is bound to do the wrong thing sometimes,
which is very annoying. But I don't see a way to turn it off, so best to
get this change out of the way.
Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3
We made optimization for segment loop observers to avoid
heavy monkit initialization on each call. It was applied to very
often executed methods. Unfortunately we used wrong monkit
method to track function times. Instead mon.Task we used
mon.Func().
https://github.com/spacemonkeygo/monkit#how-it-works
Change-Id: I9ca454dbd828c6b43ba09ca75c341991d2fd73a8
We retry a GET_REPAIR operation in one case, and one case only (as far
as I can determine): when we are trying to connect to a node using its
last known working IP and port combination rather than its supplied
hostname, and we think the operation failed the first time because of a
Dial failure.
However, logs collected from storage node operators along with logs
collected from satellites are strongly indicating that we are retrying
GET_REPAIR operations in some cases even when we succeeded in connecting
to the node the first time. This results in the node complaining loudly
about being given a duplicate order limit (as it should), whereupon the
satellite counts that as an unknown error and potentially penalizes the
node.
See discussion at
https://forum.storj.io/t/get-repair-error-used-serial-already-exists-in-store/17922/36
.
Investigation into this problem has revealed that
`!piecestore.CloseError.Has(err)` may not be the best way of determining
whether a problem occurred during Dial. In fact, it is probably
downright Wrong. Handling of errors on a stream is somewhat complicated,
but it would appear that there are several paths by which an RPC error
originating on the remote side might show up during the Close() call,
and would thus be labeled as a "CloseError".
This change creates a new error class, repairer.ErrDialFailed, with
which we will now wrap errors that _really definitely_ occurred during
a Dial call. We will use this class to determine whether or not to retry
a GET_REPAIR operation. The error will still also be wrapped with
whatever wrapper classes it used to be wrapped with, so the potential
for breakage here should be minimal.
Refs: https://github.com/storj/storj/issues/4687
Change-Id: Ifdd3deadc8258f34cf3fbc42aff393fa545794eb