We have frequently wanted to see this sort of information when looking
into issues. Considering that we already log a line at Info every time
we fail to download a piece during repair, it doesn't seem like a very
onerous new burden to log one line per segment saying that it is being
repaired.
Change-Id: I1fa84a985e90473adeb02603e207fc3c7b8592da
This is fixing two small issues with logging:
* repair checker was logging empty last net values as clumped pieces
but in main logic we stopped classifying it this way
* repairer summary log was showing incorrect number of pieces removed
from segment because list contains duplicated entries
There was no real issue here.
Change-Id: Ifc6a83637d621d628598200cad00cce44fa4cbf9
ClassifySegmentPieces uses custom set implementation instead map.
Side note, for custom set implementation I also checked int8 bit set but
it didn't give better performance so I used simpler implementation.
Benchmark results (compared against part 2 optimization change):
name old time/op new time/op delta
RemoteSegment/healthy_segment-8 21.7µs ± 8% 15.4µs ±16% -29.38% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
RemoteSegment/healthy_segment-8 7.41kB ± 0% 1.87kB ± 0% -74.83% (p=0.000 n=5+4)
name old allocs/op new allocs/op delta
RemoteSegment/healthy_segment-8 150 ± 0% 130 ± 0% -13.33% (p=0.008 n=5+5)
Change-Id: I21feca9ec6ac0a2558ac5ce8894451c54f69e52d
It would appear that we have been making concurrent accesses to
statsCollector for a long, long time (we expect there to be multiple
calls to `Repair()` at the same time on the same instance of
`SegmentRepairer`, up to `config.MaxRepair`, and before this change
there was no sort of synchronization guarding accesses to the
`statsCollector.stats` map.
Refs: https://github.com/storj/storj/issues/6402
Change-Id: I5bcdd13c88913a8d66f6dd906c9037c588960cc9
Optimizing collecting monkit metrics:
* initialize metrics once at the begining
* avoid using string in map for getting stats structs per redundancy
Benchmark results (compared against part 1 optimization change):
name old time/op new time/op delta
RemoteSegment/Cockroach/healthy_segment-8 31.4µs ± 6% 21.7µs ± 8% -30.73% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
RemoteSegment/healthy_segment-8 10.2kB ± 0% 7.4kB ± 0% -27.03% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
RemoteSegment/healthy_segment-8 250 ± 0% 150 ± 0% -40.00% (p=0.008 n=5+5)
Change-Id: Ie09476eb469a4d6c09e52550c8ba92b3b4b34271
With the upcoming versioning changes `BeginObjectExactVersion` makes
only sense for testing. Currently this does not rename the options
struct or move it into `metabasetest`, because it would create a
significant amount of merge/rebase noise.
Change-Id: Iafa2f81a05ae66320bc6a839828217ec94c63e1f
This patch finishes the placement aware repair.
We already introduced the parameters to select only the jobs for specific placements, the remaining part is just to configure the exclude/include rules. + a full e2e unit test.
Change-Id: I223ba84e8ab7481a53e5a444596c7a5ae51573c5
Currently, graceful exit is a complicated subsystem that keeps a queue
of all pieces expected to be on a node, and asks the node to transfer
those pieces to other nodes one by one. The complexity of the system
has, unfortunately, led to numerous bugs and unexpected behaviors.
We have decided to remove this entire subsystem and restructure graceful
exit as follows:
* Nodes will signal their intent to exit gracefully
* The satellite will not send any new pieces to gracefully exiting nodes
* Pieces on gracefully exiting nodes will be considered by the repair
subsystem as "retrievable but unhealthy". They will be repaired off of
the exiting node as needed.
* After one month (with an appropriately high online score), the node
will be considered exited, and held amounts for the node will be
released. The repair worker will continue to fetch pieces from the
node as long as the node stays online.
* If, at the end of the month, a node's online score is below a certain
threshold, its graceful exit will fail.
Refs: https://github.com/storj/storj/issues/6042
Change-Id: I52d4e07a4198e9cb2adf5e6cee2cb64d6f9f426b
The repair checker and repair worker both need to determine which pieces
are healthy, which are retrievable, and which should be replaced, but
they have been doing it in different ways in different code, which has
been the cause of bugs. The same term could have very similar but subtly
different meanings between the two, causing much confusion.
With this change, the piece- and node-classification logic is
consolidated into one place within the satellite/repair package, so that
both subsystems can use it. This ought to make decision-making code more
concise and more readable.
The consolidated classification logic has been expanded to create more
sets, so that the decision-making code does not need to do as much
precalculation. It should now be clearer in comments and code that a
piece can belong to multiple sets arbitrarily (except where the
definition of the sets makes this logically impossible), and what the
precise meaning of each set is. These sets include Missing, Suspended,
Clumped, OutOfPlacement, InExcludedCountry, ForcingRepair,
UnhealthyRetrievable, Unhealthy, Retrievable, and Healthy.
Some other side effects of this change:
* CreatePutRepairOrderLimits no longer needs to special-case excluded
countries; it can just create as many order limits as requested (by
way of len(newNodes)).
* The repair checker will now queue a segment for repair when there are
any pieces out of placement. The code calls this "forcing a repair".
* The checker.ReliabilityCache is now accessed by way of a GetNodes()
function similar to the one on the overlay. The classification methods
like MissingPieces(), OutOfPlacementPieces(), and
PiecesNodesLastNetsInOrder() are removed in favor of the
classification logic in satellite/repair/classification.go. This
means the reliability cache no longer needs access to the placement
rules or excluded countries list.
Change-Id: I105109fb94ee126952f07d747c6e11131164fadb
When we do `satellite run api --placement '...'`, the placement rules are not parsed well.
The problem is based on `viper.AllSettings()`, and the main logic is sg. like this (from a new unit test):
```
r := ConfigurablePlacementRule{}
err := r.Set(p)
require.NoError(t, err)
serialized := r.String()
r2 := ConfigurablePlacementRule{}
err = r2.Set(serialized)
require.NoError(t, err)
require.Equal(t, p, r2.String())
```
All settings evaluates the placement rules in `ConfigurablePlacementRules` and stores the string representation.
The problem is that we don't have proper `String()` implementation (it prints out the structs instead of the original definition.
There are two main solutions for this problem:
1. We can fix the `String()`. When we parse a placement rule, the `String()` method should print out the original definition
2. We can switch to use pure string as configuration parameter, and parse the rules only when required.
I feel that 1 is error prone, we can do it (and in this patch I added a lot of `String()` implementations, but it's hard to be sure that our `String()` logic is inline with the parsing logic.
Therefore I decided to make the configuration value of the placements a string (or a wrapper around string).
That's the main reason why this patch seems to be big, as I updated all the usages.
But the main part is in beginning of the `placement.go` (configuration parsing is not a pflag.Value implementation any more, but a separated step).
And `filter.go`, (a few more String implementation for filters.
https://github.com/storj/storj/issues/6248
Change-Id: I47c762d3514342b76a2e85683b1c891502a0756a
Currently each testplanet test is running ranged loop no matter if
it's used or not. This is small change with some benefits like:
* saves some cpu cycles
* less log entries
* ranged loop won't interfere with other systems
Change have no big impact on tests execration but I believe it's nice to
have.
Change-Id: I731846bf625cac47ed4f3ca3bc1d1a4659bdcce8
as GetParticipatingNodes and GetNodes, respectively.
We now want these functions to include offline and suspended nodes as
well, so that we can force immediate repair when pieces are out of
placement or in excluded countries. With that change, the old names no
longer made sense.
Change-Id: Icbcbad43dbde0ca8cbc80a4d17a896bb89b078b7
As I learned, the `Include` supposed to communicate that some internal change also "included" to the filters during the check -> filters might be stateful.
But it's not the case any more after 552242387, where we removed the only one stateful filter.
Change-Id: I7c36ddadb2defbfa3b6b67bcc115e4427ba9e083
This patch is a oneliner: rangedloop checker should check the subnets only if it's not turned off with placement annotation.
(see in satellite/repair/checker/observer.go).
But I didn't find any unit test to cover that part, so I had to write one, and I prefered to write it as a unit test not an integration test, which requires a mock repair queue (observer_unit_test.go mock.go).
Because it's small change, I also included a small change: creating a elper method to check if AutoExcludeSubnet annotation is defined
Change-Id: I2666b937074ab57f603b356408ef108cd55bd6fd
When we check the availability of the pieces, we do:
```
result.NumUnhealthyRetrievable = len(result.ClumpedPiecesSet) + len(result.OutOfPlacementPiecesSet)
// + some magic if there are overlaps between them
numHealthy := len(pieces) - len(piecesCheck.MissingPiecesSet) - piecesCheck.NumUnhealthyRetrievable
```
This works only if OutOfPlacementPieceSet doesn't contain the offline nodes (which are already included in MissingPieceSet).
But `result.OutOfPlacementPieces.Set` should include all the nodes (even offline), as in case of lucky conditions, we are able to remove those pieces from DB.
The solution is to remove all offline nodes from `NumUnhealthyRetrievable`.
Change-Id: I90baa0396352dd040e1e1516314b3271f8712034
This patch fixes the node tag based placement of rangedloop/repairchecker + repair process.
The main change is just adding the node tags for Reliable and KnownReliabel database calls + adding new tests to prove, it works.
https://github.com/storj/storj/issues/6126
Change-Id: I245d654a18c1d61b2c72df49afa0718d0de76da1
There are cases when we would like to override the default placement=0 rule.
For example when we would like to exclude tagged nodes from the selection (by default).
Therefore we couldn't use a shortcut any more, we should always check the placement rules, even if we use placement=0.
TODO: we need to update common, and rename `EveryCountry` to `DefaultPlacement`, just to avoid confusion.
https://github.com/storj/storj/issues/6126
Change-Id: Iba6c655bd623e04351ea7ff91fd741785dc193e4
This feature flag was disabled by default to test it slowly. Its enabled
for some time on one production satellite and test satellites without
any issue. We can enable it by default in code.
Change-Id: If9c36895bbbea12bd4aefa30cb4df912e1729e4c
Sometimes DownloadSelectionCache doesn't keep up with all node
placement changes we are doing during this test.
Change-Id: Idbda6511e3324b560cee3be85f980bf8d5b9b7ef
Currently, we have issue were while counting unhealthy pieces we are
counting twice piece which is in excluded country and is outside segment
placement. This can cause unnecessary repair.
This change is also doing another step to move RepairExcludedCountryCodes
from overlay config into repair package.
Change-Id: I3692f6e0ddb9982af925db42be23d644aec1963f
placement.AllowedCountry is the old way to specify placement, with the new approach we can use a more generic (dynamic method), which can check full node information instead of just the country code.
The 90% of this patch is just search and replace:
* we need to use NodeFilters instead of placement.AllowedCountry
* which means, we need an initialized PlacementRules available everywhere
* which means we need to configure the placement rules
The remaining 10% is the placement.go, where we introduced a new type of configuration (lightweight expression language) to define any kind of placement without code change.
Change-Id: Ie644b0b1840871b0e6bbcf80c6b50a947503d7df
All the files in uploadselection are (in fact) related to generic node selection, and used not only for upload,
but for download, repair, etc...
Change-Id: Ie4098318a6f8f0bbf672d432761e87047d3762ab
ReliabilityCache will be now using refactored overlay Reliable method.
This method will provide more info about nodes (e.g. country code) and
with this we are able to add two dedicated methods to classify pieces:
* OutOfPlacementPieces
* PiecesNodesLastNetsInOrder
With those new method we will fix issue where offline but reliable node
won't be checked for clumped pieces and off placement pieces.
https://github.com/storj/storj/issues/5998
Change-Id: I9ffbed9f07f4881c9db3bd0e5f0412f1a418dd82
Currently we are using Reliable to get missing pieces for repair
checker. The issue is that now checker is looking at more things than
just missing pieces (clumped/off, placement pieces) and using only node
ID is not enough. We have issue where we are skipping offline nodes from
clumped and off placement pieces check.
Reliable was refactored to get data (e.g. country, lastNet) about all
reliable nodes. List is split into online and offline. This data will be
cached for quick use by repair checker. It will be also possible to
check nodes metadata like country code or lastNet.
We are also slowly moving `RepairExcludedCountryCodes` config from
overlay to repair which makes more sens for it.
This this first part of changes.
https://github.com/storj/storj/issues/5998
Change-Id: If534342488c0e440affc2894a8fbda6507b8959d
We use two different Node types in `overlay` and `uploadnodeselection` and converting back and forth.
Using the same object would allow us to use a unified node selection interface everywhere.
Change-Id: Ie71e29d60184ee0e5b4547eb54325f09c418f73c
At the moment segment repairer is skipping offline nodes in checks like
clumped pieces and off placement pieces. This change is fixing this
problem using new version of KnownReliable method. New method is
returning both online and offline nodes. Provided data can be used to
find clumped and off placement pieces.
We are not using DownloadSelectionCache anymore with segment repairer.
https://github.com/storj/storj/issues/5998
Change-Id: I236a1926e21f13df4cdedc91130352d37ff97e18
When pieces fail an audit (hard fail, meaning the node acknowledged it
did not have the piece or the piece was corrupted), we will now remove
those pieces from the segment.
Previously, we did not do this, and some node operators were seeing the
same missing piece audited over and over again and losing reputation
every time.
This change will include both verification and reverification audits. It
will also apply to pieces found to be bad during repair, if
repair-to-reputation reporting is enabled.
Change-Id: I0ca7af7e3fecdc0aebbd34fee4be3a0eab53f4f7
Additional test case to cover situation where we are trying to
repair segment with specific placement set. We need to be sure
that segment won't be repaired into nodes that are outside
segment placement, even if that means that repair will fail.
Change-Id: I99d238aa9d9b9606eaf89cd1cf587a2585faee91
This change makes dial timeout configurable and change it also from
defatul 20s to 5s. Main motivation is that during repair we often loose
lots of time to dial which eventually will fail. New timeout should be
still enough to dial but we will move forward quicker to next node if
that one will fail.
Timeout is also applied directly as context timeout in case we will
use noise of tcp fast open one day.
Change-Id: I021bf459af49b11241e314fa1a7887c81d5214ea
We missed to set placement as a part of selection request. It can case
uploading repaired data out of specified placement.
I will provide test as a separate change.
Change-Id: I4efe67f2d5f545a1d70e831e5d297f0977a4eed1
We were reusing a slice to save on allocations, but it turns out the
function using it was being called in multiple goroutines at the same
time.
This is definitely a problem with repairer/segments.go. I'm not 100%
sure if it also is a problem with checker/observer.go, but I'm making
the change there as well to be on the safe side for now.
Repair workers only ran with this bug on testing satellites, and it
looks like the worst that could have happened was that we repaired
pieces off of well-behaved, non-clumped, in-placement nodes by mistake.
Change-Id: I33c112b05941b63d066caab6a34a543840c6b85d
Checker when qualifying segment for repair is now looking at pieces
location and if they are outisde segment placement puts them into
repair queue.
Fixes https://github.com/storj/storj/issues/5895
Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7
Segment repairer should take into account segment 'placement' field
and remove or repair pieces from nodes that are outside this placement.
In case when after considering pieces out of placement we are still above
repair threshold we are only updating segment pieces to remove
problematic pieces. Otherwise we are doing regular repair.
https://github.com/storj/storj/issues/5896
Change-Id: I72b652aff2e6b20be3ac6dbfb1d32c2840ce3d59
rather than only logging the last_nets we see in clumpedPieces, this
will run through all the last_nets and log any that have more than one
node. This should have the same outcome, except the counts will be 1
higher (because FindClumpedPieces won't include the first node found in
a clumped network, and this will).
This should be quite a bit faster.
Change-Id: I6a7b2fd387e98963d5295c9ecfde80f2e1ee3b7a
We were using the UploadSelectionCache previously, which does _not_ have
all nodes, or even all online nodes, in it. So all nodes with less than
MinimumVersion, or with less than MinimumDiskSpace, or nodes suspended
for unknown audit errors, or nodes that have started graceful exit, were
all missing, and ended up having empty last_nets. Even with all that,
I'm kind of surprised how many nodes this involved, but using the upload
selection cache was definitely wrong.
This change uses the download selection cache instead, which excludes
nodes only when they are disqualified, gracefully exited (completely),
or offline.
Change-Id: Iaa07c988aa29c1eb05796ac48a6f19d69f5826c1
It seems that the "what pieces are clumped" code does not work right, so
this logic is causing repair overload or other repair failures.
Hide it behind a flag while we figure out what is going on, so that
repair can still work in the meantime.
Change-Id: If83ef7895cba870353a67ab13573193d92fff80b
Clumped segments (segments with multiple pieces on the same subnet) may
need repair, but the clumped pieces are considered retrievable and we
don't need to call such segments irreparable.
We do want to know where they're coming from, though, if we can, because
we are seeing more than expected.
Change-Id: I41863b243f4bb007ef8929191a3fde1562565ef9
The query for GetNodesNetworkInOrder is causing far too much load on the
database. Since it is not critical that the repair checker have
perfectly up-to-date node network information, we can use a cache
instead.
Change-Id: I07ad45bfdeb46529da093941a06c2da8a00ce878