Recently we applied this optimization to metrics observer and time
used by its method dropped from 12m to 3m for us1 (220m segments).
It looks that it make sense to apply the same code to all observers.
Change-Id: I05898aaacbd9bcdf21babc7be9955da1db57bdf2
Implement a buffer for inserting repair items into the queue in a batch.
Part of https://github.com/storj/storj/issues/4727
Change-Id: I718472b2f2b1f4993c3d6f15c44923776407155a
TestSegmentInExcludedCountriesRepair and TestSegmentInExcludedCountriesRepairIrreparable are using 20 storage nodes.
This change make them use 7 by adjusting the test redundancy scheme.
Change-Id: I1a44aa8b997d6edcc9a3305fdd0dac57e4d525b5
To save load on DNS servers, the repair code first tries to dial the
last known good ip and port for a node, and then falls back to a DNS
lookup only if we fail to connect to the last known good ip and port.
However, it looks like we are seeing errors during the client stream
Close() call (probably due to quic-go code), and those are classified
the same as errors encountered during Dial. The repairer code sees this
error, assumes that we failed to contact the node, and retries- but
since we did actually succeed in connecting the first time around, this
results in submitting the same order limit (with the same serial number)
to the storage node, which (rightfully) rejects it.
So together with change I055c186d5fd4e79560f67763175bc3130b9bc7d2 in
storj/uplink, this should avoid the double submission and avoid dinging
nodes' suspension scores unfairly.
See https://github.com/storj/storj/issues/4687.
Also, moving the testsuite directory check up above check-monkit in the
Jenkins Lint task, so that a non-tidy testsuite/go.mod can be recognized
and handled before everything breaks weirdly and seemingly randomly
later on.
Change-Id: Icb2b05aaff921d0af6aba10e450ac7e0a7bb2655
We don't need to have every single test for both, only one for
each should be sufficient. For all other tests it doesn't matter
which one we use.
Change-Id: I9962206a4ee025d367332c29ea3e6bc9f0f9a1de
For nodes in excluded areas, we don't necessarily want to remove them
from the pointer, but we do want to increase the number of pieces in the
segment in case those excluded area nodes go down. To do that, we
increase the number of pieces repaired by the number of pieces in
excluded areas.
Change-Id: I0424f1bcd7e93f33eb3eeeec79dbada3b3ea1f3a
Add a RepairExcludedCountryCodes config flag for overlay for providing a list of country codes to exclude nodes from target repair selection.
Mark segments with less than repairThreshold pieces in countries not in the RepairExcludedCountryCodes as not healthy.
With this change, the repair process is not affected. The segment will be removed from the repair queue by the repairer.
Another change will handle the logic at the repairer level.
Fixes https://github.com/storj/team-metainfo/issues/95
Change-Id: I9231b32de117a116488de055a3e94efcabb46e81
The "satellite fetch-pieces" command allows a satellite operator to
fetch as many pieces of a segment as possible, along with their
original order limits and hashes as provided by the storage nodes. The
fetched pieces and associated info will be stored on in a specified
folder as they are, rather than being RS-decoded or decrypted.
It is hoped that this will allow easier debugging of certain one-off
problems we've observed in the wild.
Change-Id: I42ae0e9ef0023538e42473a9be5a2460a3ac0f3a
inconsistency
The original design had a flaw which can potentially cause discrepancy
for nodes reputation status between reputations table and nodes table.
In the event of a failure(network issue, db failure, satellite failure, etc.)
happens between update to reputations table and update to nodes table, data
can be out of sync.
This PR tries to fix above issue by passing through node's reputation from
the beginning of an audit/repair(this data is from nodes table) to the next
update in reputation service. If the updated reputation status from the service
is different from the existing node status, the service will try to update nodes
table. In the case of a failure, the service will be able to try update nodes
table again since it can see the discrepancy of the data. This will allow
both tables to be in-sync eventually.
Change-Id: Ic22130b4503a594b7177237b18f7e68305c2f122
If satellite can't find enough nodes to successfully download a segment,
it probably is not the fault of storage nodes.
Change-Id: I681f66056df0bb940da9edb3a7dbb3658c0a56cb
work
Since we have changed the repair worker to also mark a node as audit
failure if they return a not found error, we should ignore expired
segments when possible
Change-Id: Ie6a677e1d7b234e93965c736d05950440236653c
This change introduce problems with server side move so
let's revert it for now. Problem was found when latest
version of storj/storj was used in uplink tests.
This reverts commit 1ef06fae99.
Change-Id: I4d4fad5d1ea04ba15ff9d7bd765f7e078e9187c2
We were using mixed types for nonce fields. Protobuf
have storj.Nonce, metabase have []byte. This change
is a refactoring to have everywere its possible only
storj.Nonce.
Change-Id: Id54bd8481f30c721cdaf3df79206d25e7cfdab55
Update repair tests to check if audit score increases for nodes
that successfully send pieces during successfull and failed repairs.
Change-Id: Ie6abbde6155ab4697d209366c9fa497e731756e9
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.
Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
When we can't complete an audit or repair, we need more information about
what happened during each individual share/piece download.
In audit, add the number of offline, unknown, contained, failed nodes to
the error log. In repair, combine the errors from each download and add
them to the error log.
Change-Id: Ic5d2a0f3f291f26cb82662bfb37355dd2b5c89ba
This change adds a NOT NULL constraint to the created_at column in the segment table.
All occurrences of CreatedAt as a pointer are changed to non pointer version (metabase, segment loop, etc)
Change-Id: I3efd476ebd1edd3327b69c9223d9edc800e1cc52
This change adds dedicated methods on metabase.Pieces to be able to add, remove pieces and also to check duplicates.
Change-Id: I21aaeff40c017c2ebe1cc85a864ae546754769cc
Error from joining loop should not restart satellite. This will be the
same error like for loop itself. In the same way we are handling joining
error for other services that are using segment loop.
Change-Id: Idf1035ef7f78462927bd23989ed8a4ee5826c49e
Sometimes we see timeouts from DNS lookups when trying to do
repair GETs. Solution: try using node's last IP and port first.
If we can't connect, retry with DNS lookup.
Change-Id: I59e223aebb436118779fb18378f6e09d072f12be
We want to use StreamID/Position to identify injured
segment. As it is hard to alter existing injuredsegments
table we are adding a new table that will replace existing
one. Old table will be dropped later.
Change-Id: I0d3b06522645013178b6678c19378ebafe485c49
This is part of metaloop refactoring. We plan to remove
irreparable at some point but there was not time for it.
Now instead refatoring it for segmentloop its just easier
to drop it.
Later we still need to drop table with migration step.
Change-Id: I270e77f119273d39a1ecdcf5e1c37a5662a29ab4
Satellites set their configuration values to default values using
cfgstruct, however, it turns out our tests don't test these values
at all! Instead, they have a completely separate definition system
that is easy to forget about.
As is to be expected, these values have drifted, and it appears
in a few cases test planet is testing unreasonable values that we
won't see in production, or perhaps worse, features enabled in
production were missed and weren't enabled in testplanet.
This change makes it so all values are configured the same,
systematic way, so it's easy to see when test values are different
than dev values or release values, and it's less hard to forget
to enable features in testplanet.
In terms of reviewing, this change should be actually fairly
easy to review, considering private/testplanet/satellite.go keeps
the current config system and the new one and confirms that they
result in identical configurations, so you can be certain that
nothing was missed and the config is all correct.
You can also check the config lock to see what actual config
values changed.
Change-Id: I6715d0794887f577e21742afcf56fd2b9d12170e
Piece hash verification failures during repair download are considered
audit failures, but we are not logging these occurrences. Now we log
them.
Change-Id: If456cebcfda6af7a659be3d1fc74448e681fb653
Currently the interface is not useful. When we need to vary the
implementation for testing purposes we can introduce a local interface
for the service/chore that needs it, rather than using the large api.
Unfortunately, this requires adding a cleanup callback for tests, there
might be a better solution to this problem.
Change-Id: I079fe4dbe297b0ae08c10081a1cea4dfbc277682
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.
Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
Currently the loop handling is heavily related to the metabase rather
than metainfo.
metainfo over time has become related to the "public API" for accessing
the metabase data.
Currently updates monkit.lock, because monkit monitoring does not handle
ScopeNamed correctly. Needs a followup change to monitoring check.
Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
Check that the bloom filter creation date is earlier than the
metainfo loop system time used for db scanning.
Change-Id: Ib0f47c124f5651deae0fd7e7996abcdcaac98fb4
Repair checker expects to have information about CreatedAt and RepairedAt fields to calculate segment age metric.
Change-Id: I6b41df880d77133be541e14d10d91cc75759b339
At some point we might try to change original segment RS values and set Pieces according to the new values. This change adds add NewRedundancy parameter for UpdateSegmentPieces method to give ability to do that. As a part of change NewPieces are validated against NewRedundancy.
Change-Id: I8ea531c9060b5cd283d3bf4f6e4c320099dd5576
We have multipart objects so we may get multiple inline segments
sequences or no segments at all for objects.
Change-Id: Ie46ee777a2db8f18f7154e3443bb9e07ecb170f7
It's impossible to time correctly this check. The segment may expire
just at the time we upload the repaired pieces to new storage nodes.
They will reject this as expired and the repair will fail.
Also, we penalize storage nodes with audit failure only if they fail
piece hash verification, i.e. return incorrect data, but only if they
have already deleted the piece.
So, it would be best if the repair service does not care about object
expiration at all. This is a responsibility of another service.
Removing this check will also simplify how we migrate this code
correctly to the metabase.
Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2
Do not insert the number of healthy pieces for segment health anymore.
Rather, insert the segment health calculated by our new priority
function.
Change-Id: Ieee7fb2deee89f4d79ae85bac7f577befa2a0c7f
Query nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads has retry logic so it isn't necessary for the nodes returned to be current.
The chief segment health models we've come up with are the "immediate
danger" model and the "survivability" model. The former calculates the
chance of losing a segment becoming lost in the next time period (using
the CDF of the binomial distribution to estimate the chance of x nodes
failing in that period), while the latter estimates the number of
iterations for which a segment can be expected to survive (using the
mean of the negative binomial distribution). The immediate danger model
was a promising one for comparing segment health across segments with
different RS parameters, as it is more precisely what we want to
prevent, but it turns out that practically all segments in production
have infinite health, as the chance of losing segments with any
reasonable estimate of node failure rate is smaller than DBL_EPSILON,
the smallest possible difference from 1.0 representable in a float64
(about 1e-16).
Leaving aside the wisdom of worrying about the repair of segments that
have less than a 1e-16 chance of being lost, we want to be extremely
conservative and proactive in our repair efforts, and the health of the
segments we have been repairing thus far also evaluates to infinity
under the immediate danger model. Thus, we find ourselves reaching for
an alternative.
Dr. Ben saves the day: the survivability model is a reasonably close
approximation of the immediate danger model, and even better, it is
far simpler to calculate and yields manageable values for real-world
segments. The downside to it is that it requires as input an estimate
of the total number of active nodes.
This change replaces the segment health calculation to use the
survivability model, and reinstates the call to SegmentHealth() where it
was reverted. It gets estimates for the total number of active nodes by
leveraging the reliability cache.
Change-Id: Ia5d9b9031b9f6cf0fa7b9005a7011609415527dc
A few weeks ago it was discovered that the segment health function
was not working as expected with production values. As a bandaid,
we decided to insert the number of healthy pieces into the segment
health column. This should have effectively reverted our means of
prioritizing repair to the previous implementation.
However, it turns out that the bandaid was placed into the code which
removes items from the irreparable db and inserts them into the repair
queue.
This change: insert number of healthy pieces into the repair queue in the
method, RemoteSegment
Change-Id: Iabfc7984df0a928066b69e9aecb6f615253f1ad2
There is a new checker field called statsCollector. This contains
a map of stats pointers where the key is a stringified redundancy
scheme. stats contains all tagged monkit metrics. These metrics exist
under the key name, "tagged_repair_stats", which is tagged with the
name of each metric and a corresponding rs scheme.
As the metainfo observer works on a segment, it checks statsCollector
for a stats corresponding to the segment's redundancy scheme. If one
doesn't exist, it is created and chained to the monkit scope. Now we can call
Observe, Inc, etc on the fields just like before, and they have tags!
durabilityStats has also been renamed to aggregateStats.
At the end of the metainfo loop, we insert the aggregateStats totals into the
corresponding stats fields for metric reporting.
Change-Id: I8aa1918351d246a8ef818b9712ed4cb39d1ea9c6
We migrated satelliteDB off of Postgres and over to CockroachDB (crdb), but there was way too high contention for the injuredsegments table so we had to rollback to Postgres for the repair queue. A couple things contributed to this problem:
1) crdb doesn't support `FOR UPDATE SKIP LOCKED`
2) the original crdb Select query was doing 2 full table scans and not using any indexes
3) the SLC Satellite (where we were doing the migration) was running 48 repair worker processes, each of which run up to 5 goroutines which all are trying to select out of the repair queue and this was causing a ton of contention.
The changes in this PR should help to reduce that contention and improve performance on CRDB.
The changes include:
1) Use an update/set query instead of select/update to capitalize on the new `UPDATE` implicit row locking ability in CRDB.
- Details: As of CRDB v20.2.2, there is implicit row locking with update/set queries (contention reduction and performance gains are described in this blog post: https://www.cockroachlabs.com/blog/when-and-why-to-use-select-for-update-in-cockroachdb/).
2) Remove the `ORDER BY` clause since this was causing a full table scan and also prevented the use of the row locking capability.
- While long term it is very important to `ORDER BY segment_health`, the change here is only suppose to be a temporary bandaid to get us migrated over to CRDB quickly. Since segment_health has been set to infinity for some time now (re: https://review.dev.storj.io/c/storj/storj/+/3224), it seems like it might be ok to continue not making use of this for the short term. However, long term this needs to be fixed with a redesign of the repair workers, possible in the trusted delegated repair design (https://review.dev.storj.io/c/storj/storj/+/2602) or something similar to what is recommended here on how to implement a queue on CRDB https://dev.to/ajwerner/quick-and-easy-exactly-once-distributed-work-queues-using-serializable-transactions-jdp, or migrate to rabbit MQ priority queue or something similar..
This PRs improved query uses the index to avoid full scans and also locks the row its going to update and CRDB retries for us if there are any lock errors.
Change-Id: Id29faad2186627872fbeb0f31536c4f55f860f23
the immediate need is to be able to move the repair queue back out
of cockroach if we can't save it.
Change-Id: If26001a4e6804f6bb8713b4aee7e4fd6254dc326
We did not test the SegmentHealth function with actual production
values, and it turns out that values such as 52 healthy, 35 minimum
result in +Inf segment health - so pretty much all segments put into the
repair queue have the same health, which means we effectively aren't
sorting by health.
This change inserts numHealthy as segment health into the database so
the segments are ordered as they were before. We need to refine the
SegmentHealth function before we can support multi RS.
Change-Id: Ief19bbfee3594c5dfe94ca606bc930f05f85ff74
Rather than having a single repair override value, we will now support
repair override values based on a particular segment's RS scheme.
The new format for RS override values is
"k/o/n-override,k/o/n-override..."
Change-Id: Ieb422638446ef3a9357d59b2d279ee941367604d
Firstly, this changes the repair functionality to return Canceled errors
when a repair is canceled during the Get phase. Previously, because we
do not track individual errors per piece, this would just show up as a
failure to download enough pieces to repair the segment, which would
cause the segment to be added to the IrreparableDB, which is entirely
unhelpful.
Then, ignore Canceled errors in the return value of the repair worker.
Apparently, when the worker returns an error, that makes Cobra exit the
program with a nonzero exit code, which causes some piece of our
deployment automation to freak out and page people. And when we ask the
repair worker to shut down, "canceled" errors are what we _expect_, not
an error case.
Change-Id: Ia3eb1c60a8d6ec5d09e7cef55dea523be28e8435
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.
Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.
RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).
Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1