We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.
This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).
Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
This commit pulls the big switch! We have been setting up piecewise
reverifications (the workers for which can be scaled independently of
the core) for several commits now, and this commit actually begins
making use of them.
The core of this commit is fairly small, but it requires changing the
semantics in all the tests that relate to reverifications, so it ends up
being a large change. The changes to the tests are mostly mechanical and
repetitive, though, so reviewers needn't worry much.
Refs: https://github.com/storj/storj/issues/5230
Change-Id: Ibb421cc021664fd6e0096ffdf5b402a69b2d6f18
We have an alert on `not_enough_shares_for_audit` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.
Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the audit framework to act in that way.
Instead, in this change, we add three new metrics,
`audit_not_enough_nodes_online`, `audit_not_enough_shares_acquired`, and
`audit_suspected_network_problem`. When an audit fails, and emits
`not_enough_shares_for_audit`, we will now determine whether it looks
like we are having network problems (most errors are connection
failures, possibly also some successful connections which subsequently
time out) or whether something else has happened.
After this is deployed, we can remove the alert on
`not_enough_shares_for_audit` and add new alerts on
`audit_not_enough_nodes_online` and `audit_not_enough_shares_acquired`.
`audit_suspected_network_problem` does not need an alert.
Refs: https://github.com/storj/storj/issues/4669
Change-Id: Ibb256bc19d2578904f71f5229111ac98e5212fcb
We have an alert on `repair_too_many_nodes_failed` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.
Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the repair framework to act in that way.
Instead, in this change, we change the way that
`repair_too_many_nodes_failed` works. When a repair fails, we collect
piece fetch errors by type and determine from them whether it looks like
we are having network problems (most errors are connection failures,
possibly also some successful connections which subsequently time out)
or whether something else has happened.
We will now only emit `repair_too_many_nodes_failed` when the outcome
does not look like a network failure. In the network failure case, we
will instead emit `repair_suspected_network_problem`.
Refs: https://github.com/storj/storj/issues/4669
Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926
Implemented Recaptcha and Hcaptcha for login screen.
Slightly refactored registration page implementation.
Made 2 different login/registration captcha configs on server side to easily swap between captchas independently.
Issue: https://github.com/storj/storj/issues/4982
Change-Id: I362bd5db2d59010e90a22301893bc3e1d860293a
If we encounter an error during the infectious error correction, we just
add it to the errlist to be logged at the worker level.
We want to make sure we know about this if it happens. Give it its own
error log and increment a monkit metric.
Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27
We made decision to avoid satellite shutdown when segment loop
will return error. Loop still can reeturn error but it will be logged
and we will make monitoring/alert around that error.
Change-Id: I6aa8e284406edf644a09d6b1fe00c3155c5430c9
Bucket tally calculation will be removed from metaloop and will
use metabase objects iterator directly.
At the moment only bucket tally needs objects so it make no sense
to implement separate objects loop.
Change-Id: Iee60059fc8b9a1bf64d01cafe9659b69b0e27eb1
We want to move some of current metainfo loop observers to
segment loop. This change adds new service, similar to metainfo
loop but which is iterating only over segments.
Change-Id: I67f7f461781723a4476e2b83377f31736d7c4870
Currently the loop handling is heavily related to the metabase rather
than metainfo.
metainfo over time has become related to the "public API" for accessing
the metabase data.
Currently updates monkit.lock, because monkit monitoring does not handle
ScopeNamed correctly. Needs a followup change to monitoring check.
Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584
we want to know a lot more about what's going on during
the operation of the metainfo loop. this patchset adds
more instrumentation to previously unmonitored but
interesting functions, and adds metrics that keep track
of how far through a specific loop we are. it also
adds mon:lock annotations, especially to the metainfo
loop run task, which recently changed, silently broke
some queries, and thus failed to alert us to spiking
run time issues.
Change-Id: I4358e2f2293d8ebe30eef497ba4e423ece929041
It's impossible to time correctly this check. The segment may expire
just at the time we upload the repaired pieces to new storage nodes.
They will reject this as expired and the repair will fail.
Also, we penalize storage nodes with audit failure only if they fail
piece hash verification, i.e. return incorrect data, but only if they
have already deleted the piece.
So, it would be best if the repair service does not care about object
expiration at all. This is a responsibility of another service.
Removing this check will also simplify how we migrate this code
correctly to the metabase.
Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2
When we observed the value for total piecesizes stored in the network,
we were doing it after converting them to byte-hours, rather than using
the actual piece sizes. This fixes that issue.
Change-Id: I1564d21b519f70eb59f298d97dbd777baf127723
It was designed to detect and remove zombie segments in the PointerDB.
This tool should be not relevant with the MetabaseDB anymore.
Change-Id: I112552203b1329a5a659f69a0043eb1f8dadb551
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.
This change just adds the new column in a migration. A separate change
will add the new health function.
Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.
Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.
Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.
Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.
solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.
Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
We are adding a monkit evaluation for the total sum of data stored on
the nodes before it is inserted into the database. This will give us a
time-series history of total data stored so we can see it change over
time.
Change-Id: I41145a2d7a09c8e63b42ae578bd081035b60e529
This will give storagenode operators a better idea of whether the memory
allocated to the usedserials store is sufficient.
Change-Id: I5c30f2e39473a573f43409511ad9e2e32680479c
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration
Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue
Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.
Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).
Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.
This change also better differentiates some error cases from Repair()
for monitoring purposes.
Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
On satellite, remove all references to free_bandwidth column in nodes table.
On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated.
Protobuf message, NodeCapacity, is left intact for backwards compatibility.
Once this is released to all satellites, we can drop the column from the DB.
Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838
Add monkit metric for the rate-limit when the rate limit is hit
Logs warning with projectID
https://storjlabs.atlassian.net/browse/SM-165
Change-Id: I352dc40006021990d1bc66a999f62bbf8deb54db
* skip unknown errors (wip)
* add tests to make sure nodes that time out are added to containment
* add bad blobs store
* call "Skipped" "Unknown"
* add tests to ensure unknown errors do not trigger containment
* add monkit stats to lockfile
* typo
* add periods to end of bad blobs comments