Commit Graph

56 Commits

Author SHA1 Message Date
Michal Niewrzal
2b2bca8e81 satellite/accounting/tally: save tallies in a batches
Because we are saving all tallies as a single SQL statement we finally
reached maximum message size. With this change we will call SaveTallies multiple times in batches.

https://github.com/storj/storj/issues/5977

Change-Id: I0c7dd27779b1743ede66448fb891e65c361aa3b0
2023-06-22 17:02:26 +00:00
Michal Niewrzal
2e0b687581 satellite/metainfo: drop piecedeletion package
We are not using this package anymore.

Change-Id: If32315d43d73c8deb096e93cb43c03881bd9aad1
2023-06-14 11:19:21 +00:00
Michal Niewrzal
337eb9be6a satellite/repair/checker: put into queue segment off placement
Checker when qualifying segment for repair is now looking at pieces
location and if they are outisde segment placement puts them into
repair queue.

Fixes https://github.com/storj/storj/issues/5895

Change-Id: If0d941b30ad94c5ef02fb1a03c7f3d04a2df25c7
2023-06-05 15:53:49 +00:00
Michal Niewrzal
c0e7f463fe satellite/metabase: remove segmentsloop package
Last change to remove segments loop from codebase.

https://github.com/storj/storj/issues/5237

Change-Id: I77b12911b6b4e390a7385e6e8057c7587e74b70a
2023-05-18 19:08:29 +00:00
paul cannon
915f3952af satellite/repair: repair pieces on the same last_net
We avoid putting more than one piece of a segment on the same /24
network (or /64 for ipv6). However, it is possible for multiple pieces
of the same segment to move to the same network over time. Nodes can
change addresses, or segments could be uploaded with dev settings, etc.
We will call such pieces "clumped", as they are clumped into the same
net, and are much more likely to be lost or preserved together.

This change teaches the repair checker to recognize segments which have
clumped pieces, and put them in the repair queue. It also teaches the
repair worker to repair such segments (treating clumped pieces as
"retrievable but unhealthy"; i.e., they will be replaced on new nodes if
possible).

Refs: https://github.com/storj/storj/issues/5391
Change-Id: Iaa9e339fee8f80f4ad39895438e9f18606338908
2023-04-06 17:34:25 +00:00
paul cannon
a66503b444 satellite/audit: Begin using piecewise reverifications
This commit pulls the big switch! We have been setting up piecewise
reverifications (the workers for which can be scaled independently of
the core) for several commits now, and this commit actually begins
making use of them.

The core of this commit is fairly small, but it requires changing the
semantics in all the tests that relate to reverifications, so it ends up
being a large change. The changes to the tests are mostly mechanical and
repetitive, though, so reviewers needn't worry much.

Refs: https://github.com/storj/storj/issues/5230
Change-Id: Ibb421cc021664fd6e0096ffdf5b402a69b2d6f18
2022-12-16 14:21:13 +00:00
Erik van Velzen
b574ee5e6d satellite/metabase/rangedloop: service skeleton
Create skeleton for multi-threaded segment loop, peer, cmd command for rangedloop.

Change-Id: I52c78a313f15070d43207c52ea94e53169821654
2022-11-22 15:21:41 +02:00
paul cannon
802ff18bd8 satellite/audit: better handling of piece fetch errors
We have an alert on `not_enough_shares_for_audit` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the audit framework to act in that way.

Instead, in this change, we add three new metrics,
`audit_not_enough_nodes_online`, `audit_not_enough_shares_acquired`, and
`audit_suspected_network_problem`. When an audit fails, and emits
`not_enough_shares_for_audit`, we will now determine whether it looks
like we are having network problems (most errors are connection
failures, possibly also some successful connections which subsequently
time out) or whether something else has happened.

After this is deployed, we can remove the alert on
`not_enough_shares_for_audit` and add new alerts on
`audit_not_enough_nodes_online` and `audit_not_enough_shares_acquired`.
`audit_suspected_network_problem` does not need an alert.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: Ibb256bc19d2578904f71f5229111ac98e5212fcb
2022-09-28 17:02:06 +00:00
paul cannon
7f1cad6faf satellite/repair: better handling of piece fetch errors
We have an alert on `repair_too_many_nodes_failed` which fires too
frequently. Every time so far, it has been because of a network blip of
some nature on the satellite side.

Satellite operators are expected to have other means in place for
alerting on network problems and fixing them, so it's not necessary for
the repair framework to act in that way.

Instead, in this change, we change the way that
`repair_too_many_nodes_failed` works. When a repair fails, we collect
piece fetch errors by type and determine from them whether it looks like
we are having network problems (most errors are connection failures,
possibly also some successful connections which subsequently time out)
or whether something else has happened.

We will now only emit `repair_too_many_nodes_failed` when the outcome
does not look like a network failure. In the network failure case, we
will instead emit `repair_suspected_network_problem`.

Refs: https://github.com/storj/storj/issues/4669

Change-Id: I49df98da5df9c606b95ad08a2bdfec8092fba926
2022-09-23 09:35:06 +00:00
Vitalii
ad37ea4518 satellite/{web, console}: login captcha implemented
Implemented Recaptcha and Hcaptcha for login screen.
Slightly refactored registration page implementation.
Made 2 different login/registration captcha configs on server side to easily swap between captchas independently.

Issue: https://github.com/storj/storj/issues/4982

Change-Id: I362bd5db2d59010e90a22301893bc3e1d860293a
2022-08-03 23:02:27 +00:00
Cameron
55821605e8 satellite/console: add monkit metrics around user registraion/login
github issue: https://github.com/storj/storj/issues/4807

Change-Id: Id56ec73ec91b07b639b8011f0f916b4adbb01be6
2022-05-26 10:44:47 -04:00
Cameron Ayer
28cb690618 satellite/audit: log error and increment metric if shares cannot be verified
If we encounter an error during the infectious error correction, we just
add it to the errlist to be logged at the worker level.
We want to make sure we know about this if it happens. Give it its own
error log and increment a monkit metric.

Change-Id: Ie5946ae3cd97b766e3099af8ce160a686135ee27
2021-08-27 15:28:16 +00:00
Michał Niewrzał
011b944382 satellite/metrics: fix metrics for total inline/remote bytes and segments
Change-Id: I567ce127590a4712cab296d28a19838e3a632021
2021-07-30 18:19:51 +02:00
Michał Niewrzał
55d7bcc59b satellite/metabase/segmentloop: don't shutdown satellite on loop error
We made decision to avoid satellite shutdown when segment loop
will return error. Loop still can reeturn error but it will be logged
and we will make monitoring/alert around that error.

Change-Id: I6aa8e284406edf644a09d6b1fe00c3155c5430c9
2021-07-30 06:49:10 +00:00
Michał Niewrzał
b12d29935a satellite/metabase: remove metaloop package
We moved everything to segment loop so we can now
remove metaloop from code.

Change-Id: I9bd8d2349e5638d7cdad50f2f313f9bd89a8165c
2021-07-22 13:00:45 +00:00
Cameron Ayer
373ba8fd27 satellite/repair/repairer: metrics for repair bytes uploaded and downloaded
Change-Id: Icb0850692ecc155f6c8169edf1b045b2b546ff48
2021-07-21 09:23:19 +00:00
Michał Niewrzał
27a714e8b0 satellite/accounting/tally: use objects iterator instead metaloop
Bucket tally calculation will be removed from metaloop and will
use metabase objects iterator directly.

At the moment only bucket tally needs objects so it make no sense
to implement separate objects loop.

Change-Id: Iee60059fc8b9a1bf64d01cafe9659b69b0e27eb1
2021-07-20 15:52:18 +00:00
Egon Elbre
4031336cbd private/lifecycle: monitor unexpected shutdowns
Change-Id: I9af3a95a1b60c1572cd57bb5e9539d60aa0337bb
2021-06-25 19:26:23 +03:00
Michał Niewrzał
053e58b683 satellite/metabase: add segmentloop service
We want to move some of current metainfo loop observers to
segment loop. This change adds new service, similar to metainfo
loop but which is iterating only over segments.

Change-Id: I67f7f461781723a4476e2b83377f31736d7c4870
2021-06-01 11:15:07 +00:00
Egon Elbre
6307875203 monkit: fix monkit lock
Previously check-monitoring ignored ScopeNamed.

Change-Id: I3d0d472e722cb19a65b471a914f5124cd224fc34
2021-04-23 06:36:40 +00:00
Egon Elbre
4c9ed64f75 satellite/metabase/metaloop: move loop under metabase
Currently the loop handling is heavily related to the metabase rather
than metainfo.

metainfo over time has become related to the "public API" for accessing
the metabase data.

Currently updates monkit.lock, because monkit monitoring does not handle
ScopeNamed correctly. Needs a followup change to monitoring check.

Change-Id: Ie50519991d718dfb872ec9a0176a82e732c97584
2021-04-22 12:58:09 +03:00
JT Olio
86c41790ce satellite/metainfo/metaloop: add observability
we want to know a lot more about what's going on during
the operation of the metainfo loop. this patchset adds
more instrumentation to previously unmonitored but
interesting functions, and adds metrics that keep track
of how far through a specific loop we are. it also
adds mon:lock annotations, especially to the metainfo
loop run task, which recently changed, silently broke
some queries, and thus failed to alert us to spiking
run time issues.

Change-Id: I4358e2f2293d8ebe30eef497ba4e423ece929041
2021-03-30 14:32:05 -06:00
Kaloyan Raev
6f3d0c4ad5 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/repair/repair_test.go
	satellite/repair/repairer/segments.go

Change-Id: Ie51a56878bee84ad9f2d31135f984881a882e906
2021-02-02 19:19:04 +02:00
Kaloyan Raev
339d1212cd satellite/repair: don't remove expired segments from repair queue
It's impossible to time correctly this check. The segment may expire
just at the time we upload the repaired pieces to new storage nodes.
They will reject this as expired and the repair will fail.

Also, we penalize storage nodes with audit failure only if they fail
piece hash verification, i.e. return incorrect data, but only if they
have already deleted the piece.

So, it would be best if the repair service does not care about object
expiration at all. This is a responsibility of another service.

Removing this check will also simplify how we migrate this code
correctly to the metabase.

Change-Id: I09f7b372ae2602daee919a8a73cd0475fb263cd2
2021-02-02 16:13:01 +00:00
Kaloyan Raev
d0612199f0 Merge remote-tracking branch 'origin/main' into multipart-upload
Conflicts:
	go.mod
	go.sum
	satellite/metainfo/config.go
	satellite/metainfo/metainfo_test.go

Change-Id: I95cf3c1d020a7918795b5eec63f36112fdb86749
2021-02-01 14:32:12 +02:00
Isaac Hess
c92bda7e75 tally monkit: change location to monitor piecesize
When we observed the value for total piecesizes stored in the network,
we were doing it after converting them to byte-hours, rather than using
the actual piece sizes. This fixes that issue.

Change-Id: I1564d21b519f70eb59f298d97dbd777baf127723
2021-01-26 15:37:02 +00:00
Michał Niewrzał
ad3e3a38c5 Merge 'main' branch
Change-Id: Ia0db1b1f9ef3e0671d3f2208881b0abc3064e200
2021-01-04 12:13:45 +01:00
Rafael Gomes
8b2e4bfa7e satellite/metainfo/piecedeletion Remove spaces from metrics.
Change-Id: Iaf1d8a96a43087f2fcc579347f581e8a78a0fb58
2020-12-30 14:27:39 -03:00
Kaloyan Raev
2bb010e7c5 cmd: remove segment reaper
It was designed to detect and remove zombie segments in the PointerDB.
This tool should be not relevant with the MetabaseDB anymore.

Change-Id: I112552203b1329a5a659f69a0043eb1f8dadb551
2020-12-14 09:36:37 +00:00
Moby von Briesen
d75e4be11f satellite/{accounting, contact}: Remove periods and spaces from metrics.
Change-Id: I84179c2931293e3a1eb0ff8050416d25e481ce07
2020-12-03 15:33:01 +00:00
Moby von Briesen
0ec685b173 satellite/{satellitedb, repair/{queue, checker}}: Use new column "segmentHealth" instead of "numHealthy" in injured segments queue
We plan to add support for a new Reed-Solomon scheme soon, but our
repair queue orders segments by least number of healthy pieces first.
With a second RS scheme, fewer healthy pieces will not necessarily
correlate to lower health.

This change just adds the new column in a migration. A separate change
will add the new health function.

Right now, since we only support one RS scheme, behavior will not
change. Number of healthy pieces is being inserted as "segment health"
until the new health function is merged.

Segment health is calculated with a new priority function created in
commit 3e5640359. In order to use the function, a new config value is
added, called NodeFailureRate, representing the approximate probability
of any individual node going down in the duration of one checker run.

Change-Id: I51c4202203faf52528d923befbe886dbf86d02f2
2020-11-16 21:18:09 +00:00
Cameron Ayer
da9f1f0611 satellite/repair: add monkit counter for segments below minimum required
The current monkit reporting for "remote_segments_lost" is not usable for
triggering alerts, as it has reported no data. To allow alerting, two new
metrics "checker_segments_below_min_req" and "repairer_segments_below_min_req"
will increment by zero on each segment unless it is below the minimum
required piece count. The two metrics report what is found by the checker
and the repairer respectively.

Change-Id: I98a68bb189eaf68a833d25cf5db9e68df535b9d7
2020-11-11 12:48:23 +00:00
Moby von Briesen
7c3afe164b satellite/overlay: uncomment dq for offline and disable with feature flag
Change-Id: Ib39e2be32e880b822a94eddfb81af99a38843a27
2020-10-16 12:55:16 +00:00
Cameron Ayer
c2525ba2b5 satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.

solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.

Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
2020-09-29 20:38:22 +00:00
Cameron Ayer
3e343b683b cmd/segment-reaper: add metrics for zombie segments count
Change-Id: I106c6795946283165ba3de8465e5898346da1a3f
2020-08-26 18:42:59 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00
Isaac Hess
67a292d135 satellite/satellitedb: Monitor node tallies
We are adding a monkit evaluation for the total sum of data stored on
the nodes before it is inserted into the database. This will give us a
time-series history of total data stored so we can see it change over
time.

Change-Id: I41145a2d7a09c8e63b42ae578bd081035b60e529
2020-07-17 10:21:42 -06:00
Moby von Briesen
0b109c32e4 storagenode/piecestore/usedserials: add monkit metric for serials that are randomly deleted
This will give storagenode operators a better idea of whether the memory
allocated to the usedserials store is sufficient.

Change-Id: I5c30f2e39473a573f43409511ad9e2e32680479c
2020-06-09 17:04:37 -04:00
Moby von Briesen
290c006a10 satellite/repair/{checker,queue}: add metric for new segments added to repair queue
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration

Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
2020-05-27 06:23:47 +00:00
Moby von Briesen
178aa8b5e0 satellite/{metainfo,repair}: Delete expired segments from metainfo
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue

Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
2020-04-22 13:02:31 +00:00
paul cannon
ba5991dc86 satellite/repair: add monitoring for remote_segments_healthy_percentage
Change-Id: I6ad29fe1a947ac19d15e40ea33164a510eb33d4f
2020-03-17 17:45:59 +00:00
Moby von Briesen
8b72181a1f satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.

Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
2020-03-16 20:29:26 +00:00
paul cannon
79553059cb satellite/repair: put irreparable segments in irreparableDB
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).

Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.

This change also better differentiates some error cases from Repair()
for monitoring purposes.

Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
2020-03-09 21:45:16 +00:00
Jennifer Johnson
1c1750e6be removes bandwidth limiting
On satellite, remove all references to free_bandwidth column in nodes table.
On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated.

Protobuf message, NodeCapacity, is left intact for backwards compatibility.
Once this is released to all satellites, we can drop the column from the DB.

Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838
2020-03-04 14:04:00 +00:00
Moby von Briesen
6043d01c90 satellite/audit/verifier: add metric for number of successfully downloaded shares
Change-Id: Ia4f1dc6e088db802e340aaecf80cc7ef6dc237a4
2020-02-27 14:33:59 +00:00
Moby von Briesen
d5540c89a1 satellite/repair/checker: add monkit metrics for segments immediately above repair threshold
Record counts for segments at health=rt+1 through health=rt+5 for every checker
iteration.

Change-Id: I2a00c0bc34d17beb21cacdeab4dac77f755faefe
2020-02-26 20:27:15 +00:00
Ethan
208c05e3db Add metrics to track rate limit.
Add monkit metric for the rate-limit when the rate limit is hit
Logs warning with projectID

https://storjlabs.atlassian.net/browse/SM-165

Change-Id: I352dc40006021990d1bc66a999f62bbf8deb54db
2020-02-11 14:02:12 +00:00
Moby von Briesen
006a2824ba satellite/repair: lock monkit stats in checker and repairer
Change-Id: Ia10fc8da0177389a500359ce51d21a5806f3f7b1
2020-01-30 14:09:56 +00:00
Egon Elbre
082ec81714
uplink: move to storj.io/uplink (#3746) 2020-01-08 15:40:19 +02:00
Yingrong Zhao
7af42e3c10
satellite/metainfo, satellite/repair, uplink/eestream: add metric for download failed due to not enough pieces available (#3665) 2019-12-04 16:24:36 -05:00