Commit Graph

115 Commits

Author SHA1 Message Date
Cameron Ayer
c2525ba2b5 satellite/{repair,satellitedb}: clean up healthy segments from repair queue at end of checker iteration
Repair workers prioritize the most unhealthy segments. This has the consequence that when we
finally begin to reach the end of the queue, a good portion of the remaining segments are
healthy again as their nodes have come back online. This makes it appear that there are more
injured segments than there actually are.

solution:
Any time the checker observes an injured segment it inserts it into the repair queue or
updates it if it already exists. Therefore, we can determine which segments are no longer
injured if they were not inserted or updated by the last checker iteration. To do this we
add a new column to the injured segments table, updated_at, which is set to the current time
when a segment is inserted or updated. At the end of the checker iteration, we can delete any
items where updated_at < checker start.

Change-Id: I76a98487a4a845fab2fbc677638a732a95057a94
2020-09-29 20:38:22 +00:00
Michal Niewrzal
27a9d14e2a satellite/repair: use metabase.SegmentKey type in repair package
Another change which is a part of refactoring to replace path parameter
(string/[]byte) with key paramter (metabase.SegmentKey)

Change-Id: I617878442442e5d59bbe5c995f913c3c93c16928
2020-09-08 19:35:20 +00:00
Michal Niewrzal
9202295348 satellite/metainfo: replace ScopedPath with metabase.SegmentLocation
Change-Id: I7e89c9e8eaeae58be828a32ad47ed3028501f4c7
2020-09-04 10:06:52 +00:00
Michal Niewrzal
aa47e70f03 satellite/metainfo: use metabase.SegmentKey with metainfo.Service
Instead of using string or []byte we will be using dedicated type
SegmentKey.

Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220
2020-09-03 15:11:32 +00:00
JT Olio
b872fe52a1 satellite/repair: switch to piecestore.UploadReader
Change-Id: Ia99ad2cf5422e6ba1d98b32946740f9cadba7b6d
2020-09-01 09:26:54 -06:00
Michal Niewrzal
0604a672c1 satellite/metainfo: use metabase in loop
Change-Id: I1bb0c6fe0a762895fde950690b06f7dd9d77e178
2020-09-01 10:06:16 +00:00
Moby von Briesen
5d21e85529 satellite/audit/queue: Separate audit queue into two separate structs.
* The audit worker wants to get items from the queue and process them.
* The audit chore wants to create new queues and swap them in when the
old queue has been processed.

This change adds a "Queues" struct which handles the concurrency
issues around the worker fetching a queue and the chore swapping a new
queue in. It simplifies the logic of the "Queue" struct to its bare
bones, so that it behaves like a normal queue with no need to understand
the details of swapping and worker/chore interactions.

Change-Id: Ic3689ede97a528e7590e98338cedddfa51794e1b
2020-08-31 20:51:25 +00:00
Egon Elbre
3ca405aa97 satellite/orders: use metabase types as arguments
Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef
2020-08-28 15:52:37 +03:00
Moby von Briesen
5dfe27f175 satellite/{repair,overlay}: Use overlay NodeSelectionCache for repair uploads
This change removes the overlay function FindStorageNodesForRepair,
which skips using the node selection cache and hits the database
directly. Otherwise, it is functionally identical to
FindStorageNodesForUpload, which checks the node selection cache first.

When selecting nodes for PUT_REPAIRs, we now call
FindStorageNodesForUpload instead of FindStorageNodesForRepair to reduce
database load.

Change-Id: If34e109695b2ed2b8fb6759115bf769a3459684e
2020-08-04 12:50:12 -04:00
Moby von Briesen
76030a8237 satellite/audit/{queue,chore}: Wait for audit queue to be finished before swapping
* Do not swap the active audit queue with the pending audit queue until
the active audit queue is empty.
* Do not begin creating a new pending audit queue until the existing
pending audit queue has been swapped to the active queue.

Change-Id: I81db5bfa01458edb8cdbe71f5baeebdcb1b94317
2020-07-28 16:56:26 +00:00
Egon Elbre
44f9193404 satellite/orders: make optimal threshold multiplier into an argument
It feels weird having a repairer configuration part of order services.
Let's have a single source of truth for it.

Change-Id: I24f7c897aec80f3293f8af24876cbb6733d85a0b
2020-07-24 16:35:59 +03:00
Cameron Ayer
e14f7a3fb4 satellite/repair: update healthyPieces and unhealthyPieces after CreateGetRepairOrderLimits
Inside CreateGetRepairOrderLimits we pass in a list of healthy pieces,
but when we query node info from this list we apply the "reliable" filter
again. We sometimes end up with nodes which at first were healthy, but then
became unhealthy, and thus can be repaired, but we do not update the 'unhealthyPieces'
list with these nodes.

This causes an error, 'piece to add already exists', as we fail to remove these
pieces from the pointer before replacing them with repaired pieces.

Change-Id: I6e2445f342ac117ded30351fa7e5e523c9ec26bd
2020-07-23 13:24:46 +00:00
Egon Elbre
d8dcae3075 all: fix error checking
Change-Id: Ia0da1bbd6ce695139922f94096c2419281905e32
2020-07-16 19:13:14 +03:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
paul cannon
bbdb351e5e all: use jackc/pgx in place of lib/pq
What:

Use the github.com/jackc/pgx postgresql driver in place of
github.com/lib/pq.

Why:

github.com/lib/pq has some problems with error handling and context
cancellations (i.e. it might even issue queries or DML statements more
than once! see https://github.com/lib/pq/issues/939). The
github.com/jackx/pgx library appears not to have these problems, and
also appears to be better engineered and implemented (in particular, it
doesn't use "exceptions by panic"). It should also give us some
performance improvements in some cases, and even more so if we can use
it directly instead of going through the database/sql layer.

Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044
2020-07-13 15:54:41 +00:00
paul cannon
4997fd55d0 satellite/repair: remove healthy from irreparabledb
Change-Id: Ia9d300d0359883f03734d0bdf204d56d6642ce34
2020-06-26 21:26:00 +00:00
Cameron Ayer
3b4b5f45c7 satellite: replace references to Suspended with UnknownAuditSuspended
Change-Id: I3d2d00c95954c0546ad077702617895f262926ef
2020-06-23 14:19:22 +00:00
Egon Elbre
410d897840 satellite: fix string(int) conversions
Change-Id: I54c6ca8c2dad3c321175f72271b7536cc2a4df09
2020-06-12 06:41:34 +00:00
Moby von Briesen
e7e69f383a satellite/repair/repairer/ec.go: Add monkit tracing for ec repairer
Adds monkit tracing for ecrepairer.downloadAndVerifyPiece and
ecrepairer.putPiece so we can get more accurate estimates of node
performance during repair.

Change-Id: Ic05025bf3c493bb3d6f5d325d090c5b7c9e5465d
2020-05-29 14:00:45 +00:00
Moby von Briesen
acf8b72cd0 satellite/repair/repairer: cut off long tail when minimum number of required uploads is met
This will speed up the Put step of repair by not waiting to time out for
a handful of slow nodes, at the expense of a slightly less durable
pointer. It will still repair to the optimal threshold, but not every
node that is selected will end up in the pointer.

Change-Id: I02a0658e3fe6fc0383f26af0f50a065b8b11a651
2020-05-28 16:25:28 -04:00
Moby von Briesen
290c006a10 satellite/repair/{checker,queue}: add metric for new segments added to repair queue
* add monkit stat new_remote_segments_needing_repair, which reports the
number of new unhealthy segments in the repair queue since the previous
checker iteration

Change-Id: I2f10266006fdd6406ece50f4759b91382059dcc3
2020-05-27 06:23:47 +00:00
Egon Elbre
bef84a5f9d storagenode: remove dependency to overlay.NodeDossier
This is the last dependency from storage node to satellite.

Change-Id: I12f7abb91e84f823ba5af126c6e2979519838612
2020-05-21 08:37:13 +03:00
Egon Elbre
941d10cbc3 private/testplanet: remove Peer.Local()
Currently storagenode depends on overlay.NodeDossier, this is the first
step in removing it.

Change-Id: I034a3f1601835f8349bd41752455022e19bcc707
2020-05-20 11:05:34 +00:00
Egon Elbre
ed627144ed all: use DialNodeURL throughout the codebase
Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c
2020-05-20 10:36:30 +00:00
Egon Elbre
ec589a8289 all: fix comments about grpc
Change-Id: Id830fbe2d44f083c88765561b6c07c5689afe5bd
2020-05-11 13:05:34 +03:00
Egon Elbre
bcd93ee375 private/testplanet: add StopNodeAndUpdate
This was commonly used and code with it can be simplified.

Change-Id: I2f2b91f7de54269aee6ef027f97f9e8a7d222e39
2020-05-08 13:02:19 +00:00
Moby von Briesen
de366537a8 satellite/satellitedb/overlaycache: fix behavior around gracefully exited nodes
Sometimes nodes who have gracefully exited will still be holding pieces
according to the satellite. This has some unintended side effects
currently, such as nodes getting disqualified after having successfully
exited.
* When the audit reporter attempts to update node stats, do not update
stats (alpha, beta, suspension, disqualification) if the node has
finished graceful exit (audit/reporter_test.go TestGracefullyExitedNotUpdated)
* Treat gracefully exited nodes as "not reputable" so that the repairer
and checker do not count them as healthy (overlay/statdb_test.go
TestKnownUnreliableOrOffline, repair/repair_test.go
TestRepairGracefullyExited)

Change-Id: I1920d60dd35de5b2385a9b06989397628a2f1272
2020-04-28 23:58:43 +00:00
Jess G
825226c98e
satellite/overlay: use node selection cache for uploads (#3859)
* satellite/overlay: use node selection cache for uploads

Change-Id: Ibd16cccee979d0544f2f4a01749af9f36f02a6ad

* fix config lock

Change-Id: Idd307e4dee8ab92749f1ec3f996419ea0af829fd

* start fixing tests

Change-Id: I207d373a3b2a2d9312c9e72fe9bd0b01e06ad6cf

* fix test, add some more

Change-Id: I82b99c2004fca2510965f9b389f87dd4474bc722

* change config name

Change-Id: I0c0f7fc726b2565dc3828cb723f5459a940f2a0b

* add benchmarks

Change-Id: I05fa25bff8d5b65f94d918556855b95163d002e9

* revert bench to put in different PR

Change-Id: I0f6942296895594768f19614bd7b2e3b9b106ade

* add staleness to benchmark

Change-Id: Ia80a310623d5a342afa6d835402170b531b0f870

* add cache config to testplanet

Change-Id: I39abdab8cc442694da543115a9e470b2a8a25dff

* have repair select old way

Change-Id: I25a938457d7d1bcf89fd15130cb6b0ac19585252

* lower testplante config time

Change-Id: Ib56a2ed086c06bc6061388d15a10a2526a663af7

* fix test

Change-Id: I3868e9cacde2dfbf9c407afab04dc5fc2f286f69
2020-04-24 09:11:04 -07:00
Moby von Briesen
178aa8b5e0 satellite/{metainfo,repair}: Delete expired segments from metainfo
* Delete expired segments in expired segments service using metainfo
loop
* Add test to verify expired segments service deletes expired segments
* Ignore expired segments in checker observer
* Modify checker tests to verify that expired segments are ignored
* Ignore expired segments in segment repairer and drop from repair queue
* Add repair test to verify that a segment that expires after being
added to the repair queue is ignored and dropped from the repair queue

Change-Id: Ib2b0934db525fef58325583d2a7ca859b88ea60d
2020-04-22 13:02:31 +00:00
Jess G
75b9a5971e
satellite: update log levels (#3851)
* satellite: update log levels

Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f

* log version on start up for every service

Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de

* Use monkit for tracking failed ip resolutions

Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2

* fix compile errors

Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf

* add request limit value to storage node rpc err

Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5

* we cant track storage node ids in monkit metrics so lets use logging to track that for expired orders

Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887

* fix build errs

Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47
2020-04-15 12:32:22 -07:00
Egon Elbre
6492b13d81 all: remove old uuid
Change-Id: I3a137f73456f010c37d3933dbe12cbbb840b809f
2020-04-02 19:30:36 +03:00
Egon Elbre
0a69da4ff1 all: switch to storj.io/common/uuid
Change-Id: I178a0a8dac691e57bce317b91411292fb3c40c9f
2020-03-31 19:16:41 +03:00
littleskunk
048ca4558f
satellite/repair: clean up logging (#3833)
Co-authored-by: Michal Niewrzal <michal@storj.io>
2020-03-30 11:59:56 +02:00
Egon Elbre
480ea1e4b5 satellite/repair/repairer: fix temporary file handling
Change-Id: Ice1a467510737b3375c018ae37b16431c7dffe9e
2020-03-27 21:36:23 +02:00
Moby von Briesen
a933bcc99a satellite/repair/repairer/ec.go: add option for downloading pieces onto disk instead of in memory during repair
Add flag to satellite repairer, "InMemoryRepair" that allows the
satellite to decide whether to download the entire segment being
repaired into memory (this is what the satellite already does), or to
download it into temporary files on disk that will be read from in the
upload phase of repair.

This should help with handling high repair traffic on satellites that
cannot afford to spend 64mb of memory per repair worker.

Updates tests to test repair for both in memory and to disk.

Change-Id: Iddf591e165621497c98533d45bfea3c28b08a194
2020-03-27 16:41:00 +00:00
Egon Elbre
e8f18a2cfe private/testplanet: expose storagenode and satellite Config
Change-Id: I80fe7ed8ef7356948879afcc6ecb984c5d1a6b9d
2020-03-27 17:01:25 +02:00
paul cannon
ba5991dc86 satellite/repair: add monitoring for remote_segments_healthy_percentage
Change-Id: I6ad29fe1a947ac19d15e40ea33164a510eb33d4f
2020-03-17 17:45:59 +00:00
Moby von Briesen
2f991b6c56 satellite/{overlay, satellitedb}: account for suspended field in overlay cache
Make sure that suspended nodes are treated appropriately by the overlay
cache. This means we should expect the following behavior:
* suspended nodes (vetted or not) should not be selected for uploading
new segments
* suspended nodes should be treated by the checker and repairer as
"unhealthy", and should be removed upon successful repair

This commit also removes unused overlay functionality.

Fixes a bug with commit 8b72181a1f where
the audit reporter was automatically suspending nodes regardless of
audit outcome (see test added).

Tests:
* updates repair tests to ensure that a suspended node is treated as
unhealthy and will be removed from the pointer on successful repair
* updates overlay tests for KnownUnreliableOrOffline and KnownReliable
to expect suspended nodes to be considered "unreliable"
* adds satellitedb test that ensures overlay.SelectStorageNodes and
overlay.SelectNewStorageNodes do not include suspended nodes
* adds audit reporter test to ensure that different audit outcomes
result in the correct suspended/disqualified states

Change-Id: I40dba67278c8e8d2ce0bcec5e0a5cb6e4ce2f561
2020-03-17 17:14:56 +00:00
Moby von Briesen
8b72181a1f satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.

Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
2020-03-16 20:29:26 +00:00
Bill Thorp
94c11c5212 satellite: remove some unnecessary UTC() calls
Fixes some easy cases of extraneous UTC() calls

Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a
2020-03-13 13:49:44 +00:00
Jess G
39cb821196
satellite/overlay: rm combinedcache, fix IP naming to be network (#3798)
* rn combinedcache, rm dns node lookup

Change-Id: I239f07211764b097d851230d8c81900a47756e9e

* excludeIPs -> excludedNetworks

Change-Id: Ifa6f44ab17457cdd5aff4cd5694296867c18b179

* use lowercase var name

Change-Id: I825aad2b718c71f455e747be18f8cabd02aabe55

* update Getnetwork name

Change-Id: I002a1b7bc6b4ef40159c0cd2b0ef209f80a9c503

* fix comments

Change-Id: Ibddf5b9ffa9d685af6c392d893db063ef18e45fa

* update comments with ipv6

Change-Id: I31758b7d4979e7c27d014668f4fb532ad838cda2

Co-authored-by: Stefan Benten <mail@stefan-benten.de>
2020-03-12 11:37:57 -07:00
Jessica Grebenschikov
803e2930f4 satellite: use IP for all uplink operations, use hostname for audit and repairs
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address

This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base

Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
2020-03-11 09:11:40 -07:00
paul cannon
79553059cb satellite/repair: put irreparable segments in irreparableDB
Previously, we were simply discarding rows from the repair queue when
they couldn't be repaired (either because the overlay said too many
nodes were down, or because we failed to download enough pieces).

Now, such segments will be put into the irreparableDB for further
and (hopefully) more focused attention.

This change also better differentiates some error cases from Repair()
for monitoring purposes.

Change-Id: I82a52a6da50c948ddd651048e2a39cb4b1e6df5c
2020-03-09 21:45:16 +00:00
Moby von Briesen
e4da7bd9cd satellite/repair/checker: use repair override if available in checker and irreparable
In production, the satellite is overriding the default repair threshold
(35) to a higher value (52). In some places in the checker and
irreparable processes, the repair threshold on the redundancy scheme is
used in place of the override value. This fixes those cases.

Change-Id: Ie7387217d9fb3886f050b5e5b67be51f276196de
2020-03-06 15:39:53 -05:00
Bill Thorp
e99e675fb1 satellite/satellitedb: use time zones with all timestamps
The migration was broken into one migration per table to reduce table locking and reduce the
chances of failure due to SQL timeouts.

Of the 14 fields that lacked time zones, only the 3 named 'interval_start` seemed to have non-UTC data in them.
These fields are fixed in the migration by removing the +00 and adding  AT TIME ZONE current_setting('TIMEZONE')
Field with good data are migrated by adding AT TIME ZONE 'UTC'

Note that postgres's timezone() is different than cockroach's timezone() so AT TIME ZONE is used.

https://storjlabs.atlassian.net/browse/SM-104

Change-Id: I410f2f1d7c11b143f17844347f37e6f4b1e70fce
2020-03-05 21:11:25 +00:00
Jennifer Johnson
1c1750e6be removes bandwidth limiting
On satellite, remove all references to free_bandwidth column in nodes table.
On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated.

Protobuf message, NodeCapacity, is left intact for backwards compatibility.
Once this is released to all satellites, we can drop the column from the DB.

Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838
2020-03-04 14:04:00 +00:00
Moby von Briesen
d5540c89a1 satellite/repair/checker: add monkit metrics for segments immediately above repair threshold
Record counts for segments at health=rt+1 through health=rt+5 for every checker
iteration.

Change-Id: I2a00c0bc34d17beb21cacdeab4dac77f755faefe
2020-02-26 20:27:15 +00:00
Moby von Briesen
4e5a7f13c7 satellite/repair/queue: Prioritize selection of items off repair queue by segment health
Add a column to the repair queue table in the satellite db for healthy
piece count. When an item is selected from the repair queue, the least
durable segment that has not been attempted in the past hour should be
selected first. This prevents our repairer from getting stuck doing work
on segments that are close to the repair threshold while allowing
segments that are more unhealthy to degrade further.

The migration also clears the repair queue so that the migration runs
quickly and we can properly account for segment health in future repair
work.

We do not select items off the repair queue that have been attempted in
the past six hours. This was changed from on hour to allow us time to
try a wider variety of segments when the repair queue is very large.

Change-Id: Iaf183f1e5fd45cd792a52e3563a3e43a2b9f410b
2020-02-26 09:54:16 -05:00
paul cannon
92d86fa044 satellite/repair: fix repair concurrency
This new repair timeout (configured as TotalTimeout) will include both
the time to download pieces and the time to upload pieces, as well as
the time to pop the segment from the repair queue.

This is a move from Github PR #3645.

Change-Id: I47d618f57285845d8473fcd285f7d9be9b4318c8
2020-02-24 19:57:09 +00:00
Egon Elbre
5342dd9fe6 go.mod: update uplink
Change-Id: I867a6a1eef8aa5d60bb676e5112b98c4192ce811
2020-02-21 16:08:12 +02:00