Commit Graph

76 Commits

Author SHA1 Message Date
Natalie Ventura Villasana
6b1829f3c3
satellite/downtime: new chore estimates downtime
Adds EstimationChore to the downtime package, which is an
independent chore that finds offline nodes given a configurable
limit, then uptime checks those nodes, and sets a last contact
success or failure given a response. For failed nodes, the chore
updates the amount of downtime the node has been offline in the
DowntimeTracking table.

Design doc section: https://github.com/storj/storj/blob/master/docs/blueprints/storage-node-downtime-tracking.md#estimating-offline-time
Jira: https://storjlabs.atlassian.net/browse/V3-2545

Change-Id: I60af95803930bf9b33232b248bb20cca6f0e0b5f
2020-01-09 15:05:13 -05:00
Yingrong Zhao
76ee8a1b4c satellite: remove UptimeReputation configs from codebase
With the new storage node downtime tracking feature, we need remove current uptime reputation configs: UptimeReputationAlpha, UptimeReputationBeta, and
UptimeReputationDQ. This is the first step of removing the uptime
reputation columns from satellitedb

Change-Id: Ie8fab13295dbf545e33aeda0c4306cda4ba54e36
2020-01-08 18:54:15 +00:00
Natalie Ventura Villasana
1cb0f80a8d satellite/gracefulexit: dq node on exit fail
Disqualifies a node when the node fails to complete a graceful
exit.

Adds a new DisqualifyNode method to the overlay cache, since there
wasn't an existing method to disqualify a node but do nothing else
to its stats.

Adds checks to existing tests to make sure that a storage node that
fails a graceful exit is marked as disqualified in the overlay
cache.

https: //storjlabs.atlassian.net/browse/V3-3342
Change-Id: I4d554a519ab59db31ad3b8e28764c8683a6e3888
2020-01-06 19:16:26 -05:00
Moby von Briesen
6c2e4cc0cd satellite/overlay: Return NodeLastContact instead of a node dossier from
overlay.GetOfflineNodesLimited

We only care about node ID, address, and last contact success/failure
from the downtime service, so the overlay should only return these
values for the downtime-specific queries.

Change-Id: I08a6ecfdd2a12b82cae62e87d6adeab53975bfce
2020-01-06 17:12:30 -05:00
Ethan
05b406e992 satellite:{downtime,overlay}: Implement offline node detection chore
https://storjlabs.atlassian.net/browse/V3-3398

Change-Id: I598c3bad819026377d1d113c099dc9bba8b02742
2020-01-03 17:10:03 +00:00
Moby von Briesen
ff74b44c5f satellite/overlay: Add ability for overlay to get offline nodes ordered by last checked time
This is required for the downtime tracking service: https://storjlabs.atlassian.net/browse/V3-2545

Change-Id: I286cdc07d802393948eb10c25c45ba78cc3ceafc
2020-01-02 16:39:38 +00:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
Kaloyan Raev
5ee1a00857 satellite/overlay: filter reliable nodes from a list
Adds the KnownReliable method to Overlay Service that filters all nodes
from the given list to be only reliable nodes (online and qualified).
The method return []*pb.Node of reliable nodes. The pb.Node values are
ready for dialing.

The first use case is when deleting an object to efficiently dial all
reliable nodes holding a piece of that object and send them a delete
request.

Change-Id: I13e0a8666f3807c5c31ef1a1087476018a5d3acb
2019-12-17 21:20:08 +00:00
littleskunk
8b3444e088
satellite/nodeselection: don't select nodes that haven't checked in for a while (#3567)
* satellite/nodeselection: dont select nodes that havent checked in for a while

* change testplanet online window to one minute

* remove satellite reconfigure online window = 0 in repair tests

* pass timestamp into UpdateCheckIn

* change timestamp to timestamptz

* edit tests to set last_contact_success to 4 hours ago

* fix syntax error

* remove check for last_contact_success > last_contact_failure in IsOnline
2019-11-15 23:43:06 +01:00
Natalie Villasana
68a7790069 satellite/gracefulexit: select new node filtered by Distinct IP (#3435) 2019-11-06 16:38:51 -05:00
Maximillian von Briesen
54594e79c3 satellite/gracefulexit: add metrics on satellite for graceful exit (#3355) 2019-10-29 16:22:20 -04:00
Yingrong Zhao
fa1ac24e19
satellite/gracefulexit: add failure threshold check (#3329)
* add overall failure percentage check and inactive time frame check before sending a response to sno

* update comment

* delete node from transfer queue if it has been inactive for too long

* fix linting error

* add test config value

* fix nil pointer

* add config value into testplanet

* add unit test for overall failure threshold

* move timeframe threshold to chore

* update protolock

* add chore test

* add per peiece failure count logic

* change config name from EndpointMaxFailures to MaxFailuresPerPiece

* address comments

* fix linting error

* add error handling for no row returned from progress table

* fix test for graceful exit chore on storagenode

* fix typo InActive -> Inactive

* improve readability for failure threshold calculation

* update config lock

* change error handling for GetProgress in graceful exit endpoint on the satellite side

* return proper rpc error in endpoint

* add check in chore test for checking finish timestamp and queue
2019-10-24 12:24:42 -04:00
Maximillian von Briesen
abb567f6ae
cmd/satellite: add graceful exit reports command to satellite CLI (#3300)
* update lock file and add comment

* add created at and bytes transferred

* cleanup

* rename db func to GetGracefulExitNodesByTimeFrame

* fix flag

* split into two overlay functions

* := to =

* fix test

* add node not found error class

* fix overlay test

* suggested test changes

* review suggestions

* get exit status from overlay.Get()

* check rows.Err

* fix panic when ExitFinishedAt is nil

* fix comments in cmdGracefulExit
2019-10-22 21:06:01 -04:00
Natalie Villasana
45c35d7c3f
satellite/satellitedb: add exit_status column to nodes table (#3301) 2019-10-17 11:01:39 -04:00
Ethan Adams
a1275746b4
satellite/gracefulexit: Implement the 'process' endpoint on the satellite (#3223) 2019-10-11 17:18:05 -04:00
Maximillian von Briesen
0ea0d8c3da
satellite/overlay: remove overlay.IsVetted (#3203) 2019-10-08 09:25:41 -04:00
Jennifer Li Johnson
7ceaabb18e
Delete Bootstrap and Kademlia (#2974) 2019-10-04 16:48:41 -04:00
Natalie Villasana
4f2f8ae11b satellite/overlay: add UpdateExitStatus and GetExitingNodes for graceful exit (#3087) 2019-10-01 18:18:21 -04:00
paul cannon
53db517154 satellite/overlay: don't use transport observers (#2989) 2019-09-19 16:22:50 -04:00
Jess G
93788e5218
remove kademlia: create upsert query to update uptime (#2999)
* create upsert query for check-in method

* add tests

* fix lint err

* add benchmark test for db query

* fix lint and tests

* add a unit test, fix lint

* add address to tests

* replace print w/ b.Fatal

* refactor query per CR comments

* fix disqualified, only set if null

* fix query

* add version to updatecheckin query

* fix version

* fix tests

* change version for tests

* add version to tests

* add IP, add transport, mv unit test

* use node.address as arg

* add last ip

* fix lint
2019-09-19 11:37:31 -07:00
Egon Elbre
a801fab66a
all: add archview annotations (#2964) 2019-09-10 16:24:16 +03:00
Bryan White
a33106df1c
satellite/satellitedb: persist piece counts to/from db (#2803) 2019-08-27 14:37:42 +02:00
Egon Elbre
00b2e1a7d7 all: enable staticcheck (#2849)
* by having megacheck in disable it also disabled staticcheck

* fix closing body

* keep interfacer disabled

* hide bodies

* don't use deprecated func

* fix dead code

* fix potential overrun

* keep stylecheck disabled

* don't pass nil as context

* fix infinite recursion

* remove extraneous return

* fix data race

* use correct func

* ignore unused var

* remove unused consts
2019-08-22 13:40:15 +02:00
Bryan White
6400d63a6c
satellite/satellitedb: Add piece count column to nodes table (#2795) 2019-08-19 12:58:13 +02:00
Isaac Hess
e34b2c553c
Reduce UpdateAddress calls with address cache (#2681) 2019-08-06 16:56:12 -06:00
Egon Elbre
c8edeb0257
satellite/overlay: rename overlay.Cache to overlay.Service (#2717) 2019-08-06 19:35:59 +03:00