32 Commits

Author SHA1 Message Date
Egon Elbre
f2d8e97d97 satellite/satellitedb: simplify select nodes query construction
Change-Id: I07009b28762d4485929a2a999e8f4be8179bee51
2021-10-22 07:41:07 +00:00
Yingrong Zhao
646ce5b8cc satellite/overlay: remove reputation logic from overlay
Change-Id: I3492860e4537c7a8e4e824ec4c9c8d179134a0c0
2021-07-28 15:15:28 -04:00
Yingrong Zhao
f8914ccce0 satellite/{repair, overlay}: use reputation store in repair
Change-Id: I48db9e68f48239d48621ccc77d33618ecb83ce1a
2021-07-28 13:22:05 -04:00
Egon Elbre
8f15f975a2 satellite/overlay: improve contended update checkin
Improve UpdateCheckIn on a contended row:

  name                             old time/op  new time/op delta
  UpdateCheckInContended-100x-32   2.29s ±55%   0.17s ±61%  -92.45%  (p=0.008 n=5+5)

Change-Id: I053ab9f1cff136c306e5fb57f5e355cdc0269a8c
2021-05-16 20:41:12 +03:00
Egon Elbre
19e3dc4ec0 satellite/overlay: rename NodeSelectionCache to UploadSelectionCache
It wasn't obvious that NodeSelectionCache was only for uploads.

Change-Id: Ifeeaa6fdb50a4b7916245b48d8634d70ac54459c
2021-01-28 14:56:53 +02:00
Cameron Ayer
d14607a5f7 satellite/{contact,nodestats,overlay,satellitedb}: remove references to total_uptime_count and uptime_success_count columns
Change-Id: I1f92022909bc564e9b1e31bf937fdfe7c16554de
2021-01-19 15:43:02 -05:00
Ethan Adams
6070018021 satellite/overlay: use AS OF SYSTEM TIME with Cockroach
Query the nodes table using AS OF SYSTEM TIME '-10s' (by default) when on CRDB to alleviate contention on the nodes table and minimize CRDB retries. Queries for standard uploads are already cached, and node lookups for graceful exit uploads have retry logic, so it isn't necessary for the nodes returned to be current.
2020-12-22 21:07:07 +02:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some tests that did this have been changed slightly
so that they pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
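The reasoning in commit dc67ce74c9 (a node is online exactly when its audit outcome is not offline) can be sketched as follows. Only `AuditOutcome` and `AuditOffline` are named in the message; the other constant names and the `nodeOnline` helper are assumptions for illustration.

```go
package main

import "fmt"

// AuditOutcome is the result of auditing a node.
type AuditOutcome int

const (
	AuditSuccess AuditOutcome = iota // hypothetical name
	AuditFailure                     // hypothetical name
	AuditUnknown                     // hypothetical name
	AuditOffline
)

// nodeOnline derives the former IsUp value from the audit outcome:
// any outcome other than AuditOffline means the node responded.
func nodeOnline(outcome AuditOutcome) bool {
	return outcome != AuditOffline
}

func main() {
	names := []string{"success", "failure", "unknown", "offline"}
	for i, name := range names {
		fmt.Printf("%s: online=%v\n", name, nodeOnline(AuditOutcome(i)))
	}
}
```

With a single source of truth, a request can no longer claim a successful audit and an offline node at the same time.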
Jennifer Johnson
4e2413a99d satellite/satellitedb: uses vetted_at field to select for reputable nodes
Additionally, this PR changes NewNodeFraction devDefault and testplanet config from 0.05 to 1.
This is because many tests relied on selecting nodes that were reputable based on audit and uptime
counts of 0, in effect, selecting new nodes as reputable ones.
However, since reputation is now indicated by a vetted_at db field that is explicitly set
rather than implied by audit and uptime counts, it would be more complicated to try to
update all of the nodes' reputations before selecting nodes for tests.
Now we just allow all test nodes to be new if needed.

Change-Id: Ib9531be77408662315b948fd029cee925ed2ca1d
2020-09-04 16:45:32 +00:00
Moby von Briesen
60a95d0dc9 satellite/{satellitedb,overlay}: Enable offline suspension and review period
When a node's audit history "online score" passes below a configured
threshold, the node goes into "offline suspension" mode and begins a
review period, where the operator is given an opportunity to bring their
node back online.
After the review period passes, offline suspension is turned off for the
node.

In the future, if a node still has a bad online score at the end of the
review period, it will be disqualified. This is disabled right now.
In the future, if a node is in offline suspension, it will be treated as
"unhealthy". Right now, there are no consequences for being in offline
suspension.

Minor changes:
* Moves AuditHistoryConfig out of UpdateStats/BatchUpdateStats args and
into UpdateRequest.
* Adds "now" argument to UpdateStats/BatchUpdateStats args for easy
testing.
* Changes formatting strings inside buildUpdateStatement to use specific
types.

Change-Id: I032b60298840fc16e6ef831da750f2d57619a397
2020-08-28 16:35:48 +00:00
Moby von Briesen
959cd5cd83 satellite/satellitedb: Update audit history from overlay.UpdateStats and overlay.BatchUpdateStats
Change-Id: Ib530b61895ca4a8b12ba022c408a416b237b56d7
2020-08-20 22:46:28 +00:00
Egon Elbre
5bdcd86fa7 ci: test benchmarks
This runs each benchmark for one iteration to ensure that they are
valid. Unfortunately, it does not give any useful metrics as output.

Change-Id: I68940398c8dd849aed656bd12656f48d5df10128
2020-07-10 13:26:49 +00:00
Cameron Ayer
3b4b5f45c7 satellite: replace references to Suspended with UnknownAuditSuspended
Change-Id: I3d2d00c95954c0546ad077702617895f262926ef
2020-06-23 14:19:22 +00:00
Egon Elbre
f68e7b3fde satellite/overlay: replace pb.InfoResponse
pb.InfoResponse wasn't used for protocol buffer communication, but
instead as a satellite type.

Change-Id: I755619f2deec5b76c4fe488591b7d8c1b9fcdafb
2020-06-16 15:16:55 +03:00
Egon Elbre
678b859172 satellite/overlay: remove MinimumRequiredNodes
In non-test code we were only using RequestedCount, so there is no need to have
MinimumRequiredNodes.

Change-Id: I40736f4b028b41e94abfdeb221bce5aa86a5cb82
2020-05-07 15:41:23 +00:00
Egon Elbre
4e94da3fda satellite/overlay: add feature flag for node selection cache
Also distinguish the purpose for selecting nodes to avoid potential
confusion about what should allow caching and what shouldn't.

Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61
2020-05-06 16:13:47 +03:00
Jessica Grebenschikov
6a6427526b satellite/overlay: remove old updateaddress method
The UpdateAddress method used to be called when storage nodes checked in with the Satellite, but once the contact service was created this method was no longer used. This PR finally removes it.

Change-Id: Ib3f83c8003269671d97d54f21ee69665fa663f24
2020-04-30 06:41:48 +00:00
Jess G
825226c98e satellite/overlay: use node selection cache for uploads (#3859)
* satellite/overlay: use node selection cache for uploads

Change-Id: Ibd16cccee979d0544f2f4a01749af9f36f02a6ad

* fix config lock

Change-Id: Idd307e4dee8ab92749f1ec3f996419ea0af829fd

* start fixing tests

Change-Id: I207d373a3b2a2d9312c9e72fe9bd0b01e06ad6cf

* fix test, add some more

Change-Id: I82b99c2004fca2510965f9b389f87dd4474bc722

* change config name

Change-Id: I0c0f7fc726b2565dc3828cb723f5459a940f2a0b

* add benchmarks

Change-Id: I05fa25bff8d5b65f94d918556855b95163d002e9

* revert bench to put in different PR

Change-Id: I0f6942296895594768f19614bd7b2e3b9b106ade

* add staleness to benchmark

Change-Id: Ia80a310623d5a342afa6d835402170b531b0f870

* add cache config to testplanet

Change-Id: I39abdab8cc442694da543115a9e470b2a8a25dff

* have repair select old way

Change-Id: I25a938457d7d1bcf89fd15130cb6b0ac19585252

* lower testplanet config time

Change-Id: Ib56a2ed086c06bc6061388d15a10a2526a663af7

* fix test

Change-Id: I3868e9cacde2dfbf9c407afab04dc5fc2f286f69
2020-04-24 09:11:04 -07:00
Jess G
7a4dcd61f7 satellite/overlay: add changes to selected node benchmarks (#3862)
* add changes to selected node benchmarks

Change-Id: I0259af155f9151cc2c7830d10f8907634c5e494f

* fix lint

Change-Id: I6c7b82bbfa579b468712f90fc03b12a931874a54

* restart jenkins

Change-Id: I1d7300343e94e695cd1c93a3b59895f52bbcb11e
2020-04-23 15:30:50 -07:00
Moby von Briesen
d7794a4851 satellite/overlay: hardcode default values for audit alpha/beta
Alpha=1 and beta=0 are the expected first values for any alpha/beta
reputation system we are using in the codebase. So we are removing the
configurability of these values.

Change-Id: Ic61861b8ea5047fa1438ea6609b1d0048bf0abc3
2020-04-14 19:12:40 +00:00
Natalie Villasana
cf80b3caf3 satellite/overlay: combine SelectStorageNodes and SelectNewStorageNodes (#3831)
2020-04-09 11:19:44 -04:00
Egon Elbre
c970969503 satellite/overlay: add benchmark for node selection
Change-Id: I15b767a78b662f8276e656b3fb73a15ec59e76c8
2020-03-27 23:09:29 +02:00
Moby von Briesen
8b72181a1f satellite/{audit,overlay,satellitedb}: implement unknown audit reputation and suspension
* change overlay.UpdateStats to allow a third audit outcome. Now it can
handle successful, failed, and unknown audits.
* when "unknown audit reputation"
(unknownAuditAlpha/(unknownAuditAlpha+unknownAuditBeta)) falls below the
DQ threshold, put node into suspension.
* when unknown audit reputation goes above the DQ threshold, remove node
from suspension.
* record unknown audits from audit reporter.
* add basic tests around unknown audits and suspension.

Change-Id: I125f06f3af52e8a29ba48dc19361821a9ff1daa1
2020-03-16 20:29:26 +00:00
Jessica Grebenschikov
803e2930f4 satellite: use IP for all uplink operations, use hostname for audit and repairs
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address

This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base

Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
2020-03-11 09:11:40 -07:00
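Commit 803e2930f4 describes `last_net` as the /24 subnet of the resolved IP. A minimal sketch of that derivation for IPv4, using only the Go standard library, is shown below. The `lastNet` helper is hypothetical, and IPv6 handling is deliberately left out because the message does not describe it.

```go
package main

import (
	"fmt"
	"net"
)

// lastNet returns the /24 network address for an IPv4 address string.
func lastNet(ipStr string) (string, error) {
	ip := net.ParseIP(ipStr)
	if ip == nil {
		return "", fmt.Errorf("invalid IP %q", ipStr)
	}
	ip4 := ip.To4()
	if ip4 == nil {
		return "", fmt.Errorf("not an IPv4 address: %q", ipStr)
	}
	// Keep the first 24 bits and zero the last octet.
	return ip4.Mask(net.CIDRMask(24, 32)).String(), nil
}

func main() {
	for _, addr := range []string{"192.168.1.77", "10.0.5.200", "not-an-ip"} {
		subnet, err := lastNet(addr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", addr, subnet)
	}
}
```

Two nodes whose addresses differ only in the last octet map to the same value, which is what makes the field usable for grouping nodes by subnet.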
Jennifer Johnson
1c1750e6be removes bandwidth limiting
On satellite, remove all references to free_bandwidth column in nodes table.
On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated.

Protobuf message, NodeCapacity, is left intact for backwards compatibility.
Once this is released to all satellites, we can drop the column from the DB.

Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838
2020-03-04 14:04:00 +00:00
Yingrong Zhao
76ee8a1b4c satellite: remove UptimeReputation configs from codebase
With the new storage node downtime tracking feature, we need to remove the current uptime reputation configs: UptimeReputationAlpha, UptimeReputationBeta, and
UptimeReputationDQ. This is the first step of removing the uptime
reputation columns from satellitedb.

Change-Id: Ie8fab13295dbf545e33aeda0c4306cda4ba54e36
2020-01-08 18:54:15 +00:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
littleskunk
8b3444e088 satellite/nodeselection: don't select nodes that haven't checked in for a while (#3567)
* satellite/nodeselection: don't select nodes that haven't checked in for a while

* change testplanet online window to one minute

* remove satellite reconfigure online window = 0 in repair tests

* pass timestamp into UpdateCheckIn

* change timestamp to timestamptz

* edit tests to set last_contact_success to 4 hours ago

* fix syntax error

* remove check for last_contact_success > last_contact_failure in IsOnline
2019-11-15 23:43:06 +01:00
Egon Elbre
ee6c1cac8a private: rename internal to private (#3573)
2019-11-14 21:46:15 +02:00
Egon Elbre
9ceff9f9c6 satellite/overlay: move CheckIn benchmark to overlay (#3095)
2019-09-20 16:35:52 -04:00
ethanadams
c9b46f2fe2 V3-1987: Optimize audits stats persistence (#2632)
* Added batch update stats for recordAuditSuccessStatus
* Added batch update stats to recordAuditFailStatus
* added configurable batch size
* build individual update/delete statements so the statements can be batched into 1 call to the DB
* notified #config-changes channel and ran make update-satellite-config-lock
* updated tests to use batch update stats
2019-07-31 13:21:06 -04:00
Egon Elbre
5d0816430f rename all the things (#2531)
* rename pkg/linksharing to linksharing
* rename pkg/httpserver to linksharing/httpserver
* rename pkg/eestream to uplink/eestream
* rename pkg/stream to uplink/stream
* rename pkg/metainfo/kvmetainfo to uplink/metainfo/kvmetainfo
* rename pkg/auth/signing to pkg/signing
* rename pkg/storage to uplink/storage
* rename pkg/accounting to satellite/accounting
* rename pkg/audit to satellite/audit
* rename pkg/certdb to satellite/certdb
* rename pkg/discovery to satellite/discovery
* rename pkg/overlay to satellite/overlay
* rename pkg/datarepair to satellite/repair
2019-07-28 08:55:36 +03:00