Commit Graph

154 Commits

Author SHA1 Message Date
Michal Niewrzal
b3aa28cc02 satellite/gracefulexit: migrate to metabase
Change-Id: I8be9cc68894124427e4a30d7631126b3afb1f281
2020-12-18 10:57:39 +00:00
Stefan Benten
494bd5db81
all: golangci-lint v1.33.0 fixes (#3985) 2020-12-05 17:01:42 +01:00
Kaloyan Raev
53b7fd7b00 satellite/{audit,gracefulexit}: remove logic for PieceHashesVerified
We now have the piece hashes verified for all segments on all production
satellites. We can remove the code that handles the case where piece
hashes are not verified. This would make easier the migration of
services from PointerDB to the new metabase.

For consistency, PieceHashesVerified is still set to true in PointerDB
for new segments.

Change-Id: Idf0ccce4c8d01ae812f11e8384a7221d90d4c183
2020-11-24 11:09:48 +02:00
Moby von Briesen
db6bc6503d satellite/metainfo: Update metainfo RS config to more easily support multiple RS schemes.
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.

RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).

Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
2020-11-09 22:16:13 +00:00
Egon Elbre
76f4619a9c {satellite,storagenode}/gracefulexit: ensure client is closed
Change-Id: I576a955a5578caf7fcbee832beca28cef2b0c83e
2020-10-27 23:27:07 +02:00
Kaloyan Raev
92a2be2abd satellite/metainfo: get away from using pb.Pointer in Metainfo Loop
As part of the Metainfo Refactoring, we need to make the Metainfo Loop
working with both the current PointerDB and the new Metabase. Thus, the
Metainfo Loop should pass to the Observer interface more specific Object
and Segment types instead of pb.Pointer.

After this change, there are still a couple of use cases that require
access to the pb.Pointer (hence we have it as a field in the
metainfo.Segment type):
1. Expired Deletion Service
2. Repair Service

It would require additional refactoring in these two services before we
are able to clean this.

Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f
2020-10-27 13:06:47 +00:00
Egon Elbre
9adde49e1a satellite/gracefulexit: ensure test doesn't timeout on failure
Change-Id: Id004f8a075592ffc19b12a9d666058b60cb7724d
2020-10-26 21:16:48 +02:00
paul cannon
76d4977b6a storagenode/gracefulexit: logic moved from worker to service
Change-Id: I8b12606a96b712050bf40d587664fb1b2c578fbc
2020-10-22 23:19:30 +00:00
Egon Elbre
0bdb952269 all: use keyed special comment
Change-Id: I57f6af053382c638026b64c5ff77b169bd3c6c8b
2020-10-13 15:13:41 +03:00
Michal Niewrzal
8649a00557 satellite/gracefulexit: replace Path []byte to `Key
metabaseSegmentKey` TransferQueueItem

We are unifying which name (and type) we are using for value we are
using to point to segment. We want to use `key` instead of `path`.
Dedicated type `metabase.SegmentKey` was created for this purposes also.
This change is doing refactoring around gracefulexit.

Change-Id: I90d51ff087b206179e61d5f1bc95f4709d76f917
2020-09-04 11:09:48 +00:00
Michal Niewrzal
9202295348 satellite/metainfo: replace ScopedPath with metabase.SegmentLocation
Change-Id: I7e89c9e8eaeae58be828a32ad47ed3028501f4c7
2020-09-04 10:06:52 +00:00
Michal Niewrzal
aa47e70f03 satellite/metainfo: use metabase.SegmentKey with metainfo.Service
Instead of using string or []byte we will be using dedicated type
SegmentKey.

Change-Id: I6ca8039f0741f6f9837c69a6d070228ed10f2220
2020-09-03 15:11:32 +00:00
Egon Elbre
3ca405aa97 satellite/orders: use metabase types as arguments
Change-Id: I7ddaad207c20572a5ea762667531770a56fd54ef
2020-08-28 15:52:37 +03:00
Qweder93
88ff8829a1 satellite/gracefulexit: RecvTimeout increased to 2h, so slow nodes stop receiving lot of fails and as a result DQ
Change-Id: Id4c8a394162ba368aeb573a927f825bf7250aa52
2020-08-24 18:59:24 +03:00
Egon Elbre
94a09ce20b all: add missing dots
Change-Id: I93b86c9fb3398c5d3c9121b8859dad1c615fa23a
2020-08-11 17:50:01 +03:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
Egon Elbre
5bdcd86fa7 ci: test benchmarks
This runs each benchmark for one iteration to ensure that they are
valid. Unfortunately, it does not give any useful metrics as output.

Change-Id: I68940398c8dd849aed656bd12656f48d5df10128
2020-07-10 13:26:49 +00:00
Qweder93
e52809d53e cmd/storagenode: add check if satellites available to gracefulexit
Change-Id: I8747507593d810bbdec0d140de0600ee147011c3
2020-06-10 13:38:36 +00:00
paul cannon
7395dd1e6e storagenode/gracefulexit: revalidate existing pieces
..before they are transferred to another node and submitted to the
satellite as successful piece transfers, because if we submit an invalid
signature, the node will be marked as a cheater and disqualified
immediately.

These signatures should have been validated when the piece was
originally stored, but bitrot does happen and needn't be cause for an
immediate DQ.

Change-Id: I8b0ebd5812ea8a2e60766005b7251fbb74ef7857
2020-05-28 09:50:14 -05:00
paul cannon
0c9a4a5e56 satellite/gracefulexit: better validation error messages
Change-Id: I14168dbeaf17302e5e853854956f8fbcb82a0900
2020-05-28 09:41:15 -05:00
Ethan
7014cf2083 satellite/gracefulexit: Change handleFailed to return nil if we can't get the pending transfer
https://storjlabs.atlassian.net/browse/SM-975

Change-Id: Ia1746b5e274e5011b6cd9f5a52f9a5faf703be51
2020-05-26 13:07:38 +00:00
Egon Elbre
94b2b315f7 storagenode/trust: refactor GetAddress to GetNodeURL
Most places now need the NodeURL rather than the ID and Address
separately. This simplifies code in multiple places.

Change-Id: I52621d8ca52296a8b5bf7afbc1001cf8bfb44239
2020-05-20 11:05:15 +00:00
Egon Elbre
ed627144ed all: use DialNodeURL throughout the codebase
Change-Id: Iaf9ae3aeef7305c937f2660c929744db2d88776c
2020-05-20 10:36:30 +00:00
Moby von Briesen
46df8c1977 satellite/gracefulexit: add log message when node fails validation for piece transfer
Change-Id: Ic5a53404ceb35003793aebc63637e7f8a58ef259
2020-05-13 16:58:50 +00:00
Egon Elbre
ec589a8289 all: fix comments about grpc
Change-Id: Id830fbe2d44f083c88765561b6c07c5689afe5bd
2020-05-11 13:05:34 +03:00
Egon Elbre
7d29f2e0d3 all: remove drpc wrappers
Change-Id: I45016f7d2a771dc00776196c1f531f3343e93b40
2020-05-11 08:20:34 +03:00
Egon Elbre
e6d5ce6b77 all: remove grpc
It seems everyone has migrated to drpc.

Change-Id: Ica6b2d0bdef68c6603083f2963458843eca71e9e
2020-05-10 06:36:09 +00:00
Egon Elbre
bcd93ee375 private/testplanet: add StopNodeAndUpdate
This was commonly used and code with it can be simplified.

Change-Id: I2f2b91f7de54269aee6ef027f97f9e8a7d222e39
2020-05-08 13:02:19 +00:00
Egon Elbre
4e94da3fda satellite/overlay: add feature flag for node selection cache
Also distinguish the purpose for selecting nodes to avoid potential
confusion, what should allow caching and what shouldn't.

Change-Id: Iee2451c1f10d0f1c81feb1641507400d89918d61
2020-05-06 16:13:47 +03:00
Egon Elbre
c630cf2490 storagenode/pieces: implement buffering for writing
Currently uploads can cause a lot of IOPS, reduce this by introducing a
in-memory buffer on-top of the file.

Change-Id: I5f4e3e01c0a36258271d180b922107de447bcb59
2020-05-04 06:01:32 +00:00
Jessica Grebenschikov
6a6427526b satellite/overlay: remove old updateaddress method
The UpdateAddress method use to be used when storage node's checked in with the Satellite, but once the contact service was created this method was no longer used. This PR finally removes it.

Change-Id: Ib3f83c8003269671d97d54f21ee69665fa663f24
2020-04-30 06:41:48 +00:00
Jess G
75b9a5971e
satellite: update log levels (#3851)
* satellite: update log levels

Change-Id: I86bc32e042d742af6dbc469a294291a2e667e81f

* log version on start up for every service

Change-Id: Ic128bb9c5ac52d4dc6d6c4cb3059fbad73f5d3de

* Use monkit for tracking failed ip resolutions

Change-Id: Ia5aa71d315515e0c5f62c98d9d115ef984cd50c2

* fix compile errors

Change-Id: Ia33c8b6e34e780bd1115120dc347a439d99e83bf

* add request limit value to storage node rpc err

Change-Id: I1ad6706a60237928e29da300d96a1bafa94156e5

* we cant track storage node ids in monkit metrics so lets use logging to track that for expired orders

Change-Id: I1cc1d240b29019ae2f8c774792765df3cbeac887

* fix build errs

Change-Id: I6d0ffe058e9a38b7ed031c85a29440f3d68e8d47
2020-04-15 12:32:22 -07:00
Egon Elbre
11a44cdd88 all: don't depend on gogo/proto directly
Change-Id: I8822dea0d1b7b99e0b828e0373a0308a42dde2be
2020-04-08 17:32:15 +00:00
Egon Elbre
cb781d66c7 satellite/overlay: optimize FindStorageNodes
Reduce the number of fields returned from the query.

Benchmark results in `satellite/overlay`:

benchstat before.txt after2.txt
name                               old time/op  new time/op  delta
SelectStorageNodes-32              7.85ms ± 1%  6.27ms ± 1%  -20.18%  (p=0.002 n=10+4)
SelectNewStorageNodes-32           8.21ms ± 1%  6.61ms ± 0%  -19.53%  (p=0.002 n=10+4)
SelectStorageNodesExclusion-32     17.2ms ± 1%  15.9ms ± 1%   -7.55%  (p=0.002 n=10+4)
SelectNewStorageNodesExclusion-32  17.8ms ± 2%  16.1ms ± 0%   -9.38%  (p=0.002 n=10+4)
FindStorageNodes-32                48.4ms ± 1%  45.1ms ± 0%   -6.69%  (p=0.002 n=10+4)
FindStorageNodesExclusion-32       79.2ms ± 1%  76.1ms ± 1%   -3.89%  (p=0.002 n=10+4)

Benchmark results from `satellite/overlay` after making them parallel:

benchstat before-parallel.txt after2-parallel.txt
name                               old time/op  new time/op  delta
SelectStorageNodes-32               548µs ± 1%   353µs ± 1%  -35.60%  (p=0.029 n=4+4)
SelectNewStorageNodes-32            562µs ± 0%   368µs ± 0%  -34.51%  (p=0.029 n=4+4)
SelectStorageNodesExclusion-32     1.02ms ± 1%  0.84ms ± 0%  -18.08%  (p=0.029 n=4+4)
SelectNewStorageNodesExclusion-32  1.03ms ± 1%  0.86ms ± 2%  -16.22%  (p=0.029 n=4+4)
FindStorageNodes-32                3.11ms ± 0%  2.79ms ± 1%  -10.27%  (p=0.029 n=4+4)
FindStorageNodesExclusion-32       4.75ms ± 0%  4.43ms ± 1%   -6.56%  (p=0.029 n=4+4)

Change-Id: I1d85e2764eb270f4c2b1998303ccfc1179d65b26
2020-03-30 18:36:23 +03:00
Egon Elbre
e1a443b04a private/testplanet: allow modifying created database
Instead of providing the database from outside to testplanet create it
inside and then allow wrapping and modifying it. This is more convenient
to use.

Change-Id: I9b8f69e6e0a19ff984b4e2bfe927c9100c77bc6c
2020-03-27 19:14:48 +00:00
Egon Elbre
e8f18a2cfe private/testplanet: expose storagenode and satellite Config
Change-Id: I80fe7ed8ef7356948879afcc6ecb984c5d1a6b9d
2020-03-27 17:01:25 +02:00
Yingrong Zhao
b7b19289d1 bump storj.io/common to latest
Change-Id: I16e337660ce8e1ef332cc842dbf4cfa067b9b98b
2020-03-25 09:08:40 -04:00
Bill Thorp
94c11c5212 satellite: remove some unnecessary UTC() calls
Fixes some easy cases of extraneous UTC() calls

Change-Id: I3f4c287ae622a455b9a492a8892a699e0710ca9a
2020-03-13 13:49:44 +00:00
Jess G
39cb821196
satellite/overlay: rm combinedcache, fix IP naming to be network (#3798)
* rn combinedcache, rm dns node lookup

Change-Id: I239f07211764b097d851230d8c81900a47756e9e

* excludeIPs -> excludedNetworks

Change-Id: Ifa6f44ab17457cdd5aff4cd5694296867c18b179

* use lowercase var name

Change-Id: I825aad2b718c71f455e747be18f8cabd02aabe55

* update Getnetwork name

Change-Id: I002a1b7bc6b4ef40159c0cd2b0ef209f80a9c503

* fix comments

Change-Id: Ibddf5b9ffa9d685af6c392d893db063ef18e45fa

* update comments with ipv6

Change-Id: I31758b7d4979e7c27d014668f4fb532ad838cda2

Co-authored-by: Stefan Benten <mail@stefan-benten.de>
2020-03-12 11:37:57 -07:00
Jessica Grebenschikov
803e2930f4 satellite: use IP for all uplink operations, use hostname for audit and repairs
My understanding is that the nodes table has the following fields:
- `address` field which can be a hostname or an IP
- `last_net` field that is the /24 subnet of the IP resolved from the address

This PR does the following:
1) add back the `last_ip` field to the nodes table
2) for uplink operations remove the calls that the satellite makes to `lookupNodeAddress` (which makes the DNS calls to resolve the IP from the hostname) and instead use the data stored in the nodes table `last_ip` field. This means that the IP that the satellite sends to the uplink for the storage nodes could be approx 1 hr stale. In the short term this is fine, next we will be adding changes so that the storage node pushes any IP changes to the satellite in real time.
3) use the address field for repair and audit since we want them to still make DNS calls to confirm the IP is up to date
4) try to reduce confusion about hostname, ip, subnet, and address in the code base

Change-Id: I96ce0d8bb78303f82483d0701bc79544b74057ac
2020-03-11 09:11:40 -07:00
Jennifer Johnson
1c1750e6be removes bandwidth limiting
On satellite, remove all references to free_bandwidth column in nodes table.
On storage node, remove references to AllocatedBandwidth and MinimumBandwidth and mark as deprecated.

Protobuf message, NodeCapacity, is left intact for backwards compatibility.
Once this is released to all satellites, we can drop the column from the DB.

Change-Id: I2ff6c6537fc9008a0c5588e951afea58ede85838
2020-03-04 14:04:00 +00:00
Egon Elbre
64330c55b3 all: use pbgrpc
common/pb moved grpc to a separate package common/pb/pbgrpc.
This updates this repository to use it.

Change-Id: I2de2a190688871cf9cb61f7ea511f8a01e264e4e
2020-02-26 21:27:47 +02:00
Egon Elbre
5342dd9fe6 go.mod: update uplink
Change-Id: I867a6a1eef8aa5d60bb676e5112b98c4192ce811
2020-02-21 16:08:12 +02:00
littleskunk
76849558cb satellite/gracefulexit: increase performance and tolerate higher error
rate

Graceful exit is very slow at the moment. Over the last couple days we
increase the batch size on Stefans satellite to 1000 but as a side
effect the error rate was increased. With a batch size of 500 the error
rate looks stable.
This PR will increase the default to batch size to 300. Graceful exit
will still be painful slow but at least it will be a bit faster. At the
same time this PR also increases the number of errors we tolerate. We
don't want to DQ slow storage nodes just because they didn't finish all
300 transfers in time. We want to give them more retries.

Change-Id: I92e3f99e116d4988457d8b902a88e85ed1bcc1a7
2020-02-12 11:40:15 +00:00
Cameron Ayer
b22bf16b35 satellite/overlay: add config flag for node selection free disk requirement
Currently SNs report their free disk space once per hour. If a node
becomes full, it has to wait until the next contact cycle begins to
report; all the while receiving and failing upload requests. By increasing
the minimum required disk space, we can give the storage nodes more time
to report their space before the completely fill up. This change goes
hand-in-hand with another change we want to implement: trigger capacity
report on SN immediately upon falling below threshold.

Change-Id: I12f778286c6c3f582438b0e2949765ac43325e27
2020-02-11 18:08:25 +00:00
Michal Niewrzal
426c8eb31a private/testplanet: add DeleteBucket method for uplink
New method added to be able to delete easily bucket during tests.

Change-Id: Iaae89618cc676ddbbbd4b0df2eeacd143ea6f3c2
2020-02-11 15:58:13 +00:00
Jeff Wendling
7999d24f81 all: use monkit v3
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.

graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.

it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.

Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
2020-02-05 23:53:17 +00:00
Egon Elbre
8dea4f52db satellite: add control panel
Change-Id: Id48246e9bcd4c6ec643277fe740937b2e42ad85b
2020-01-30 08:06:43 -05:00
paul cannon
8ce9ce7f0f satellite/gracefulexit: wait for errgroup to return
credit to Yingrong

Change-Id: I538371040d4dcdf6e943c61e8454320fd57b7526
2020-01-28 19:26:43 +00:00
Jeff Wendling
26e33e7e07 satellite/gracefulexit: make orders with right bucket id and action
paths are organized as follows:

    project_id/segment_index/bucket_name/encrypted_key

so by picking parts[0] and parts[1], we were using the segment
index instead of the bucket name, causing bandwidth to be
accounted for incorrectly. additionally, we were using the
PUT action instead of the PUT_GRACEFUL_EXIT action, causing
the data to be charged incorrectly. we use PUT_REPAIR for
now because nodes won't accept uploads with PUT_GRACEFUL_EXIT
and our tables need migrations to handle rollups with it.

Change-Id: Ife2aff541222bac930c35df8fcf76e8bac5d60b2
2020-01-24 19:27:38 +00:00
Michal Niewrzal
6502454947 satellite/metainfo: move RS configuration to satellite
With this change RS configuration will be set on satellite. Uplink with
get RS values with BeginObject request and will use it. For backward
compatibility and to avoid super large change redundancy scheme stored
with bucket is not touched. This can be done in future.

Change-Id: Ia5f76fc10c37e2c44e4f7b8754f28eafe1f97eff
2020-01-22 09:33:53 +00:00
Egon Elbre
0c0b47823d satellite: use require.WithinDuration
Noticed that assert/require has WithinDuration for comparing
time.Time-s.

Change-Id: Ia340896443f610d38799b7ef245b5775eecfc92b
2020-01-21 19:43:53 +02:00
Egon Elbre
f3b4bf2b7c satellite/satellitedb/satellitedbtest: pass ctx as an argument
ctx is created in most tests, instead pass in as argument
to reduce code duplication.

Change-Id: I466c51c008392001129c8b007c9d6b3619935ac4
2020-01-20 16:35:42 +02:00
Egon Elbre
a4026f97b8 satellite: fix test time comparisons
Correct way to compare time that may have an error is to use InDelta.

Change-Id: I0140892119c44c63fa042bbc7292ab91bb33a350
2020-01-20 10:17:20 +00:00
Egon Elbre
d5438036b5 {satellite,storagnode}/gracefulexit: reduce logging
Change-Id: I9f274ede77a582fc43ef14a47bf9341d4e3083df
2020-01-19 22:36:13 +02:00
Yingrong Zhao
76ee8a1b4c satellite: remove UptimeReputation configs from codebase
With the new storage node downtime tracking feature, we need remove current uptime reputation configs: UptimeReputationAlpha, UptimeReputationBeta, and
UptimeReputationDQ. This is the first step of removing the uptime
reputation columns from satellitedb

Change-Id: Ie8fab13295dbf545e33aeda0c4306cda4ba54e36
2020-01-08 18:54:15 +00:00
Egon Elbre
082ec81714
uplink: move to storj.io/uplink (#3746) 2020-01-08 15:40:19 +02:00
Natalie Ventura Villasana
1cb0f80a8d satellite/gracefulexit: dq node on exit fail
Disqualifies a node when the node fails to complete a graceful
exit.

Adds a new DisqualifyNode method to the overlay cache, since there
wasn't an existing method to disqualify a node but do nothing else
to its stats.

Adds checks to existing tests to make sure that a storage node that
fails a graceful exit is marked as disqualified in the overlay
cache.

https: //storjlabs.atlassian.net/browse/V3-3342
Change-Id: I4d554a519ab59db31ad3b8e28764c8683a6e3888
2020-01-06 19:16:26 -05:00
Egon Elbre
2680bae88c private/testplanet: remove dependency to uplink
Remove direct dependency on uplink.RSConfig, this simplifies
moving the config file without introducing weird dependencies.

Change-Id: I7fd2a145401e0205d7047631df9d2810241efeec
2020-01-02 09:40:46 +00:00
Natalie Ventura Villasana
aa3e183c2e
satellite/gracefulexit: add ge eligibility check
Adds check to see if storage nodes are eligible to initiate
graceful exit, by checking their CreatedAt date and seeing if
their "age" is greater than the new config value:
NodeMinAgeInMonths
The default for this value is 6 months for now.

https://storjlabs.atlassian.net/browse/V3-3357

Change-Id: Ib807ab8987ddb5a38a27a83886490f73fe8c5816
2019-12-31 09:31:58 -05:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
Egon Elbre
d55288cf68 pkg/rpc: replace methods with direct calls to pb
Change-Id: I8bd015d8d316a2c12c1daceca1d9fd257f6f57bc
2019-12-22 17:12:43 +02:00
Ethan
b959ccbae6 satellite/gracefulexit: Use proper rpc status codes for disqualified nodes and too many connections
Change-Id: I41380026175e7678c7cd3d44211de8eb86ce4d0f
2019-12-20 19:05:28 +00:00
Egon Elbre
afe05edff2 {storagenode,satellite}/gracefulexit: ensure workers finish their work
Fixes a data race caused by not waiting for workers to finish
before shutting down. Currently this ended up failing logging
because it was closed when test tried to write to it.

Change-Id: I074045cd83bbf49e658f51353aa7901e9a5d074b
2019-12-17 17:21:52 +02:00
Egon Elbre
7a36507a0a private/testcontext: ensure we call cleanup everywhere
Change-Id: Icb921144b651611d78f3736629430d05c3b8a7d3
2019-12-17 14:16:09 +00:00
littleskunk
6ab72a6e79 satellite/gracefulexit: enable graceful exit in production
Change-Id: I526ce4a4de9c318f1333b793e3167f5f86d65adc
2019-12-09 17:32:34 +00:00
Ethan Adams
bba92911cc
fix calcuation of durability ration (#3656) 2019-11-26 12:04:48 -05:00
Maximillian von Briesen
1339252cbe
satellite/gracefulexit: refactor concurrency (#3624)
Update PendingMap structure to also handle concurrency control between the sending and receiving sides of the graceful exit endpoint.
2019-11-21 17:03:16 -05:00
Egon Elbre
ee6c1cac8a
private: rename internal to private (#3573) 2019-11-14 21:46:15 +02:00
Natalie Villasana
1a9757a7f2 satellite/gracefulexit: add count for order limits sent from satellite to exiting node (#3544) 2019-11-13 09:54:50 -05:00
Yingrong Zhao
69b0ae02bf
satellite/gracefulexit: separate functional code in endpoint (#3476) 2019-11-08 13:57:51 -05:00
Yingrong Zhao
6331f839ae
satellite/gracefulexit: not allow disqualified node to graceful exit (#3493) 2019-11-07 12:19:34 -05:00
Ethan Adams
f3dccb56b1
satellite/gracefulexit: Check if pointer has been overwritten or deleted before sending transfer message. (#3481) 2019-11-07 11:13:05 -05:00
Natalie Villasana
68a7790069 satellite/gracefulexit: select new node filtered by Distinct IP (#3435) 2019-11-06 16:38:51 -05:00
Egon Elbre
cc032d3151 satellite/metainfo: fix some uses of metainfo.Delete (#3513)
* satellite/metainfo: rename Delete to UnsynchronizedDelete

* fix deletes

* make db private

* fix typos

* also verify on commit object
2019-11-06 18:02:14 +01:00
littleskunk
7eb6724c92
logging: unify logging around satellite ID, node ID and piece ID (#3491)
* logging: unify logging around satellite ID, node ID and piece ID

* unify segment index
2019-11-05 22:04:07 +01:00
Ethan Adams
2eb0cc56fe
satellite/gracefulexit: Check if node already has a piece in the pointer (#3434) 2019-11-05 14:13:45 -05:00
Maximillian von Briesen
78fedf5db3
satellite/gracefulexit: handle piece not found messages from storagenode (#3456)
* If a node claims to fail a transfer due to piece not found, remove that node from the pointer, delete the transfer queue item.
* If the pointer is piece hash verified, penalize the node. Otherwise, do not penalize the node.
2019-11-05 10:04:39 -05:00
Maximillian von Briesen
f9df0ea591
satellite/gracefulexit: check for unknown error in graceful exit disable test
Allow error in graceful exit disable test to be rpcstatus.Unimplemented (grpc) or rpcstatus.Unknown (drpc)
2019-11-01 17:21:30 -04:00
Maximillian von Briesen
590312970d satellite/gracefulexit: add flag for enabling/disabling graceful exit on the satellite (#3437) 2019-11-01 16:21:24 +02:00
Ethan Adams
43103ae13f
lower storage node counts in tests (#3427) 2019-10-31 10:57:54 -04:00
Natalie Villasana
4878135068
satellite/gracefulexit, storagenode/gracefulexit: add timeouts (#3407) 2019-10-30 13:40:57 -04:00
Egon Elbre
65a8e0bcbc
{satellite,storagenode}/gracefulexit: clearer log messages (#3413) 2019-10-30 10:21:27 +02:00
Maximillian von Briesen
54594e79c3 satellite/gracefulexit: add metrics on satellite for graceful exit (#3355) 2019-10-29 16:22:20 -04:00
Maximillian von Briesen
cd0940724c satellite/gracefulexit: use sync2 cycle inside satellite graceful exit endpoint (#3394) 2019-10-29 14:40:42 -04:00
Maximillian von Briesen
cd3d3850f9 satellite/gracefulexit: only allow one connection per node to graceful exit endpoint (#3357) 2019-10-29 13:23:17 -04:00
Ethan Adams
7c2daa4dd9
Use original pointer when calling UpdatePieces (#3397) 2019-10-28 14:43:46 -04:00
Ethan Adams
9905f2c61e add piece num to transfer queue PK (#3390) 2019-10-28 11:08:33 -04:00
Yingrong Zhao
292e64ee2f
satellite/gracefulexit: check duplicate node id before update pointer (#3380)
* check duplicate node id before update pointer

* add test for transfer failure when pointer already contain the receiving node id

* check exiting and receiving nod are still in the pointer

* check node id only exists once in a pointer

* return error if the existing node doesn't match with the piece info in the pointer

* try to recreate the issue on jenkins

* should not remove exiting node piece in test

* Update satellite/gracefulexit/endpoint.go

Co-Authored-By: Maximillian von Briesen <mobyvb@gmail.com>

* Update satellite/gracefulexit/endpoint.go

Co-Authored-By: Maximillian von Briesen <mobyvb@gmail.com>
2019-10-27 14:20:22 -04:00
Maximillian von Briesen
a4e618fd1f handle context cancelled in satellite graceful exit endpoint (#3388) 2019-10-27 10:14:25 -04:00
Ethan Adams
e54d290d2e satellite/gracefulexit: Add signatures for success/failed exit finished messages. (#3368)
* add signatures, fix process loop bug, move delete to on success

* added tests for signatures

* PR comment updates

* fixed setting reason by default.

* updates for PR comments

* added signed failure when verificationi fails

* moved to sign_test

* fix panic

* removed testplanet from test
2019-10-25 16:36:26 -04:00
Maximillian von Briesen
6df4d7bc73
storagenode/gracefulexit + satellite/gracefulexit: add storagenode-side transfer validation (#3371)
* Make the exiting node check piece hashes, piece IDs, and piece hash signatures before relaying successful transfer data to the satellite.
* Enable immediate graceful exit failure for "successful" transfers that fail satellite-side validation.
* Move transfer piece logic in storagenode worker to separate function (to make the worker easier to understand)
2019-10-25 13:16:20 -04:00
Ethan Adams
f0caa6ce5e
satellite:gracefulexit: Update pointer after success (#3369) 2019-10-25 11:14:22 -04:00
Natalie Villasana
696c567e89
satellite/gracefulexit: add piece hash validation for successful transfer (#3313) 2019-10-24 15:38:40 -04:00
Yingrong Zhao
fa1ac24e19
satellite/gracefulexit: add failure threshold check (#3329)
* add overall failure percentage check and inactive time frame check before sending a response to sno

* update comment

* delete node from transfer queue if it has been inactive for too long

* fix linting error

* add test config value

* fix nil pointer

* add config value into testplanet

* add unit test for overall failure threshold

* move timeframe threshold to chore

* update protolock

* add chore test

* add per peiece failure count logic

* change config name from EndpointMaxFailures to MaxFailuresPerPiece

* address comments

* fix linting error

* add error handling for no row returned from progress table

* fix test for graceful exit chore on storagenode

* fix typo InActive -> Inactive

* improve readability for failure threshold calculation

* update config lock

* change error handling for GetProgress in graceful exit endpoint on the satellite side

* return proper rpc error in endpoint

* add check in chore test for checking finish timestamp and queue
2019-10-24 12:24:42 -04:00
Maximillian von Briesen
abb567f6ae
cmd/satellite: add graceful exit reports command to satellite CLI (#3300)
* update lock file and add comment

* add created at and bytes transferred

* cleanup

* rename db func to GetGracefulExitNodesByTimeFrame

* fix flag

* split into two overlay functions

* := to =

* fix test

* add node not found error class

* fix overlay test

* suggested test changes

* review suggestions

* get exit status from overlay.Get()

* check rows.Err

* fix panic when ExitFinishedAt is nil

* fix comments in cmdGracefulExit
2019-10-22 21:06:01 -04:00
Ethan Adams
3e0d12354a
storagenode/gracefulexit: Implement storage node graceful exit worker - part 1 (#3322) 2019-10-22 16:42:21 -04:00
Natalie Villasana
22d0f89941
satellite/gracefulexit: use zap.Stringer instead of zap.String (#3299) 2019-10-17 10:29:35 -04:00
Ethan Adams
37ab84355f
satellite/gracefulexit: protobuf field name updates (#3284)
rename piece_id to original_piece_id
2019-10-15 15:59:12 -04:00
Ethan Adams
78ccf14837 fix interface and EOF check (#3251) 2019-10-12 07:06:20 -06:00