Commit Graph

4678 Commits

Author SHA1 Message Date
paul cannon
8616fc146d satellite/orders: send IPs for graceful exit
Storage nodes undergoing Graceful Exit have up to now been receiving
hostnames for all other storage nodes they need to contact when
transferring pieces. This adds up to a lot of DNS lookups, which
apparently overwhelm some home routers. There does not seem to be any
need for us to send hostnames for graceful exit as opposed to IP
addresses; we already use IP addresses (as given by the last_ip_port
column in the nodes table) for all the GET and PUT orders we send out.

This change causes IP addresses to be used instead.

I started trying to construct a test to ensure that the behavior
changed, but it was rabbit-holing, so I've begun to feel that maybe this
change doesn't require one; it is a very simple change, and very much of
the same nature as what we already do for IPs in CreateGetOrderLimits
and CreatePutOrderLimits (and others).

Change-Id: Ib2b5ffe7a9310e9cdbe7464450cc7c934fa229a1
2020-11-04 00:17:20 +00:00
Egon Elbre
c55c23f81f private/testplanet: add STORJ_TESTPLANET_ABSTIME
Allow setting STORJ_TESTPLANET_ABSTIME=1 to use absolute time in
testplanet logs.

Change-Id: I4df5dfc1fc055d9726aed65242ab71338550e671
2020-11-03 15:44:18 +02:00
Cameron Ayer
dc67ce74c9 satellite: remove IsUp field from overlay.UpdateRequest
With the new overlay.AuditOutcome type for offline audits, the
IsUp field is redundant. If AuditOutcome != AuditOffline, then
the node is online.

In addition to removing the field itself, other changes needed
to be made regarding the relationship between 'uptime' and 'audits'.
Previously, uptime and audit outcome were completely separated. For
example, it was possible to update a node's stats to give it a
successful/failed/unknown audit while simultaneously indicating that
the node was offline by setting IsUp to false. This is no longer possible
under this changeset. Some test which did this have been changed slightly
in order to pass.

Also add new benchmarks for UpdateStats and BatchUpdateStats with different
audit outcomes.

Change-Id: I998892d615850b1f138dc62f9b050f720ea0926b
2020-11-02 15:34:17 -05:00
Egon Elbre
0c23b12038 private/testplanet: use relative time logging
Instead of printing RFC3339 timestamp, we'll print relative time
since the creation of the testplanet.

Before:

    logger.go:130: 2020-11-02T14:54:53.864+0200 DEBUG   versioncontrol   addr= 127.0.0.1:30904

After:

    log.go:54: 00:00.002        DEBUG   versioncontrol   addr= 127.0.0.1:30945

Change-Id: Ifa423f9d54d4e7c583d9290fe36a791d28166f8f
2020-11-02 17:53:18 +00:00
Egon Elbre
7183dca6cb all: fix defers in loop
defer should not be called in a loop.

Change-Id: Ifa5a25a56402814b974bcdfb0c2fce56df8e7e59
2020-11-02 15:06:38 +02:00
Egon Elbre
fd8e697ab2 {satellite,storagenode}/internalpb: use specific package name
Ensure we don't register types with the same name into protobuf.

Change-Id: I53d025863fff8c91a067ca5819befa87eb5e35bb
2020-10-30 17:31:08 +02:00
Egon Elbre
9674ce40e6 mod: bump dependencies
Change-Id: I61beaf9793609f7614478cfa604b2721d9010b52
2020-10-30 16:17:55 +02:00
Michal Niewrzal
0205f0d807 satellite/metainfo: fix usage of types from internalpb
After moving SatStreamID and SatSegmentID from common I missed changing
some methods in metainfo endpoint. This change is a fix for that.

Change-Id: I34e121fce47371ee4cfd92cce03809520b68859f
2020-10-30 16:03:45 +02:00
Egon Elbre
77c4f99fa0 satellite/internalpb: move delegated_repair.proto
Change-Id: If4f37c52b151e09cf35d2145b463ef1e9ab529ae
2020-10-30 15:31:32 +02:00
Egon Elbre
11338e9beb satellite/internalpb: move audithistory.pb
Change-Id: I8eee84d49ed90459168ddaf04ae57f790c2a22c4
2020-10-30 15:30:11 +02:00
Egon Elbre
1903b15474 storagenode/internalpb: move gracefulexit.proto
Change-Id: Ia3614846ed49a39c8f39331516d16d45a695240b
2020-10-30 15:24:56 +02:00
Egon Elbre
cda67a659a storagenode/internalpb: move inspector.proto
Change-Id: I951379c3b2ff00d1bc09d6a49c026a7e723432d6
2020-10-30 14:51:26 +02:00
Egon Elbre
7ce372c686 satellite/internalpb: add inspectors
Change-Id: Ib688e43d05135c0c31ae95df533f1e4535ea396a
2020-10-30 13:28:17 +02:00
Egon Elbre
004e610d0f satellite/internalpb: move datarepair.pb to internal
Change-Id: If901d9ff4e5ee6715b963eeeb46513a602a44b3d
2020-10-30 13:28:14 +02:00
Michal Niewrzal
8f26f66da0 internalpb: move satellite specific protobuf types storj/storj
We have some types that are only valid for satellite usage. Such types
are SatStreamID and SatSegmentID. This change moves those types to
storj/storj and adds basic infrastructure for generating code.

Change-Id: I1e643844f947ce06b13e51ff16b7e671267cea64
2020-10-30 08:49:16 +00:00
Qweder93
f5ba8b8009 storagenode/suspensions: added offline-suspension notificatio chore + tests
Change-Id: I2521cd2e7d08a1dd379e717a554a026c7508c18f
2020-10-29 19:44:22 +02:00
Egon Elbre
e0dca4042d all: add pprof labels for debugger
By using pprof.Labels debugger is able to show service/peer names in
goroutine names.

Change-Id: I5f55253470f7cc7e556f8e8b87f746394e41675f
2020-10-29 15:10:07 +00:00
Qweder93
624255e8ba storagenode/secret: db tests added, small renaming fixes added
Change-Id: I7eae1a9a64c20a39c97e81fa741cfc9b9e1e615a
2020-10-29 14:23:04 +02:00
Egon Elbre
caefde6b32 private/{dbutil,tagsql}: pass ctx to database opening
Database opening usually dial and hence we should pass ctx to them.

Change-Id: Iaa2875981570d83e65be3710f841cf30349f807b
2020-10-29 10:51:29 +00:00
Egon Elbre
e3985799a1 storage/{cockroachkv,postgreskv}: add ctx to opening
Database opening usually dial and hence we should pass ctx to them.

Change-Id: Iecf41241aaa94d54506cbc80b0e53449848d8819
2020-10-29 10:49:08 +00:00
Egon Elbre
89ce1fe626 storagenode/storagenodedb: add ctx to OpenNew and OpenExisting
Database opening usually dial and hence we should pass ctx to them.

Change-Id: I9160ae95829f22f347bd525904898a47279a7427
2020-10-29 09:52:37 +02:00
Egon Elbre
096445bc1c certificate/authorization: add ctx to OpenDB
Database opening usually dial and hence we should pass ctx to them.

Change-Id: I1362783568f66383c46f07be7549327bb1aaa39e
2020-10-29 09:46:23 +02:00
Egon Elbre
d0beaa4a87 pkg/revocation: pass ctx into opening the database
Opening a databases requires ctx, this is first step to passing ctx
to the appropriate level.

Change-Id: I12700f39a320206d8a2a4e054452319f8585b44b
2020-10-29 07:15:36 +00:00
Egon Elbre
9b2e00a38b satellite: pass ctx into satellitedb.Open
Opening a database requires ctx, this is first step to passing ctx
to the appropriate level.

Change-Id: Ic303e69f868ef3449ae36377937a29670cf635e2
2020-10-29 06:38:37 +00:00
littleskunk
ed1f6d7973
satellite/config: move repair override from config to default (#3958)
Co-authored-by: Igor <38665104+ihaid@users.noreply.github.com>
2020-10-28 17:24:39 +02:00
Michal Niewrzal
cb1fea87f8 satellite/metainfo: mark unused methods as 'not implemented'
Some of metainfo endpoint methods are not used but we still have
implementation there. This change removes unused code and returns
unimplemented error for those methods.

Change-Id: I74e75e0caff76a4f5d119ee989b687b4e9d6e6f9
2020-10-28 12:42:47 +00:00
Michal Niewrzal
1adb497a71 satellite/metainfo: remove unused code
This change removed unused 'createRequests' struct. As far I remember it
was used to help validating old metainfo beginObject/commitObject flow.

Change-Id: I0f139b9934196d73f26eafa347ba5605722f3a55
2020-10-28 12:40:14 +01:00
Egon Elbre
76f4619a9c {satellite,storagenode}/gracefulexit: ensure client is closed
Change-Id: I576a955a5578caf7fcbee832beca28cef2b0c83e
2020-10-27 23:27:07 +02:00
Jessica Grebenschikov
99c88efbbf scripts/tests: fix gateway tests
Change-Id: I9a23ef08794043ad615066ae5929df9ff3a02d69
2020-10-27 08:21:28 -07:00
Kaloyan Raev
92a2be2abd satellite/metainfo: get away from using pb.Pointer in Metainfo Loop
As part of the Metainfo Refactoring, we need to make the Metainfo Loop
working with both the current PointerDB and the new Metabase. Thus, the
Metainfo Loop should pass to the Observer interface more specific Object
and Segment types instead of pb.Pointer.

After this change, there are still a couple of use cases that require
access to the pb.Pointer (hence we have it as a field in the
metainfo.Segment type):
1. Expired Deletion Service
2. Repair Service

It would require additional refactoring in these two services before we
are able to clean this.

Change-Id: Ib3eb6b7507ed89d5ba745ffbb6b37524ef10ed9f
2020-10-27 13:06:47 +00:00
Cameron Ayer
bb7be23115 satellite/{audit,overlay,satellitedb}: enable reporting offline audits
- Remove flag for switching off offline audit reporting.
- Change the overlay method used from UpdateUptime to BatchUpdateStats, as this
is where the new online scoring is done.
- Add a new overlay.AuditOutcome type: AuditOffline. Since we now use the same
method to record offline audits as success, failure, and unknown, we need to
distinguish offline audits from the rest.

Change-Id: Iadcfe10cf13466fa1a1c2dc542db8994a6423355
2020-10-27 10:44:46 +00:00
Moby von Briesen
2fbb4095b2 storagenode/orders/ordersfile: Handle remaining pb.Unmarshal errors
Missed one case of Unmarshal in the previous commit for V0 files (0f4e4969b7)
In V1, unmarshalling was being attempted before the checksum was
verified, so this commit moves those calls to the end of the V1 ReadOne
function.

Change-Id: Ic0b49f0bbc91fb61fb28af6003060994d0af22ed
2020-10-26 20:27:05 +00:00
Egon Elbre
9adde49e1a satellite/gracefulexit: ensure test doesn't timeout on failure
Change-Id: Id004f8a075592ffc19b12a9d666058b60cb7724d
2020-10-26 21:16:48 +02:00
Egon Elbre
22ec940f7e storage/filestore: defer closing
Change-Id: Iccd6d1c64c1b7a6eecaa4c675bb7b554b381d0f5
2020-10-26 21:09:58 +02:00
VitaliiShpital
2f4e383997 web/satellite: project edit page: back button click area decreased
WHAT:
decreased back button's clickable area to avoid missclicking

WHY:
bug fixing

Change-Id: Ia38ff076b4204b00822cf97c7cca52dbef38baf5
2020-10-26 18:02:27 +00:00
Moby von Briesen
53ba01b1f1 storagenode/orders/ordersfile/v0.go: Return ErrEntryCorrupt on pb.Unmarshal failure
In V0 orders files, unexpected EOF is correctly treated as a file
corruption, but pb.Unmarshal can also fail, and this is not treated as a
file corruption. This commit fixes that.

Change-Id: I6b446a10f4b1a5a44e832cbcc9bf8b2548cfcfeb
2020-10-26 17:38:22 +00:00
Jessica Grebenschikov
f5880f6833 satellite/orders: rollout phase3 of SettlementWithWindow endpoint
Change-Id: Id19fae4f444c83157ce58c933a18be1898430ad0
2020-10-26 14:56:28 +00:00
Ethan
9a29ec5b3e Add index to graceful_exit_transfer_queue table
This fixes a slow query that was taking up to 4 seconds in production

SELECT node_id, path, piece_num, root_piece_id, durability_ratio, queued_at, requested_at, last_failed_at, last_failed_code, failed_count, finished_at, order_limit_send_count
	FROM graceful_exit_transfer_queue
	WHERE node_id = '[redacted]'
	AND finished_at is NULL
	AND last_failed_at is NULL
	ORDER BY durability_ratio asc, queued_at asc LIMIT 300 OFFSET 0;

Change-Id: Ib89743ca35f1d8d0a1456b20fa08c683ebdc1549
2020-10-26 14:47:48 +00:00
Ivan Fraixedes
4b61ca638b
satellite/console/consoleweb/consoleapi: Fix & add test DeleteAccount
Fix the DeleteAccount handler to return 501 HTTP status code because
it's what corresponds for a "Not Implemented" status.

Add a black box test for the DeleteAccount to ensure that always return
an error response because, at this time, we don't allow to delete
accounts through the API.

This test was not added to the corresponding commit
https://review.dev.storj.io/c/storj/storj/+/2712 due to the rush to
fix it.

Change-Id: Ibcf09e2ec52f182a8a580d606c457328d94c8b60
2020-10-23 09:14:50 +02:00
paul cannon
76d4977b6a storagenode/gracefulexit: logic moved from worker to service
Change-Id: I8b12606a96b712050bf40d587664fb1b2c578fbc
2020-10-22 23:19:30 +00:00
Ivan Fraixedes
9abdcc05e5 satellite/console/consoleweb/consoleapi: report err to monkit
Report the "Not Implemented" error response returned by DeleteAccount
API handler to monkit.

Change-Id: I17e319639c458cbe803b65b5a34111b8f74daece
2020-10-22 17:07:13 +00:00
Yingrong Zhao
746cbfc659 scripts/tests/rollingupgrade: test current release version on master
branch

Currently, we are testing previous release version upgrading to latest
master on each master build
However, this behavior is only desired when the test is running on a
release branch.

Change-Id: Iaeb66f44951c9e4934ca3c8316d1e490d7958239
2020-10-22 11:45:54 -04:00
NickolaiYurchenko
d6b9563e56 web/satellite: disposed removed from historical gross total, total+surge calculation changed
Change-Id: If69c251bd12e0a2141ea0061353ddcc7ee618aaf
2020-10-22 17:06:11 +03:00
Ivan Fraixedes
46b12c96bd satellite/console/consoleweb/consoleql: Fix typo
Fix a typo in the GraphQL mutation testing function.

Change-Id: I1c474795bfbaa3151b04cb768dfc506e654557ab
2020-10-22 13:30:20 +00:00
Kaloyan Raev
1f386db566
cmd/satellite: remove metainfo commands (#3955) 2020-10-22 13:33:09 +03:00
Kaloyan Raev
1aeb14e65e satellite/audit: do not delete expired segments
A year ago we made the audit service deleting expired segments.
Meanwhile, we introduced an expired deletetion sub-service in the
metainfo service which sole purpose is deleting expired segments.

Therefore, now we are removing this responsibility from the audit
service. It will continue to avoid reporting failures on expired
segments, but it would not delete them anymore.

We do this to cleanup responsibilities in advance of the metainfo
refactoring.

Change-Id: Id7aab2126f9289dbb5b0bdf7331ba7a3328730e4
2020-10-22 08:24:16 +00:00
Jessica Grebenschikov
89bdb20a62 storagenodedb/orders: select unsent satellite with expiration
In production we are seeing ~115 storage nodes (out of ~6,500) are not using the new SettlementWithWindow endpoint (but they are upgraded to > v1.12).

We analyzed data being reported by monkit for the nodes who were above version 1.11 but were not successfully submitting orders to the new endpoint.
The nodes fell into a few categories:
1. Always fail to list orders from the db; never get to try sending orders from the filestore
2. Successfully list/send orders from the db; never get to calling satellite endpoint for submitting filestore orders
3. Successfully list/send orders from the db; successfully list filestore orders, but satellite endpoint fails (with "unauthenticated" drpc error)

The code change here add the following to address these issues:
- modify the query for ordersDB.listUnsentBySatellite so that we no longer select expired orders from the unsent_orders table
- always process any orders that are in the ordersDB and also any orders stored in the filestore
- add monkit monitoring to filestore.ListUnsentBySatellite so that we can see the failures/successes

Change-Id: I0b473e5d75252e7ab5fa6b5c204ed260ab5094ec
2020-10-21 15:02:23 +00:00
paul cannon
360ab17869 satellite/audit: use LastIPAndPort preferentially
This preserves the last_ip_and_port field from node lookups through
CreateAuditOrderLimits() and CreateAuditOrderLimit(), so that later
calls to (*Verifier).GetShare() can try to use that IP and port. If a
connection to the given IP and port cannot be made, or the connection
cannot be verified and secured with the target node identity, an
attempt is made to connect to the original node address instead.

A similar change is not necessary to the other Create*OrderLimits
functions, because they already replace node addresses with the cached
IP and port as appropriate. We might want to consider making a similar
change to CreateGetRepairOrderLimits(), though.

The audit situation is unique because the ramifications are especially
powerful when we get the address wrong. Failing a single audit can have
a heavy cost to a storage node. We need to make extra effort in order
to avoid imposing that cost unfairly.

Situation 1: If an audit fails because the repair worker failed to make
a DNS query (which might well be the fault on the satellite side), and
we have last_ip_and_port information available for the target node, it
would be unfair not to try connecting to that last_ip_and_port address.

Situation 2: If a node has changed addresses recently and the operator
correctly changed its DNS entry, but we don't bother querying DNS, it
would be unfair to penalize the node for our failure to connect to it.

So the audit worker must try both last_ip_and_port _and_ the node
address as supplied by the SNO.

We elect here to try last_ip_and_port first, on the grounds that (a) it
is expected to work in the large majority of cases, and (b) there
should not be any security concerns with connecting to an out-or-date
address, and (c) avoiding DNS queries on the satellite side helps
alleviate satellite operational load.

Change-Id: I9bf6c6c79866d879adecac6144a6c346f4f61200
2020-10-21 13:34:40 +00:00
Yaroslav Vorobiov
25df79a6bf storagenode-updater: check binary version on self-update
Check binary version on self-update instead of current process
version to prevent updating already updated binary.
Add info logs to report current version of service beeing
updated.

Change-Id: Id22dee188a99d6d45db925104786f49f5d3a61ae
2020-10-21 10:54:26 +00:00
Ivan Fraixedes
979ee762ba
satellite/console/consoleweb: Fix typo in method name
Fix a typo in the graphQL handler method name.

Change-Id: I038c7783073f7bed95353f56a8a24520c724a5b6
2020-10-21 11:58:37 +02:00