Move storagnode/console caching headers to private/web. Also,
start using them in multinode/console/server.
Change-Id: I1f0f3c9833a183476009737cece515ae7537fb83
This change includes storagenode QUIC status on SNO dashboard.
If disabled, it displays warning for SNOs to foward their
UDP port for quic.
Change-Id: I8d28c9c0f5f1e90d80b7c18b9e1e7b78c5e45609
* storagenode/piecestore: Add upload and download metrics for Grafana alerts
* storagenode/piecestore: group download metrics by piece action
Change-Id: Ib2a42b60c56c3f581915d512f4907c8db71e4624
Co-authored-by: Clement Sam <clement@storj.io>
When something happens during opening or closing then it wasn't clear
which database had the issue.
Fixesstorj/storj#4271
Change-Id: I34c8bae79a5b41ccd9b40aa8d836805f8c1a573c
When a satellite operator runs restore trash command, it's hard for node
operator to know whether their node has received the request. Adding
logs so this process is visible from node operators perspective.
Change-Id: I08c670b1b0c5971e6b73e038a0935cb0caaca63d
Remove the logic associated to the old transfer queue.
A new transfer queue (gracefulexit_segment_transfer_queue) has been created for migration to segmentLoop.
Transfers from the old queue were not moved to the new queue.
Instead, it was still used for nodes which have initiated graceful exit before migration.
There is no such node left, so we can remove all this logic.
In a next step, we will drop the table.
Change-Id: I3aa9bc29b76065d34b57a73f6e9c9d0297587f54
Currently TextMaxVerifyCount flakes in some tests, try increasing the
sleep time to ensure that things are slow enough to trigger the error
condition.
Also pass ctx to all the funcs so we can handle sleep better.
Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.
Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
For being able to use the segment metainfo loop, graceful exit transfers have to include the segment stream_id/position instead of the path. For this, we created a new table graceful_exit_segment_transfer_queue that will replace the graceful_exit_transfer_queue. The table has been created in a previous migration and made accessible through graceful exit db in another one.
This changes makes graceful exit enqueue transfer items for new exiting nodes in the new table.
Change-Id: I7bd00de13e749be521d63ef3b80c168df66b9433
For being able to use the segment metainfo loop, graceful exit transfers have to include the segment stream_id/position instead of the path. For this, we created a new table graceful_exit_segment_transfer_queue that will replace the graceful_exit_transfer_queue. The table has been created in a previous migration.
This change gives access to this table.
Graceful Exit doesn't use the table yet, this will be done in a next change.
Change-Id: I6c09cff4cc45f0529813a8898ddb2d14aadb2cb8
Estimate speed of the uploads by calculating the time a client has been in a non-congested state. Storage node is uncongested when it has used up over 80% of its maximum concurrent requests capacity.
Currently it's disabled by default.
fillInBlobAccess was using a non-pointer receiver so the receiver wasn't
being modified. Luckily, this seems to be only being used in tests.
Change-Id: Ice01419933295562d558d48ba314d476660b67bd
Added expectations endpoint (estimations and distributed), added
coalesce to db query, so in case of empty payouts db 0 will be returned instead of error.
Change-Id: I535f14ef097876448d8949bc302895b25da2b6e7
Currently, if a node has untrusted a satellite, the satellite can still
successfully ping the node. If a node decide to untrust a satellite, the
satellite should also mark it as conact failed
Change-Id: Idf80fa00d9849205533dd3e5b3b775b5b9686705
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
Initially there were pkg and private packages, however for all practical
purposes there's no significant difference between them. It's clearer to
have a single private package - and when we do get a specific
abstraction that needs to be reused, we can move it to storj.io/common
or storj.io/private.
Change-Id: Ibc2036e67f312f5d63cb4a97f5a92e38ae413aa5
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.
Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
QUIC
We want to encourage storagenodes to open their udp port. This PR
changes contact service in satellite to try to connect to nodes through
QUIC. If satellite can't reach nodes through quic, it will send an error
message back to nodes. On the nodes side, it will always log out error
message from check in if the error message is not empty.
Whether satellite can reach nodes through quic has no affect on nodes'
uptime check.
Change-Id: I5ebf80f921c4a6504997d83c8bd45226da9d3703
This is a temporary precaution to avoid incorrectly auditing nodes for pieces that were deleted between database backups if we have to restore from a previous backup.
Here we send pieces to trash rather than directly deleting them from storage nodes so we can restore from trash after a db restoration.
Change-Id: Icd979d2a9a755e7428190c0129c9bc969649d544
TestStorageNodeApi is failing due to very slight differences in
float values. This change rounds these values to 3 decimal places before
comparing them.
Change-Id: Ic7fae3a5e0a0a942c03d982bfa7b19357f2e3d2e
while satellites have also run this logic, old satellites that
no longer exist cannot and so the node cannot get the updated
data. this locally migrates it so that the calculations for
the undistributed amounts are correct.
there's also some tab/space whitespace and gofmt fixes.
Change-Id: I470879703314fe6541eaba5f21b47849781894f8
Remove the orders Settlement endpoint because it isn't used and it was
already always returning an error.
Change-Id: I81486fbe7044a1444182173bc0693698ee7cfe7e
allow disabling tcp/quic
In order to have more control of a server so that we can
simulate connection failures in `testplanet`, this PR changes
quic.Listener to accept an existing UDPConn instead of relying on the
quic-go library to create the UDPConn.
This PR also adds two flags on the `server.Config` struct to allow
enabling/disabling tcp/tls listener and quic listener. By default, they
are both set to true.
- `DisableTCPTLS`: internal flag, disables tcp/tls listener.
- `DisableQUIC`: hidden flag, disables quic listener
By making the `DisableQUIC` a hidden flag, it allows storagenode operators to
have the ability to disable quic traffic in case their set up can't work
with udp traffic.
Change-Id: I853b12435d988b9c41ad9b873fd57480d792e378
Delete satellite order methods and DB tables which aren't used anymore
after we have done a refactoring on the orders to stuck bucket
information in the orders' encrypted metadata.
There are also configuration parameters and a satellite chore that
aren't needed anymore after the orders refactoring.
Change-Id: Ida3682b95921df70792284b42c96d2508bf8ca9c
From the name of the function and from the way it is used (only called
in one place, from "storj.io/storagenode/gracefulexit".(*Chore).Run()),
it should not return graceful exits that have already completed.
In particular, this causes a problem in the case that a node has already
completed a graceful exit from one satellite, after which the satellite
was decommissioned and no longer in the "trusted" list. This causes an
error message to show up in the node logs every single minute like
"failed to get satellite address ... satellite \"X\" is untrusted".
https://forum.storj.io/t/error-gracefulexit-service-failed-to-get-satellite-address/11372
This change causes ListPendingExits to list pending exits only, not all
exits.
Correspondingly, the check for whether an exit is already completed, in
(*Chore).Run(), becomes unnecessary and is here removed.
Change-Id: Ia3e9bb3e92be4a32ebcbda0321e3fe61d77deaa8
When using calling time.Now() multiple times, they can cross
month boundary causing errors in calculations.
Change-Id: I66b5be7598f3bf475b4b5fe0dcce82eee55b3134
Full scope:
storagenode/{console,nodestats,notifications,reputation,storagenodedb},
web/storagenode
These columns are deprecated. They used to be for the uptime reputation
system which has been replaced by downtime tracking with audits.
Change-Id: I151d6569577d89733ac97af21a1d885323522b21
On servers with non-UTC it would have calculated a different month boundary.
If node joined in current month calculations will be related on amount of days node've been working.
Change-Id: Ie572b197f50c6cdff5a044a53dfb5b9138f82f24
Full prefix: satellite/{overlay,nodestats},storagenode/{reputation,nodestats}
Allow the storagenode to receive its audit history data from the
satellite via the satellite's GetStats endpoint.
The storagenode does not save this data for use in the API yet.
Change-Id: I9488f4d7a4ccb4ccf8336b8e4aeb3e5beee54979
Add the upload size to the log lines of the storagenode Upload endpoint
to provides the information to Storage node operators.
Change-Id: Ife661d28be72c2bf02579093e21fa811566ac8dd
We want to stop using the serial_numbers table in satelliteDB. One of the last places using the serial_numbers table is when storagenodes settle orders, we look up the bucket name and project ID from the serial number from the serial_numbers table.
Now that we have support to add encrypted metadata into the OrderLimit, this PR makes use of that and now attempts to read the project ID and bucket name from the encrypted orderLimit metadata instead of from the serial_numbers table. For backwards compatibility and to ensure no errors, we will still fallback to the old way of getting that info from the serial_numbers table, but this will be removed in the next release as long as there are no errors.
All processes that create orderLimits must have an orders.encryption-keys set. The services that create orderLimits (and thus need to encrypt the order metadata) are the satellite apiProcess, the repair process, audit service (core process), and graceful exit (core process). Only the satellite api process decrypts the order metadata when storagenodes settle orders. This means that the same encryption key needs to be provided in the config for the satellite api process, repair process, and the core process like so:
orders.include-encrypted-metadata=true
orders.encryption-keys="<"encryptionKeyID>=<encryptionKey>"
Change-Id: Ie2c037971713d6fbf69d697bfad7f8b672eedd66
Before manipulating order information on storagenodes we need to wait
for the orders to propagate to the database. Some of that happens
async with uplink.
Change-Id: Iaacfd7db0909ab5d2831d06388e5fb27b6d4778f
Define constants of 32 KiB as the upper limit of the marshalled order
and limit protobuf sizes. This value gives lots of buffer in case the
protobufs ever change, but is not as extreme as what we were doing
before in V0 files, which was to use the Uint32 max value.
Change-Id: I0914d17dde3b044b2611af33f931d46d55f81e98
Fix the error message reported by a wrong order size due to passing the
wrong variable to the interpolation pattern.
Change-Id: Ic0059615c60cfa33a26d4aeb0ebda5e586f0df05
`make` built function to build a new slice with a negative
length panics.
`make` length parameter is of `int` type.
These changes avoid that `make` panics on 32 bits architecture due to
the fact that `int` type is a `int32` an uint32 value can be over the
maximum `int32`, and when that happens the length parameter value
becomes negative and makes `make` to panic.
Change-Id: Ife9ab5993916d6dcf5584b37c208272269cb2b45
If the satellite fails to pingback the storage node during CheckIn
an error message is returned to the node in the response, but the actual
error value returned is nil. We are only checking the error. This means
the node has no feedback about the failure, and the node also does not
attempt to retry the connection.
Change-Id: Iaed00e422ba91af573e72255cc6671ea97928eae
This change fixes two things which can make reading from a corrupted
orders file inefficient.
* When a corrupted order is detected, but the underlying error is an
UnexpectedEOF (as opposed to a pb.Unmarshal error, for instance), there
is no point in attempting to read from the file another time to find an
additional uncorrupted order - we will continue to get UnexpectedEOF
errors until we seek to the very end of the file and get a normal EOF.
Instead, when UnexpectedEOF occurs, log and send metrics as with other
types of corruption, but do not attempt to read again.
* When a corrupted order is detected, instead of seeking forward only
one byte for the next attempt, seek forward by the size of entryHeader.
This cuts down on the number of iterations needed to find an uncorrupted
order after detecting a corrupted one.
Change-Id: Ie1a613127e29d29318584ec7f60e8f7554f73487
Previously, we created a new file to use for directory verification
every time the storage node starts. This is not helpful if the storage node
points to the wrong directory when restarting. Now we will only create the file
on setup. Now the file should be created only once and will be verified at
runtime.
Change-Id: Id529f681469138d368e5ea3c63159befe62b1a5b
We shouldn't have any EOF issues with recent drpc fix, let's reenable
and see whether it's still flaky.
Change-Id: I0de312bcb087c7f70ec9d3281d73d86f971845d5
Make metainfo.RSConfig a valid pflag config value. This allows us to
configure the RSConfig as a string like k/m/o/n-shareSize, which makes
having multiple supported RS schemes easier in the future.
RS-related config values that are no longer needed have been removed
(MinTotalThreshold, MaxTotalThreshold, MaxBufferMem, Verify).
Change-Id: I0178ae467dcf4375c504e7202f31443d627c15e1
Missed one case of Unmarshal in the previous commit for V0 files (0f4e4969b7)
In V1, unmarshalling was being attempted before the checksum was
verified, so this commit moves those calls to the end of the V1 ReadOne
function.
Change-Id: Ic0b49f0bbc91fb61fb28af6003060994d0af22ed
In V0 orders files, unexpected EOF is correctly treated as a file
corruption, but pb.Unmarshal can also fail, and this is not treated as a
file corruption. This commit fixes that.
Change-Id: I6b446a10f4b1a5a44e832cbcc9bf8b2548cfcfeb
In production we are seeing ~115 storage nodes (out of ~6,500) are not using the new SettlementWithWindow endpoint (but they are upgraded to > v1.12).
We analyzed data being reported by monkit for the nodes who were above version 1.11 but were not successfully submitting orders to the new endpoint.
The nodes fell into a few categories:
1. Always fail to list orders from the db; never get to try sending orders from the filestore
2. Successfully list/send orders from the db; never get to calling satellite endpoint for submitting filestore orders
3. Successfully list/send orders from the db; successfully list filestore orders, but satellite endpoint fails (with "unauthenticated" drpc error)
The code change here add the following to address these issues:
- modify the query for ordersDB.listUnsentBySatellite so that we no longer select expired orders from the unsent_orders table
- always process any orders that are in the ordersDB and also any orders stored in the filestore
- add monkit monitoring to filestore.ListUnsentBySatellite so that we can see the failures/successes
Change-Id: I0b473e5d75252e7ab5fa6b5c204ed260ab5094ec
WHAT:
changed estimation table row order.
WHY:
to show gross total for selected period to avoid misunderstanding
when held amount is bigger than paid multiple times.
Change-Id: I03881c8af682372139a378030acf04f199d3260b
Use tagsql.DB pointer as step database, to propagate changes
back and forth between actual database and migration.
Adds CreateDB operation to the migration step to be able to
create new dbs before executing migration action.
Adjusts storagenode database migration to use inner tagsql.DB
pointer of each database as step.DB.
Adjusts satellite dabase migration, adds proxy migrationDB field
to satellite db that wraps itself as tagsql.DB, pointer of which
is used as step.DB.
Change-Id: Ifed4de5b01a356cf7b37db64d2eaeb7b61982c5c
Right now, the best way for a storage node operator to get the current
space used for each satellite is to run the `storagenode exit-satellite`
command for graceful exit, and cancel at the second confirmation prompt.
This is convoluted and the data is readily available from the Blobs
Usage Cache.
This change adds the current space used by each satellite to the
endpoints `/api/sno` and `/api/sno/satellite/<Satellite ID>`
Change-Id: I2173005bb016fc76db96fd598d26b485e5b2aa0b