we may in the future want to accept orders as part of the initial
request message
(e.g. https://review.dev.storj.io/c/storj/uplink/+/9246).
this change is forward compatible but continues to work with
existing clients.
Change-Id: I475ad50d6cbfee8a1f843383230698e4ef9b9e54
This ensures that empty trash and restore trash cannot run at the same
time.
Fixes https://github.com/storj/storj/issues/5416
Change-Id: I9d2e3aa3d66e61e5c8a7427a95208bb96089792d
* normalize ExistsCheckWorkers flag to avoid setting incorrect values
* wait for limiter to finish if context was canceled
Change-Id: I688a395bd958cd09233605fc264d43f91ec45ed1
Adds new method Exists which can be used to verify which
requested piece ids exists on storage node. Will verify only pieces
which belongs to the satellite that used that endpoint.
Minum WASM size was increased a bit.
https://github.com/storj/storj/issues/5415
Change-Id: Ia5f9cadeb526541b2776a8973eb7d50133ad8636
Starting restore trash in the background allows the satellite to
continue to the next storagenode without needing to wait until
completion.
Of course, this means the satellite doesn't get feedback whether it
succeeds successfully or not. This means that the restore-trash needs to
be executed several times.
Change-Id: I62d43f6f2e4a07854f6d083a65badf897338083b
Previously because of the use of a LAG to calculate the hour_interval
the first record, which is usually the first day of the month usually,
doesn’t have a previous record and always assumes the at_rest_total is
for 24 hours.
Resolves https://github.com/storj/storj/issues/5390
Change-Id: Id532f8b38fe9df61432e62655318ff119a733d13
The query changes we did while fixing the usage graph led to wrong
payout calculations directly linked to disk space.
This change:
- avoids converting from Bh to B directly in the query
- returns the at_rest_total in the original bytes*hour value
- returns at_rest_total_bytes as the calculated disk spaced used in bytes
- uses the at_rest_total_bytes only for the disk space graph
- return summary_bytes as the average disk space used within the specified date
- updates the disk space graph header to "average disk space used this month"
The total disk used in the month is also displayed in B not B*day
Resolves https://github.com/storj/storj/issues/5355
Change-Id: I2cfefb0fe711f9c59de2adb547c4ab50b05c7cbb
Running all of the migrations necessary to initialize a storage node
database takes a significant amount of time during runs.
The package current supports initializing a database from manually coalesced
migration data (i.e. snapshot) which improves the situation somewhat.
This change takes things a bit further by changing the snapshot code to
instead hydrate the database directory from a pre-generated snapshot zip
file.
name old time/op new time/op delta
Run_StorageNodeCount_4/Postgres-16 2.50s ± 0% 0.16s ± 0% ~ (p=1.000 n=1+1)
Change-Id: I213bbba5f9199497fbe8ce889b627e853f8b29a0
The current implementation blocks the the startup until one or none
of the trusted satellites is able to reach the node via QUIC.
This can cause delayed startup. Also, the quic check is done
once during startup, and if there is a misconfiguration later,
snos would have to restart to node.
In this change, we reuse the contact service which pings the satellite
periodically for node checkin. During checkin the satellite tries
pinging the node back via both TCP and QUIC and reports both statuses.
WIth this, we are able to get a periodic update of the QUIC status
without restarting the node.
Also adds the time the node was last pinged via QUIC to the tooltip
on the QUIC status tab.
Resolves https://github.com/storj/storj/issues/4398
Change-Id: I18aa2a8e8d44e8187f8f2eb51f398fa6073882a4
Implement a new service to read retain filter from a bucket and
send them out to storagenodes.
This allows the retain filters to be generated by a separate command on
a backup of the database.
Paralellism (setting ConcurrentSends) and end-to-end garbage collection
tests will be restored in a subsequent commit.
Solves https://github.com/storj/team-metainfo/issues/121
Change-Id: Iaf8a33fbf6987676cc3cf74a18a8078916fe673d
The collector tries deleting a piece over and over again, though
the piece does not exist on the storagenode's filesystem.
We need to delete the piece info from the expired db if the
targeted file does not exist.
This does not resolve the base problem of why the file
is deleted before the collector tries deleting it.
This change deletes the piece info from the expired db
if the file does not exist, since we're already trying
to delete that piece anyway.
Closes https://github.com/storj/storj/issues/4192
Change-Id: If659185ca14f1cb29fd3c4237374df6fcd535df8
The collector calls the Delete() method on the pieces
which returns an error which is wrapped by many error classes.
Delete() method is using Stat() from
1aec831d98/storage/filestore/dir.go (L328)
under the hood.
os.IsNotExist(errors.Unwrap(err) will always be false unless
errors.Unwrap(err) is called multiple times till it gets to
the core os.ErrNotExist. Here is a test case to explain better:
func TestABC(t *testing.T) {
classA := errs.Class("A")
classB := errs.Class("B")
wrappedError := classB.Wrap(classA.Wrap(os.ErrNotExist))
require.True(t, os.IsNotExist(errs.Unwrap(wrappedError)))
require.True(t, os.IsNotExist(errors.Unwrap(wrappedError)))
}
Using errs.Is() seems to resolve this even without unwrapping the error:
func TestABC(t *testing.T) {
classA := errs.Class("A")
classB := errs.Class("B")
wrappedError := classB.Wrap(classA.Wrap(os.ErrNotExist))
require.True(t, errs.Is(wrappedError, os.ErrNotExist))
require.False(t, errs.Is(wrappedError, os.ErrExist))
require.False(t, os.IsNotExist(wrappedError))
}
Does not resolve the collector issue here but enhances it:
https://github.com/storj/storj/issues/4192
Change-Id: Ifb75dd15b54c1e1a5e23f6eba2d621d64874a5cc
Today each storagenode should have a port which is opened for the internet, and handles DRPC protocol calls.
When we do a HTTP call on the DRPC endpoint, it hangs until a timeout.
This patch changes the behavior: the main DRPC port of the storagenodes can accept HTTP requests and can be used to monitor the status of the node:
* if returns with HTTP 200 only if the storagnode is healthy (not suspended / disqualified + online score > 0.9)
* it CAN include information about the current status (per satellite). It's opt-in, you should configure it so.
In this way it becomes extremely easy to monitor storagenodes with external uptime services.
Note: this patch exposes some information which was not easily available before (especially the node status, and used satellites). I think it should be acceptable:
* Until having more community satellites, all storagenodes are connected to the main Storj satellites.
* With community satellites, it's good thing to have more transparency (easy way to check who is connected to which satellites)
The implementation is based on this line:
```
http.Serve(NewPrefixedListener([]byte("GET / HT"), publicMux.Route("GET / HT")), p.public.http)
```
This line answers to the TCP requests with `GET / HT...` (GET HTTP request to the route), but puts back the removed prefix.
Change-Id: I3700c7e24524850825ecdf75a4bcc3b4afcb3a74
Similar to the existing snapshot based tests of satellite/metabase db we make a migration here which is:
* dedicated to unit tests
* faster (with less steps)
* but safe: additional unit test ensures that the snapshot based migration and normal prod migration have the same results.
Change-Id: Ie324b09f64b4553df02247a9461ece305a6cf832
I don't know why the go people thought this was a good idea, because
this automatic reformatting is bound to do the wrong thing sometimes,
which is very annoying. But I don't see a way to turn it off, so best to
get this change out of the way.
Change-Id: Ib5dbbca6a6f6fc944d76c9b511b8c904f796e4f3
For the storagenode usage graph currently,
1. the graph is still likely to contain spikes on the first day of
the month for a newly setup storagenode because there's no previous
interval_end_time to deduct from.
2. if a node goes offline for too long, say the last usage report
in the storageusage cache has an interval_end_time of
2020-07-30 00:00:00+00:00 and later, it comes back online a few
days later, it requests for the storage usage from the satellite
starting from the current month, say 2021-01-01 00:00:00+00:00,
the calculated hours for the first day would be 48 hours and it
could be wrong because the cache is missing one day usage report.
This PR addresses second issue on the storagenode side by requesting
storage usage data, instead of just a month boundary, request for an
interval starting from the last day of the previous month to the
current day of the current month.
The first one will be a tradeoff and wouldn't really matter since
it will just be an issue on the first day the storagenode joined
the satellite.
Updates https://github.com/storj/storj/issues/4178
Change-Id: I041c56c412030ce013dd77dce11b0b5d6550927b
The satellite now returns the last interval_end_time for each
daily storage usage.
We need to store the interval_end_time in the storage usage cache.
Also, renamed interval_start to timestamp to avoid ambiguity since
the interval_start only stores just the date/day returned by the
satellite.
Updates https://github.com/storj/storj/issues/4178
Change-Id: I94138ba8a506eeedd6703787ee03ab3e072efa32
* Allow configure initial piece scan
* Update tests to include new flag
* Update default config specification
* Rename configuration flag
* Rename variable and fix formatting
* Fix format
* Fix typo
Co-authored-by: Stefan Benten <mail@stefan-benten.de>
Co-authored-by: Clement Sam <clementsam75@gmail.com>
Co-authored-by: littleskunk <jens.heimbuerge@googlemail.com>
In rare cases time frame between creating time.Now() variable and calling
service method that receives it (after API method call) was big enough to distort
current month expectations and make test fail with last digit difference in 1
Change-Id: Ib811492d62f6598a5c40a09de6a87bffeaa0a78e
It seems the tests relied on time.Now(), which might cause some
discrepancies in calculations. Use a fixed time.Now() rather than
recalculating.
As a sidefix, remove "Test" prefix from t.Run. These are unnecessary.
Change-Id: I1de903fcf0fcf46fc8e3acf2463e17239b8e3cc6
Remote closing during upload or download is entirely expected and
it shouldn't lead to an error in the log.
Bump drpc to get the version that contains correct error code
for it. Also bump errs, which contains a fix for .Has.
Fixes https://github.com/storj/storj/issues/4609
Change-Id: I9297cabcfdc4b3a2c19d478dc729f779a2aef0c3
Go can now directly embed files without relying on external tools.
This makes code use go:embed and avoid the external tooling.
go:embed requires files to be present in the embedded directory,
hence we need to add .keep to "dist" folder. We also add one to
public/.keep, such that it won't be deleted when building storagenode.
Change-Id: I8bef81236be6829ed37ed4c16ef693677b93a631
Move storagnode/console caching headers to private/web. Also,
start using them in multinode/console/server.
Change-Id: I1f0f3c9833a183476009737cece515ae7537fb83
This change includes storagenode QUIC status on SNO dashboard.
If disabled, it displays warning for SNOs to foward their
UDP port for quic.
Change-Id: I8d28c9c0f5f1e90d80b7c18b9e1e7b78c5e45609
* storagenode/piecestore: Add upload and download metrics for Grafana alerts
* storagenode/piecestore: group download metrics by piece action
Change-Id: Ib2a42b60c56c3f581915d512f4907c8db71e4624
Co-authored-by: Clement Sam <clement@storj.io>
When something happens during opening or closing then it wasn't clear
which database had the issue.
Fixesstorj/storj#4271
Change-Id: I34c8bae79a5b41ccd9b40aa8d836805f8c1a573c
When a satellite operator runs restore trash command, it's hard for node
operator to know whether their node has received the request. Adding
logs so this process is visible from node operators perspective.
Change-Id: I08c670b1b0c5971e6b73e038a0935cb0caaca63d
Remove the logic associated to the old transfer queue.
A new transfer queue (gracefulexit_segment_transfer_queue) has been created for migration to segmentLoop.
Transfers from the old queue were not moved to the new queue.
Instead, it was still used for nodes which have initiated graceful exit before migration.
There is no such node left, so we can remove all this logic.
In a next step, we will drop the table.
Change-Id: I3aa9bc29b76065d34b57a73f6e9c9d0297587f54
Currently TextMaxVerifyCount flakes in some tests, try increasing the
sleep time to ensure that things are slow enough to trigger the error
condition.
Also pass ctx to all the funcs so we can handle sleep better.
Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00
At some point we moved metabase package outside Metainfo
but we didn't do that for satellite structure. This change
refactors only tests.
When uplink will be adjusted we can remove old entries in
Metainfo struct.
Change-Id: I2b66ed29f539b0ec0f490cad42c72840e0351bcb
For being able to use the segment metainfo loop, graceful exit transfers have to include the segment stream_id/position instead of the path. For this, we created a new table graceful_exit_segment_transfer_queue that will replace the graceful_exit_transfer_queue. The table has been created in a previous migration and made accessible through graceful exit db in another one.
This changes makes graceful exit enqueue transfer items for new exiting nodes in the new table.
Change-Id: I7bd00de13e749be521d63ef3b80c168df66b9433
For being able to use the segment metainfo loop, graceful exit transfers have to include the segment stream_id/position instead of the path. For this, we created a new table graceful_exit_segment_transfer_queue that will replace the graceful_exit_transfer_queue. The table has been created in a previous migration.
This change gives access to this table.
Graceful Exit doesn't use the table yet, this will be done in a next change.
Change-Id: I6c09cff4cc45f0529813a8898ddb2d14aadb2cb8
Estimate speed of the uploads by calculating the time a client has been in a non-congested state. Storage node is uncongested when it has used up over 80% of its maximum concurrent requests capacity.
Currently it's disabled by default.
fillInBlobAccess was using a non-pointer receiver so the receiver wasn't
being modified. Luckily, this seems to be only being used in tests.
Change-Id: Ice01419933295562d558d48ba314d476660b67bd
Added expectations endpoint (estimations and distributed), added
coalesce to db query, so in case of empty payouts db 0 will be returned instead of error.
Change-Id: I535f14ef097876448d8949bc302895b25da2b6e7
Currently, if a node has untrusted a satellite, the satellite can still
successfully ping the node. If a node decide to untrust a satellite, the
satellite should also mark it as conact failed
Change-Id: Idf80fa00d9849205533dd3e5b3b775b5b9686705
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
Initially there were pkg and private packages, however for all practical
purposes there's no significant difference between them. It's clearer to
have a single private package - and when we do get a specific
abstraction that needs to be reused, we can move it to storj.io/common
or storj.io/private.
Change-Id: Ibc2036e67f312f5d63cb4a97f5a92e38ae413aa5
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.
Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
metabase has become a central concept and it's more suitable for it to
be directly nested under satellite rather than being part of metainfo.
metainfo is going to be the "endpoint" logic for handling requests.
Change-Id: I53770d6761ac1e9a1283b5aa68f471b21e784198
QUIC
We want to encourage storagenodes to open their udp port. This PR
changes contact service in satellite to try to connect to nodes through
QUIC. If satellite can't reach nodes through quic, it will send an error
message back to nodes. On the nodes side, it will always log out error
message from check in if the error message is not empty.
Whether satellite can reach nodes through quic has no affect on nodes'
uptime check.
Change-Id: I5ebf80f921c4a6504997d83c8bd45226da9d3703
This is a temporary precaution to avoid incorrectly auditing nodes for pieces that were deleted between database backups if we have to restore from a previous backup.
Here we send pieces to trash rather than directly deleting them from storage nodes so we can restore from trash after a db restoration.
Change-Id: Icd979d2a9a755e7428190c0129c9bc969649d544