fileInfo is, of course, nil here, so we can't use fileInfo.Name() to
report the location of the problem.
Change-Id: Ia858a3795b127da0fc812d0a158b36ad344ff76b
`storage.KeyValueStore` requires ordered iteration, which redis
doesn't support natively. This would require loading all the keys
into memory and then processing them, rather than iterating over them
one-by-one.
This adds a temporary `IterateUnordered` to handle the migrations
more gracefully.
Change-Id: I55b763500523077c7ab8fdfad175c32cc7788e47
A user on the forum was seeing the error "bad message", which was not
very helpful. This case from the ext4 filesystem using the code EBADMSG
to indicate it detected an invalid CRC, suggesting disk corruption.
This change adds some explanatory information about probable disk
corruption to all errors coming from the (*blobInfo).Stat() call, which
is where storagenode fs corruption problems will usually manifest.
Refs: https://github.com/storj/storj/issues/5375
Change-Id: I87f4a800236050415c4191ef1a0fc952f9def315
Calling stat() (really, lstat()) on every file during a directory walk
is the step that takes up the most time. Furthermore, not all directory
walk uses _need_ to have a stat done on every file. Therefore, in this
commit we avoid doing the stat at the lowest level of
walkNamespaceInPath. The stat will still be done when it is requested,
with the Stat() method on the blobInfo object.
The major upside of this is that we can avoid the stat call on most
files during a Retain operation. This should speed up garbage collection
considerably.
The major downside is that walkNamespaceInPath will no longer
automatically skip over directories that are named like blob files, or
blob files which are deleted between readdir() and stat(). Callers to
walkNamespaceInPath and its variants (WalkNamespace,
WalkSatellitePieces, etc) are now expected to handle these cases
individually.
Thanks to forum member Toyoo for the insight that this would speed up
garbage collection.
Refs: https://github.com/storj/storj/issues/5454
Change-Id: I72930573d58928fa25057ed89cd4ec474b884199
Full file sync before saving files to piecestore seems to be very expensive.
Bwfore we do any step to eliminate them we are planning to do more measurement which requires a temporary flag to turn it off. (not for production).
Change-Id: I5cb8f8cb348ca3590fb5eae14d02edb3f0424617
We are not using the benchmark results for anything, they are mostly
there to ensure that we don't break the benchmarks. So we can disable
CockroachDB for them.
Similarly add short versions of other tests.
Also try to precompile test/benchmark code.
Change-Id: I60b501789f70c289af68c37a052778dc75ae2b69
Currently TextMaxVerifyCount flakes in some tests, try increasing the
sleep time to ensure that things are slow enough to trigger the error
condition.
Also pass ctx to all the funcs so we can handle sleep better.
Change-Id: I605b6ea8b14a0a66d81a605ce3251f57a1669c00
errs.Class should not contain "error" in the name, since that causes a
lot of stutter in the error logs. As an example a log line could end up
looking like:
ERROR node stats service error: satellitedbs error: node stats database error: no rows
Whereas something like:
ERROR nodestats service: satellitedbs: nodestatsdb: no rows
Would contain all the necessary information without the stutter.
Change-Id: I7b7cb7e592ebab4bcfadc1eef11122584d2b20e0
Initially we duplicated the code to avoid large scale changes to
the packages. Now we are past metainfo refactor we can remove the
duplication.
Change-Id: I9d0b2756cc6e2a2f4d576afa408a15273a7e1cef
Rename the functions that are prefixed with 'New' which connect with
Redis by 'Open' to make clear that they perform network operations.
Change-Id: I1351e89a642e8e2c2586626646315ad0fb2c6242
Update the Redis dependency to use the last major production version.
The last version accepts a context parameter in all the network methods
so it allows us to pass it through them.
Change-Id: I34121b2ec3c2728602115c724933ad24c9e6e4fd
Move a specific interface & types used for testing to be a private
subpackage with a name that clearly identifies it for testing purpose.
Change-Id: I646cf3b6f0a3b518a6f9a125998dc5a02df02db6
We have to adapt the live accounting to allow the packages that use it
to differentiate about errors for being able to ignore them and make our
satellite resilient to Redis downtime.
For differentiating errors we should make changes in the live accounting
but also in the storage/redis.Client, however, we may need to do some
dirty workarounds or break other parts of the implementation that
depends on it.
On the other hand we want to get rid of the storage/redis.Client because
it has more functionality that the one that we are using and some
process has been started to remove it.
Hence, we have refactored the live accounting to directly use the Redis
client library for later on (in a future commit) adapt the satellite for
being resilient to Redis downtime.
Last but not least, a test for expired bandwidth keys have been added
and with it a bug was spotted and fix it.
Change-Id: Ibd191522cd20f6a9a15e5ccb7beb83a678e530ff
periodically create and delete a temp file in the storage directory
to verify writability. If this check fails, shut the node down.
Change-Id: I433e3a8d1d775fc779ae78e7cf3144a05ffd0574
Sometimes SNOs fail to properly configure or lose connection to their storage directory
which can result in DQ. This causes unnecessary repair and is unfortunate for all parties.
This change introduces the creation of a special file in the storage directory at runtime
containing the node ID. While the storage node runs, it periodically verifies that it can
find said file with the correct contents in the correct location. If not, the node will
shut down with an error message.
This change will solve the issue of nodes losing access to the storage directory, but it will not
solve the issue of nodes pointing to the wrong directory, as the identifying file is created each
time the node starts up. After this change has been the minimum version for a few releases, we will
remove the creation of the directory-identifying file from the storage node run command and add it
to the setup command.
Change-Id: Ib7b10e96ac07373219835e39239e93957e7667a4
To prevent storagenode from implicitly recreating missing dbs and storage,
as such behaviour leads to audit failures. Do not allow storagenode to
start if any of dbs or storage is missing, corrupted, or dedicated storage disk is
unmounted, to get downtime instead.
Change-Id: Ic64e1f0ff4d8ef5b2fddbe7a7e53df4f4bd8652e
What:
Use the github.com/jackc/pgx postgresql driver in place of
github.com/lib/pq.
Why:
github.com/lib/pq has some problems with error handling and context
cancellations (i.e. it might even issue queries or DML statements more
than once! see https://github.com/lib/pq/issues/939). The
github.com/jackx/pgx library appears not to have these problems, and
also appears to be better engineered and implemented (in particular, it
doesn't use "exceptions by panic"). It should also give us some
performance improvements in some cases, and even more so if we can use
it directly instead of going through the database/sql layer.
Change-Id: Ia696d220f340a097dee9550a312d37de14ed2044
This runs each benchmark for one iteration to ensure that they are
valid. Unfortunately, it does not give any useful metrics as output.
Change-Id: I68940398c8dd849aed656bd12656f48d5df10128
errors.New errors will not show up in monkit tracing
as a useful error type. this change fixes a test (!)
and makes it so monkit will tell us what the error
type is, if we have this failure
Change-Id: Ic9933704e4095495c7ee286d9df3eb7eb94b25c9
A file piece could be deleted in between walking the list of files read
from a directory and before we actually perform any operation on such
file. When that happens, we don't want to return an error, we want to
just ignore it and carry on.
Change-Id: I8f6986070e5883599a08fccf8b125c075b30fe1b
This ensures that rows are closed to avoid leaks.
Also verifies that Err() is called, to ensure that no
error is left behind.
Change-Id: Idd1bec9bf479f40021da67b2c80ce83033149469