20 Commits

Author SHA1 Message Date
paul cannon
fd55dad735 storagenode/retain: don't quit on error
It has been noted in the forum that, during a Retain operation, when a
piece can't be deleted, the process never completes. The error is
written to the log, but the completion line "Moved pieces to trash
during retain" never is.

This `return` line is the reason. We should instead continue the loop.

Change-Id: I0f51d34aba0e81ad60a75802069b42dc135ad907
Refs: https://github.com/storj/storj/issues/6482
2023-11-06 10:54:46 -06:00
Clement Sam
e0542c2d24 storagenode: run garbage collection filewalker as a low I/O subprocess
Updates https://github.com/storj/storj/issues/5349

Change-Id: I7d810d737b17f0b74943765f7f7cc30b9fcf1425
2023-05-02 19:43:38 +00:00
paul cannon
2f04e20627 storage/filestore: better error message on data corruption
A user on the forum was seeing the error "bad message", which was not
very helpful. This came from the ext4 filesystem, which uses the code
EBADMSG to indicate that it detected an invalid CRC, suggesting disk
corruption.

This change adds some explanatory information about probable disk
corruption to all errors coming from the (*blobInfo).Stat() call, which
is where storagenode fs corruption problems will usually manifest.

Refs: https://github.com/storj/storj/issues/5375
Change-Id: I87f4a800236050415c4191ef1a0fc952f9def315
2023-01-30 08:54:06 -06:00
paul cannon
ed7c82439d storage/filestore: avoid stat() during walkNamespaceInPath
Calling stat() (really, lstat()) on every file during a directory walk
is the step that takes up the most time. Furthermore, not all directory
walk uses _need_ to have a stat done on every file. Therefore, in this
commit we avoid doing the stat at the lowest level of
walkNamespaceInPath. The stat will still be done when it is requested,
with the Stat() method on the blobInfo object.

The major upside of this is that we can avoid the stat call on most
files during a Retain operation. This should speed up garbage collection
considerably.

The major downside is that walkNamespaceInPath will no longer
automatically skip over directories that are named like blob files, or
blob files which are deleted between readdir() and stat(). Callers to
walkNamespaceInPath and its variants (WalkNamespace,
WalkSatellitePieces, etc) are now expected to handle these cases
individually.

Thanks to forum member Toyoo for the insight that this would speed up
garbage collection.

Refs: https://github.com/storj/storj/issues/5454
Change-Id: I72930573d58928fa25057ed89cd4ec474b884199
2023-01-30 13:47:03 +00:00
Erik van Velzen
e6b5501f9b satellite/gc/sender: new service to send retain filters
Implement a new service to read retain filters from a bucket and
send them out to storagenodes.

This allows the retain filters to be generated by a separate command on
a backup of the database.

Parallelism (setting ConcurrentSends) and end-to-end garbage collection
tests will be restored in a subsequent commit.

Solves https://github.com/storj/team-metainfo/issues/121

Change-Id: Iaf8a33fbf6987676cc3cf74a18a8078916fe673d
2022-09-20 11:49:40 +00:00
Stefan Benten
c3171b4ba4 storagenode/retain: Move summary and start logs to info level (#4954)
We currently do not log the GC information/stats under normal circumstances.
This is not good for monitoring and troubleshooting.
2022-07-08 18:19:08 +02:00
Yaroslav Vorobiov
c9cfb5ed0c storagenode/retain: add more verbose monkit monitoring
Change-Id: Ibb9804268751b4b1842eb729bc510dba83e9b28b
2021-06-04 20:20:11 +00:00
Qweder93
53a5d18e1a storagenode: fixed logging about piece being moved to trash, and added logging when piece was actually deleted
Change-Id: I46f6a141b27033c2087b5c4681506d80b90f4a18
2020-08-02 20:00:05 +03:00
Egon Elbre
080ba47a06 all: fix dots
Change-Id: I6a419c62700c568254ff67ae5b73efed2fc98aa2
2020-07-16 14:58:28 +00:00
Jeff Wendling
7999d24f81 all: use monkit v3
this commit updates our monkit dependency to the v3 version where
it outputs in an influx style. this makes discovery much easier
as many tools are built to look at it this way.

graphite and rothko will suffer some due to no longer being a tree
based on dots. hopefully time will exist to update rothko to
index based on the new metric format.

it adds an influx output for the statreceiver so that we can
write to influxdb v1 or v2 directly.

Change-Id: Iae9f9494a6d29cfbd1f932a5e71a891b490415ff
2020-02-05 23:53:17 +00:00
Egon Elbre
6615ecc9b6 common: separate repository
Change-Id: Ibb89c42060450e3839481a7e495bbe3ad940610a
2019-12-27 14:11:15 +02:00
littleskunk
08947e177d storagenode/garbagecollection: enable in production
Change-Id: I627b7a37ca4a85eb19936ca2c7ca907d7cc63f5b
2019-12-16 22:44:04 +00:00
littleskunk
9d1faeee58 storagenode/garbagecollection: increase MaxTimeSkew to be higher than satellite MaxCommitInterval
Change-Id: I86f8d0b44bea3aa005ff26d52588611c59df5e9a
2019-12-09 16:03:55 +00:00
Isaac Hess
6aeddf2f53 storagenode/pieces: Add Trash and RestoreTrash to piecestore (#3575)
* storagenode/pieces: Add Trash and RestoreTrash to piecestore

* Add index for expiration trash
2019-11-20 09:28:49 -07:00
littleskunk
7eb6724c92 logging: unify logging around satellite ID, node ID and piece ID (#3491)
* logging: unify logging around satellite ID, node ID and piece ID

* unify segment index
2019-11-05 22:04:07 +01:00
Maximillian von Briesen
08ed50bcaa satellite/metainfo: add commit interval to prevent long delays between order limit creation and segment commit (#3149)
2019-10-01 12:55:02 -04:00
Egon Elbre
a801fab66a all: add archview annotations (#2964)
2019-09-10 16:24:16 +03:00
Egon Elbre
8a5db77e04 storagenode/retain: add comment (#2910)
2019-08-29 19:42:17 +03:00
Egon Elbre
62e3bf5b34 storagenode/retain: fix concurrency issues (#2828)
* nicer flags

* fix concurrency

* add concurrent workers

* initialize things

* fix tests

* close retain service

* ensure we don't have workers working on the same satellite

* ensure things compile

* fix other compilation issues:

* concurrency changes

ran this with `go test -count=1000` and it passed all of them.

- we add a closed channel so that we can select on it with
  context cancellation.
- we put a once in so we only close the channel once.
- every time the queue/running state changes, we have to broadcast
  because we may want to wake up N pending Wait calls or other
  concurrent workers.
- because we broadcast, we don't need to do the polling in Wait
  anymore.
- ensure Run doesn't start multiple times so that we don't have
  to worry about concurrent Close with multiple Runs.
- hold the lock while we start workers so that a concurrent Close
  with Run can't decide that there's nothing started and exit
  and then have Run start things.
- make sure to poll the closed/context channels through loops
  or at the start of Run calls in case Close happens first.
- these polls should be under a mutex because they have a default
  case which makes it possible to schedule such that Close hasn't
  executed the channel close so it starts more work.
- cancel a local Run context when it's going to exit to make sure
  that any retainPieces calls have a canceled context.
- hopefully enough comments to both check my work and help readers
  digest what's going on.

Change-Id: Ida0e226a7e01e8ae64fa2c59dd5a84b04bccfbd7

* use the retain error class

Change-Id: I1511eaef135f98afd57b878e997e4c8a0d11cafc

* concurrency fixes again

- forgot to update the gc test to use the old Wait api.
- we need to drop the lock while we wait for the workers
  to exit, because they may be blocked on the condition
  variable
- additionally, we need to broadcast when we close the
  signal channel because the state changed: they want
  to wake up and exit.

Change-Id: I4204699792275260cd912f29aa73720f7d9b14b5

* undo my misguided rename

Change-Id: I6baffe1eb0434e260212c485bbcc01bed3250881

* remove pollInterval

* format paragraph more nicely

* move skew calculation into retain pieces
2019-08-28 16:35:25 -04:00
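
Several of the points listed in 62e3bf5b34 (a closed channel guarded by a once, broadcasting on every state change, and checking the closed channel under the mutex) can be illustrated with a small work queue. This is a sketch under assumptions: `queue`, `Push`, `next`, and `Close` are hypothetical names and do not mirror the retain service's actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical job queue shared by several workers.
type queue struct {
	mu        sync.Mutex
	cond      *sync.Cond
	pending   []string
	closed    chan struct{}
	closeOnce sync.Once
}

func newQueue() *queue {
	q := &queue{closed: make(chan struct{})}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// isClosed polls the closed channel. Callers must hold q.mu so that a
// concurrent Close cannot slip in between the poll and the next action.
func (q *queue) isClosed() bool {
	select {
	case <-q.closed:
		return true
	default:
		return false
	}
}

// Push adds a job and wakes every waiter, because the state changed.
func (q *queue) Push(job string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.isClosed() {
		return false
	}
	q.pending = append(q.pending, job)
	q.cond.Broadcast()
	return true
}

// next blocks until a job is available or the queue is closed.
func (q *queue) next() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for {
		if q.isClosed() {
			return "", false
		}
		if len(q.pending) > 0 {
			job := q.pending[0]
			q.pending = q.pending[1:]
			q.cond.Broadcast()
			return job, true
		}
		q.cond.Wait() // no polling: Push and Close broadcast
	}
}

// Close closes the signal channel exactly once and wakes blocked workers.
func (q *queue) Close() {
	q.closeOnce.Do(func() {
		q.mu.Lock()
		close(q.closed)
		q.mu.Unlock()
		q.cond.Broadcast()
	})
}

func main() {
	q := newQueue()
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for job, ok := q.next(); ok; job, ok = q.next() {
				fmt.Printf("worker %d handled %s\n", id, job)
			}
		}(i)
	}
	q.Push("satellite-1")
	q.Close()
	wg.Wait() // wait without holding q.mu, since workers need the lock to exit
}
```

In this sketch, jobs still pending when `Close` is called are dropped, so the demo may print nothing; that is a simplification of the sketch, not a claim about the real service.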
Maximillian von Briesen
d83a965139 storagenode/piecestore: Add retain service on storagenode (#2785)
Add retain service on storagenode. This service runs retain jobs that have been queued by the storagenodes. Rather than running retain jobs during the gRPC Retain() call, the gRPC call queues a retain job to the retain service and returns immediately afterwards, removing a significant bottleneck in garbage collection.
2019-08-19 14:52:47 -04:00