No trash-cleanup?

Node is 1.110.3, startup scan disabled, lazy filewalker enabled.
When I grep for “trash-cleanup” on other nodes, I get results.
Not on this node.
Logs go back to the 17th. When I grep for trash, trash-cleanup, retain, bloom, gc, or lazy, nothing comes up.
However, there are date folders old enough to be deleted:

ls /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/
2024-06-21  2024-06-25  2024-07-04  2024-07-08  2024-07-14  2024-07-23  2024-08-05  2024-08-06  2024-08-08  2024-08-15
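
For reference, a quick way to see which of these have outlived the trash retention (7 days, if I'm not mistaken) is to compare the date-named folders against a cutoff. A rough sketch, assuming bash and GNU date:

cutoff=$(date -d '7 days ago' +%F)
for d in /trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/????-??-??; do
  # ISO dates sort lexically, so a plain string comparison is enough
  [[ $(basename "$d") < "$cutoff" ]] && echo "stale: $d"
done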

I can try a restart and see if it resolves it. But it looks strange.

On one node I checked, I also have some obsolete trash in my ukfu directory (8/1 and 8/2)… but the other satellites seem to have only current trash.

I checked another couple of nodes; they didn't have obsolete trash in the ukfu folder.

It seems there is no process running on the node that takes care of the trash.

It should start every hour. You may search for "\strash" and for "started|completed|finished|error" (the alternation needs grep -E, and docker logs may need 2>&1 to capture stderr).

docker logs storagenode 2>&1 | grep "\strash" | wc -l
0

docker logs storagenode 2>&1 | grep -E "started|completed|finished|error" | wc -l
0

I have no results.

Strange. Is there a config setting that I might have changed for that node?

Maybe you switched the log level to FATAL?
What does /mon/ps (on the debug port) show?

--log.level info \

It shows many uploads like this one, plus these entries:

[4297863461398401505,2950150802903170147] storj.io/common/rpc/rpctracing./piecestore.Piecestore/Upload() 
[2870453949255849715,1558329231107075052] storj.io/storj/storagenode/retain.(*Service).Run() (elapsed: 137h45m15.553853817s, orphaned)
[8355665620667357570,1558329231107075052] storj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkSatellitePiecesToTrash() (elapsed: 119h19m55.380969132s)
[6286762086551895438,1558329231107075052] storj.io/storj/storagenode/pieces.(*Store).Trash() (elapsed: 5.89909061s)

Could it be that the tasks have been running for so long that the starting log entries have been rotated away in the meantime, and that there are no log lines indicating progress while the tasks are still running?

Yes, it’s possible. The default log driver setup for Docker keeps 5 log files of 20MB each, so 100MB in total. This is why our instructions for the Raspberry Pi suggest keeping 10 logs of 50MB each; however, I believe even that is not enough for a month (the log file on my biggest node is already 3.6GB).
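
For example, the json-file log driver limits can be raised per container with Docker's standard log options (a sketch; the usual mounts and flags are omitted, and the same settings can go into /etc/docker/daemon.json to apply globally):

docker run -d --name storagenode \
    --log-opt max-size=50m \
    --log-opt max-file=10 \
    ... \
    storjlabs/storagenode:latest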

However, I do not see a trash cleanup filewalker in the /mon/ps output, though this method only shows the currently running processes.
You may also check for trash in the /mon/funcs output to see whether it ran at all.
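
For example, something like this against the debug port (assuming debug.addr is bound to 127.0.0.1:5999 as in the debug guide; adjust it to your setup):

curl -s http://127.0.0.1:5999/mon/ps | grep -i trash
curl -s http://127.0.0.1:5999/mon/funcs | grep -i -A 3 trash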

Still no trash-cleanup in the logs. I think I am going to restart this node.

Did you check the trash folders? It’s possible that the trash filewalker is still running and deleting pieces from the trash.
It works through the satellites’ folders one by one.
Also, /mon/funcs should have a mention of trash and some stats. But if it’s still running, it should appear in the /mon/ps output too.

When I checked today I found this:

2024-08-21T12:13:05Z    INFO    lazyfilewalker.trash-cleanup-filewalker subprocess exited with status 

This suggests that it was indeed running. Subsequently, the filewalker restarted with the expected log message.
So it appears it was running the whole time while the older log entries were being rotated away.
This brings back one of my suggestions:

It would be good if such long-running tasks emitted a log message while they are running, indicating what they are currently doing (like which date folder is currently being worked on).
This would make tracking progress much easier.

So, lazy mode should be switched off for this node; with a low I/O priority it cannot get the IOPS it needs from the disk. The filewalkers are also likely running in parallel, competing for the same disk.
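
If I recall the option name correctly, that is a single line in config.yaml (or the equivalent --pieces.enable-lazy-filewalker=false option after the image name in the docker run command):

pieces.enable-lazy-filewalker: false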

For that you likely need to switch to the debug log level.
You may switch the log level on the fly: Guide to debug my storage node, uplink, s3 gateway, satellite.

Or use /mon/funcs or /mon/stats to see the trash cleanup process metrics. You should also see it in /mon/ps if it’s currently running.

IMHO that would perfectly fit the definition for the INFO log level:

Log Levels Explained and How to Use Them | Better Stack Community

The INFO level captures events in the system that are significant to the application’s business purpose. Such events are logged to show that the system is operating normally.

Unfortunately, I think housekeeping is never considered a business purpose of the node, which should handle uploads and downloads and keep the customers’ data unmodified and not lost.

If you want details, you can always increase the log level, and unlike in many applications, you can do it on the fly. You may also use the debug port to track all housekeeping processes without increasing the log level. Of course you wouldn’t see a gauge or an ETA, but I guess those are hard to estimate anyway.
It’s also possible to use a custom log level for specific services; in that case you need to increase it only for the process in question.
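
The exact syntax here is from memory, so double-check it, but the idea is to raise the global level and quiet the noisy loggers with a per-logger override, e.g.

--log.level debug \
--log.custom-level piecestore=FATAL \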

I don’t agree with that. I think it is part of the overall purpose of the node. And why is there a log entry for the start of the trash-cleanup at INFO level then? I see the start, the progress and the stop all at INFO level. Checking whether such a process is currently running is not debugging; it is verifying that the software is doing what it should do.

I strongly disagree with printing progress at the INFO log level. Printing progress at INFO would mean we also need to do the same for all operations, like GET* and PUT*. This doesn’t make sense. There should be only start, finish, or failure.

The start and finish messages sound reasonable for the INFO log level. The progress of said filewalker does not. It should be at debug or even trace.

By the way, for tracing you may also use /debug/run/trace/db; it waits for the process to start, collects the info, and prints it. There are also /mon/trace/json and /mon/trace/svg.
The /mon/ps method tells you how long the process has currently been running; /mon/funcs or /mon/stats provide the stats.
You may check the article above for how to provide the parameters.
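
For example (the port and the regex parameter here are taken from the guide and my memory, so adjust them to your node):

# waits for the next call matching the regex, then writes the collected trace
curl -s "http://127.0.0.1:5999/mon/trace/svg?regex=Trash" -o trash-trace.svg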

Progress should not be printed at INFO; it’s too verbose. If you would like to have it, you have plenty of options: you may enable the debug or trace log level (including a custom log level for a specific process), or you may use the debug port (which is designed exactly for your request). Why should other nodes suffer?

No, because those are not the long-running processes I am talking about.

As stated, my logs covered only about 3 days. Given that, I would be happy with a log entry every 6 or 12 hours, maybe even only once a day, stating that it is still running.
I don’t see how that would be too verbose.

You may submit an idea here: Storage Node feature requests - voting, and let the Community decide.
I still believe that if one wants to see progress, they can always use the debug port to switch to the debug level, or use the implemented debug methods.

For me this is enough:

2024-08-19T14:08:14Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-08-19T14:17:53Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "Total Pieces Size": 134017501440, "Total Pieces Content Size": 133800878336}
2024-08-22T22:42:34Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-08-18T17:59:59Z", "Filter Size": 247340, "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-08-22T23:09:02Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 5080, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 423308, "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Duration": "26m27.2845381s", "Retain Status": "enabled"}
2024-08-22T12:26:22Z   INFO    pieces:trash    emptying trash started  {"Process":"storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-22T13:48:15Z   INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "elapsed": "1h21m52.6574297s"}

I may not have been clear enough. While using the debug port can be useful for closely following progress, it requires additional setup. However, that’s not my main point. I’m specifically thinking of situations where the logs don’t indicate that a long-running process is still running, which is what led to this thread. The INFO-level logging I’m referring to is not about logging the process’s activity in real time, but rather about providing a periodic “sign of life” to indicate that it’s still running, and nothing more.

Then again, you can solve it without a single line of code: redirect the logs to a file and keep them for at least a week.
Or configure the Docker log driver to keep more than the default 100MiB of logs (5 x 20MiB).
The problem is solved.
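
For example, one line in config.yaml (the path is just an example inside the container’s config mount; rotate the file on the host with logrotate or similar):

log.output: "/app/config/node.log"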

You haven’t convinced me, sorry. We must not spend developers’ time on what can be changed in the local setup and is already supported.
