Closed: [storagenode] Lazyfilewalker optimization · Issue #6708 · storj/storj · GitHub
So let’s see when it is available and if it is working.
But wait, is that even for the used-space filewalker?
There is merge-conflict for used space filewalker
Then lazyfilewalker optimization for storage node used spaced calculation · Issue #6900 · storj/storj · GitHub should be the relevant one for the used-space filewalker, right?
Yes, and it's under Code Review
The last one storagenode/{pieces,blobstore}: save-state-resume feature for GC file… · storj/storj@2a8d0d4 · GitHub should be in 1.101.x
And it seems that it is progressing:
https://review.dev.storj.io/c/storj/storj/+/12806?tab=comments
Overall it looks good to me.
That would make things better with frequent restarts. Even if one round is finished, we may not need to re-calculate the usage if it was calculated recently…
Why is there no setting to control the frequency?
I see the following cases:
Frequent restarts by SNO: If a node operator is testing a setting or something like that and restarts the node several times in a short period, I do not see the need for the filewalker to run each time.
This could be handled by a delayed start option. But the biggest pain point was indeed that it would start from the beginning every time, and this should go away. At least it would resume and come to an end at some point.
Errors: When there is an error during filewalking, it is questionable whether it should restart. Chances are that the error repeats. However, it would depend on the kind of error. If it is simply something like memory starvation or a slow disk, I don't see a reason why it should not restart and resume calculating. As it would not start from the beginning even if it fails again, it should move forward and come to an end at some point as well.
Successful completion: When it has completed successfully, there is normally no urgent need to run it again soon. AFAIK once we have current data, it gets cached and updated by the actual up- and downloads. So a configurable setting for when to run again after the last successful completion would help prevent unnecessary filewalker runs. Maybe it would be enough to run it only once a week if we have actual data from a previous run.
and this change is part of the fix for this situation.
that part is easy - just restart the potato node.
that part I didn’t get. Why?
Thinking about it, how exactly will this work? I am not sure if this solves it.
Two pain points I was thinking of:
Does the used-space filewalker need to run immediately after a restart? As there are other processes consuming IOPS on start as well, the used-space filewalker does not seem to be the most urgent one to run on start-up. So starting it could be delayed by a time set by the SNO. This way, if the SNO restarts the node within that timeframe, the filewalker would not start.
Second was that it would start from the beginning every time. This gets solved indeed. But after it has finished and I restart the node, it would start again from scratch, right? So let's say it was running for 3 days and finally all data is up to date, and when I restart the node on day 4 to increase storage space, then the filewalker starts again despite not really being required to run again, right?
I believe I had observed that the filewalker restarts after some time if it has not completed successfully. This is why I have turned it off on some nodes. So a node restart is not required. But if it keeps running into the same issue over and over without moving forward, then it is a waste of resources and better to stop. This could be done with a maximum retry setting or something.
Let me try to explain.
First, with such a periodic run I would be able to catch errors when other processes have issues updating the space, like we are seeing currently with the trash.
Second, it could also reduce the number of used-space filewalker runs altogether. Because either I can turn them off on restart and rely on the periodic runs, or it runs on startup only if the set period has passed. Currently it runs on every startup (when run on startup is set to true).
For example:
Currently, it will run at minimum approximately every 2 weeks, after each update.
When I turn it off, it will never run.
Now I set the period to 3 weeks. I set run on startup to true; it runs once and completes successfully. When I restart 3 days later and again 1 day later, it won't run because 3 weeks since the last successful run have not passed yet. Even on the next update it will not run, because 3 weeks have not passed yet. Depending on the implementation, it could be made to run after 3 weeks automatically, or with the next restart after 3 weeks have passed, so it would re-run after the next update at the latest.
Other example:
I turn off run on startup but set the period to 2 weeks. Then I can restart the node as many times as I want; it would not restart the used-space filewalker. Even on update, no re-run.
But 2 weeks after the last successful finish, it would run.
That was my idea. I hope it is clearer now.
If you check your logs, you will see that it will start not immediately but with some delay.
$ cat /mnt/x/storagenode2/storagenode.log | grep "Node " | tail
2024-05-01T10:01:01Z INFO Node 1HhG6tE58gGMUXch83b71p63JEyewy6haL5npqEgGQKwpdEhxo started {"Process": "storagenode"}
$ cat /mnt/x/storagenode2/storagenode.log | grep "used-space" | grep "started" | tail
2024-05-01T10:01:08Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode"}
2024-05-01T22:00:05Z INFO lazyfilewalker.used-space-filewalker subprocess started {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-05-01T22:00:08Z INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Process": "storagenode"}
2024-05-01T22:00:08Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Process": "storagenode"}
2024-05-02T08:10:12Z INFO lazyfilewalker.used-space-filewalker subprocess started {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-05-02T08:10:13Z INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode"}
2024-05-02T08:10:13Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Process": "storagenode"}
2024-05-08T11:45:50Z INFO lazyfilewalker.used-space-filewalker subprocess started {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-05-08T11:45:50Z INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Process": "storagenode"}
2024-05-08T11:45:50Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Process": "storagenode"}
you changed the storage space, so it must start from scratch.
These periodic runs will be full like this one, so it will make things worse. Ideally it's not needed, if all operations update the databases in time and without errors. The startup one should just fix issues if you have third-party usage on the same disk, or if there were errors and the databases are not updated, or if you changed the storage as in your example above.
If you noticed an issue you always can just restart the node to start the used-space scan again.
2024-05-01T10:01:01Z INFO Node 1HhG6tE58gGMUXch83b71p63JEyewy6haL5npqEgGQKwpdEhxo started
2024-05-01T10:01:08Z INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
What I see is that the node started at 10:01:01 and the used-space filewalker for satellite 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE started 7 seconds later.
The other satellites appear to start consecutively. From my logs I have seen that they start one after another, once the previous run has either successfully completed or failed.
So no, I don't see a delay here.
When I increase the assigned space, a full used-space filewalker needs to run? Why?
Of course. This is the idea behind it.
I agree, but it runs anyway after each update by default. So I don't see how periodic runs at times set by the SNO could be worse than fixed automatic runs after each update.
It looks like https://review.dev.storj.io/c/storj/storj/+/12806 did not make it into v1.104.1. That's sad, and my used-space filewalker has to remain switched off.
Also to make sure that nothing has changed. The storagenode cannot know your intention. But you can disable this scan on start if you know that the DBs are updated.
That would be my favorite thing to do, together with an automated periodic full run once every month… Either run anyway after a month has passed, or run with the next startup after a month, e.g. after the update.
It is the same as running the trash-cleanup-filewalker and the used-space-filewalker in parallel for the same satellite on node restart…
Whose idea was that?
First run the deletion, and then run the used-space calculation.
It is as if no one is thinking of any dependencies.
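The ordering argued for above (per satellite: delete trash first, then recalculate usage) is easy to express. A sketch, not the actual storagenode scheduling code, which currently launches these independently:

```go
package main

import "fmt"

// runInOrder runs the trash cleanup first and the used-space calculation
// second, so the usage scan never walks over pieces that are about to be
// deleted and the two walkers don't compete for IOPS.
func runInOrder(trashCleanup, usedSpace func() error) error {
	if err := trashCleanup(); err != nil {
		return err
	}
	return usedSpace()
}

func main() {
	var order []string
	_ = runInOrder(
		func() error { order = append(order, "trash-cleanup"); return nil },
		func() error { order = append(order, "used-space"); return nil },
	)
	fmt.Println(order) // prints "[trash-cleanup used-space]"
}
```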
$ grep "trash-cleanup-filewalker" /mnt/x/storagenode2/storagenode.log | head
2024-05-01T10:00:31Z INFO lazyfilewalker.trash-cleanup-filewalker subprocess exited with status {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "status": -1, "error": "signal: killed"}
2024-05-01T10:01:01Z INFO lazyfilewalker.trash-cleanup-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-05-01T10:01:01Z INFO lazyfilewalker.trash-cleanup-filewalker subprocess started {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-05-01T10:01:01Z INFO lazyfilewalker.trash-cleanup-filewalker.subprocess trash-filewalker started {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Process": "storagenode", "dateBefore": "2024-04-24T10:01:01Z"}
2024-05-01T10:01:02Z INFO lazyfilewalker.trash-cleanup-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Process": "storagenode"}
2024-05-01T12:29:11Z INFO lazyfilewalker.trash-cleanup-filewalker.subprocess trash-filewalker completed {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Process": "storagenode", "bytesDeleted": 28692042240, "numKeysDeleted": 103369}
2024-05-01T12:29:11Z INFO lazyfilewalker.trash-cleanup-filewalker subprocess finished successfully {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-05-01T12:30:54Z INFO lazyfilewalker.trash-cleanup-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-05-01T12:30:54Z INFO lazyfilewalker.trash-cleanup-filewalker subprocess started {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-05-01T12:30:54Z INFO lazyfilewalker.trash-cleanup-filewalker.subprocess trash-filewalker started {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode", "dateBefore": "2024-04-24T12:30:54Z"}
You are right.
Thank you for confirming this. I did not check in my logs yet.
So yes, this is why I questioned why it has to start immediately at startup and suggested adding a delay that the SNO can set.
Maybe something between 0 min (start immediately) and 59 minutes, or 6 hours, or something?
Or completely up to him (but then it could break dependencies that I am not aware of).
Here is another one where it appears it was interrupted during processing of the bloom filter file.
The trash date folder only has a single aa subfolder:
ls /storage/trash/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/2024-05-16
aa
This suggests that a bloom filter run started or resumed yesterday, on the 16th. But it only processed the aa folder; no other subfolders are showing.
So I assume the process got interrupted during processing and would expect the bloom filter file to be present in the retain folder.
But it is empty.
ls /config/retain
So something seems to delete the bloom filter file even if the run has not successfully completed, so resuming naturally fails.
Was your node restarted while processing the bloom filter? On my nodes, I sometimes encounter the situation with the bloom filter being deleted when the nodes restart.
I am assuming a restart has happened, yes.