Bloom filter (strange behavior)

I noticed strange behavior with the bloom filter file ON WINDOWS 10!
When a node receives a bloom filter, it saves it to disk in the /retain folder as a file with the name of the satellite (ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-1714067999999645000).
Then garbage collection starts.
A folder named after the current date appears in /trash, and garbage collection begins filling it.
BUT!
If the node is restarted for any reason during garbage collection, the bloom filter file is deleted from the /retain folder, and after the restart garbage is no longer collected.

The question arises: why save the bloom filter file to disk at all, if it is deleted after a restart and the garbage collection process stops? That makes things exactly the same as they were before the bloom filter was ever saved to disk.
Why is the previously received bloom filter file no longer present after the node restarts, and why does garbage collection not continue?

It does not sound like it should get deleted if GC has not finished: GC bloom filter should be stored on disk in case of node restart · Issue #6725 · storj/storj · GitHub

  • bloom filter should be removed after successful GC
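
For illustration, here is a minimal Go sketch of the behavior that issue describes (this is not the actual storagenode code; the names and paths are made up): the cached filter is read from the retain folder, and the file is removed only when the walk finishes without error, so a restart in the middle leaves it in place for the next run.

package main

import (
	"context"
	"log"
	"os"
	"path/filepath"
)

// retainWithCachedFilter runs one retain pass against a bloom filter that was
// previously saved to the retain directory. The cached file is removed only
// when the walk finishes successfully; on any error (including a shutdown that
// cancels ctx) it is kept, so the next start can pick garbage collection up again.
func retainWithCachedFilter(ctx context.Context, retainDir, name string, walk func(context.Context, []byte) error) error {
	path := filepath.Join(retainDir, name)

	filter, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if err := walk(ctx, filter); err != nil {
		// Keep the cached filter so garbage collection can continue later.
		return err
	}
	// Successful GC: only now is the filter no longer needed.
	return os.Remove(path)
}

func main() {
	dir, _ := os.MkdirTemp("", "retain")
	name := "example-satellite-1714067999999645000" // hypothetical filter file name
	_ = os.WriteFile(filepath.Join(dir, name), []byte("filter bytes"), 0o600)

	// Simulate a restart: the context is cancelled before the walk finishes.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	err := retainWithCachedFilter(ctx, dir, name, func(ctx context.Context, _ []byte) error {
		return ctx.Err() // "context canceled", just like during a node shutdown
	})
	log.Println("walk result:", err)

	if _, statErr := os.Stat(filepath.Join(dir, name)); statErr == nil {
		log.Println("filter is still on disk, GC can continue after the restart")
	}
}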

What version are you on?

v1.102.4
WINDOWS NODE!

But the same thing happened on version 1.101.
Everyone can check this for themselves.
Look at your bloom filter file in the /retain folder.
If there is a file there, then restart the node and you will see that it is immediately deleted.

I specifically tested this on 12 nodes. The behavior is the same everywhere: the bloom filter file is deleted when the node is restarted.

I tried to test it, but I couldn’t reproduce it. It’s not immediately deleted, but after the end of the deletion walk.

Can you please check the log, if the walker just ended shortly after the restart?

It’s also deleted in case of any errors.

I have these options configured:
storage2.piece-scan-on-startup: false
pieces.enable-lazy-filewalker: false

Could this be the reason?

Unfortunately, I don’t remember which nodes I tested this on, so I can’t find the information in the logs now. I will definitely pay attention when working with the next bloom filter file.

Please pay attention to this message:
https://forum.storj.io/t/current-situation-with-garbage-collection/25711/315?u=pdeline06
It looks like the same thing happened there that I’m writing about: the filter file is not preserved when the node is restarted.

Shouldn’t be a problem. The first one is related to the calculation of the used space.

The second one makes all the walkers run in the root process (although I actually used the same settings, as I started the storagenode from my IDE and debugged it).

I am not sure about the node version in the other linked problem, but I will definitely keep my eyes on it, and if there is more information, let me know… ready to debug further…

I can confirm that the filter was on my node and was successfully deleted after processing. Before:

After:

ls X:\storagenode2\retain\
(empty)

2024-04-29T17:58:12Z    INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 1166814, "Failed to delete": 1, "Pieces failed to read": 0, "Pieces count": 12014480, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "56h12m44.0389294s", "Retain Status": "enabled"}

The situation repeated itself. Today I specifically checked for the presence of the bloom filter file. All nodes had a 10-megabyte bloom filter file and were in the process of garbage collection. I deliberately shut down all 46 nodes cleanly. After restarting them, not a single node had a bloom filter file. The /retain folder was empty!
The log file on all nodes has the same information:

INFO Got a signal from the OS: "terminated"
ERROR filewalker failed to reset progress in database {"error": "gc_filewalker_progress_db: context canceled", "errorVerbose": "gc_filewalker_progress_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*gcFilewalkerProgressDB).Reset:58\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash.func1:171\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:244\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:585\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
ERROR retain retain pieces failed {"cachePath": "X:\!STORJ915\WORKDIR/retain", "error": "retain: filewalker: context canceled", "errorVerbose": "retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:178\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:585\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
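
Reading those two errors: the "terminated" signal cancels the service context, so the filewalker and the progress-DB reset both fail with "context canceled". If the error path then also removes the cached bloom filter (this is only a guess about the cause, not something confirmed from the code), the file would disappear exactly as observed. A simplified Go illustration of that shutdown path, with invented names:

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// walkSatellitePiecesToTrash stands in for the real filewalker; between pieces
// it only checks whether the service context is still alive.
func walkSatellitePiecesToTrash(ctx context.Context) error {
	for i := 0; i < 1000000; i++ {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("filewalker: %w", err) // "context canceled" on shutdown
		}
		// ... check one piece against the bloom filter, maybe move it to trash ...
	}
	return nil
}

func main() {
	// The service context is cancelled when the OS sends "terminated".
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	if err := walkSatellitePiecesToTrash(ctx); err != nil {
		// Suspected problem: treating every error the same way. Deleting the cached
		// bloom filter here, even when err is just ctx.Err() from a normal restart,
		// would reproduce the behavior reported in this thread.
		fmt.Println("retain pieces failed:", err)
		// os.Remove(filepath.Join(retainDir, filterName)) // should NOT happen on ctx.Err()
	}
}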

ALL NODES ON WINDOWS 10! Version 1.102.4

Help is needed! This is not normal!

How do you restart your node?

  1. Using Restart button on your cabinet/motherboard.
  2. Using restart button under start menu
  3. Using docker stop -t 300 storagenode

Using the restart (shut down) button under the start menu.
I do not use docker.

This is the normal restart method for Windows nodes and should work flawlessly. I’ve done it many times when I had a Windows node. Even on Linux/Docker, the restart command and shutdown -r work and don’t cause problems for the node or the DBs. I believe a normal restart takes that 5-minute delay into account and doesn’t kill the node instantly.
There is no Restart button on a case/MB. It’s the Reset button. :grin:

So, you did a restart from the Services applet, right?
And all bloom filters have been deleted from the drive?
Without a successful completion of a retain process in your logs?

Use the common cmd commands: shutdown -s, shutdown -r. That’s the normal reboot/shutdown way for Windows nodes.

Yes.
In the same way, the bloom filter file is deleted if you restart the storagenode service (net stop storagenode / net start storagenode).
When the service stops, the bloom filter file disappears from the /retain folder.

Exactly!

I think it’s not the expected behavior. The filter should be removed only on successful completion of the retain process, not on restart. The whole point is to keep it across restarts, to continue the process from the stop point, not from scratch.
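
The gc_filewalker_progress_db mentioned in the error logs above suggests the node already records how far the walk got. Conceptually, resuming after a restart instead of starting from scratch could look like this sketch (invented names, not the actual storagenode implementation): the saved position is cleared only once every prefix has been processed.

package main

import "fmt"

// progress is a stand-in for the gc_filewalker_progress_db from the logs:
// it remembers the last prefix that was fully processed for each satellite.
type progress map[string]string

// resumeRetain continues the walk from the saved position instead of starting
// over, and clears the position only when every prefix has been processed.
func resumeRetain(p progress, satellite string, prefixes []string, process func(prefix string) error) error {
	resumeFrom := p[satellite]
	for _, prefix := range prefixes {
		if resumeFrom != "" && prefix <= resumeFrom {
			continue // this prefix was already finished before the restart
		}
		if err := process(prefix); err != nil {
			return err // position stays at the last finished prefix
		}
		p[satellite] = prefix // remember the last fully processed prefix
	}
	delete(p, satellite) // finished: reset progress and only now drop the filter
	return nil
}

func main() {
	p := progress{"us1": "ab"} // the restart happened after prefix "ab"
	prefixes := []string{"aa", "ab", "ac", "ad"}
	_ = resumeRetain(p, "us1", prefixes, func(prefix string) error {
		fmt.Println("processing", prefix)
		return nil
	})
}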

That’s why I wrote the topic here. Maybe there is a problem on Windows nodes

Thank you, I passed this information to the team.

Can you re-send the bloom filter (ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa) for my 23 oldest nodes? I don’t want to miss out on the opportunity to collect garbage with a 10-megabyte bloom filter. I can write the node IDs.

I believe they are sent weekly and they will be 10 MB from now on. If you missed one, no problem. You will get the next one, which can do approximately the same job.

I believe I am seeing the same, but I do not have logs as the nodes have been restarted.

But what I do see is that the retain folder is empty and there seems to have been an interruption in the garbage collection process, as there is only one single folder in trash for the US-1 satellite: /storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-05-03/aa

I cannot verify it, but it looks strange. I have more folders in the other satellites’ trash folders, but also not many. It appears that the process of collecting garbage got interrupted and did not resume due to the missing bloom filter file on disk.

Maybe it would be an idea not to delete the bloomfilter file but to move it to trash instead?

Edit: I have the same on another node:

/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-05-03/aa
/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2024-05-04/aa

Same satellite folder, but a different bloom filter, it seems. Both have only the aa folder in the trash. This does not look correct. The retain folder is empty.

Yes, you are thinking correctly. As soon as the bloom filter file disappears, the garbage collection process stops, and the folder “/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/DATE/…” stops filling.
That folder contains only what the garbage collector managed to collect before the process was interrupted.
I noticed that if this folder contains all 1024 two-letter folders, then garbage collection for that satellite is complete. If not, the process was interrupted.
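
A quick way to apply that check (the path below is only an example; point it at your own trash/<satellite>/<date> folder) is to count the two-letter subfolders of the date folder and compare against 1024:

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// Example path; adjust it to your own node's trash/<satellite>/<date> folder.
	dir := `X:\storagenode2\storage\trash\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\2024-05-03`

	entries, err := os.ReadDir(dir)
	if err != nil {
		log.Fatal(err)
	}

	count := 0
	for _, e := range entries {
		if e.IsDir() && len(e.Name()) == 2 {
			count++
		}
	}
	fmt.Printf("%d of 1024 prefix folders present\n", count)
	if count < 1024 {
		fmt.Println("garbage collection for this satellite probably did not finish")
	}
}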
