Bloom filter (strange behavior)

I couldn’t reproduce the problem, and based on the code it should only be possible if the process fails after the resume. Could you please check the logs for related errors?

Did you try to reproduce it on Windows 10?
So there is no problem? We just imagined it? :grinning:

Unfortunately, there are no other errors in the log, just what I already posted.

On one node I currently see two files in the retain folder: one named ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa-something and the other v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa-something.

I don’t know what the node is currently doing but should I restart it and see if these files get deleted?

Restart the node and see whether the files remain in the /retain folder or not.
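For example, with a Docker node (a sketch only; the container name storagenode and the mount path are placeholders, adjust to your setup):

docker restart -t 300 storagenode
ls -l /mnt/storagenode/storage/retain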

OK, I have restarted it. The files are still there after the restart.
Both files are dated 06/05.
So I don’t know whether they have already been processed, are waiting to be processed, or are being processed right now.

Which version of Windows are you using?

The nodes are on Linux.

:grin: :grin: :grin:
The problem only affects WINDOWS!

I recorded a video of how this happens!
Link to YouTube


You should keep the log level at info, with a custom log level for the noisy subsystems. Then you will see the start and finish of the retain process.
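For example, a minimal config.yaml sketch (log.custom-level is only available in recent node versions, and the subsystem names below are just illustrations, not a recommendation):

log.level: info
# optional: quieten noisy subsystems while retain stays at info
log.custom-level: piecestore=WARN,collector=WARN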

I have log.level: info

When I start a node with a bloom filter file in the /retain folder:

INFO retain Prepared to run a Retain request. {“cachePath”: “X:\!STORJ913\WORKDIR/retain”, “Created Before”: “2024-04-30T20:59:59+03:00”, “Filter Size”: 3300354, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”}

The node runs for 5 minutes. Then I stop the node. (net stop storagenode):

INFO Stop/Shutdown request received.

INFO lazyfilewalker.gc-filewalker subprocess exited with status {“satelliteID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “status”: 1, “error”: “exit status 1”}

ERROR pieces lazyfilewalker failed {“error”: “lazyfilewalker: exit status 1”, “errorVerbose”: “lazyfilewalker: exit status 1\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkSatellitePiecesToTrash:160\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:578\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}

ERROR filewalker failed to get progress from database {“error”: “gc_filewalker_progress_db: context canceled”, “errorVerbose”: “gc_filewalker_progress_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*gcFilewalkerProgressDB).Get:47\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:154\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:585\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}

ERROR filewalker failed to reset progress in database {“error”: “gc_filewalker_progress_db: context canceled”, “errorVerbose”: “gc_filewalker_progress_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*gcFilewalkerProgressDB).Reset:58\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash.func1:171\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:244\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:585\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}

ERROR retain retain pieces failed {“cachePath”: “X:\!STORJ913\WORKDIR/retain”, “error”: “retain: filewalker: context canceled”, “errorVerbose”: “retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:178\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:585\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}

The bloom filter file is no longer in the /retain folder


I see now. I believe on Linux/Docker it is saved and resumed at restart. I didn’t test it, but I saw the process start and finish twice, and the BF was still in the retain folder, just smaller. I guess the node restarted for an update or something and split the BF, removing the part that had already been processed. Maybe I’m wrong, but this is what I remember.

I do not think splitting is implemented; more likely you received a second one for the same satellite. The suffix is a Unix epoch timestamp, so it will be different.
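As a side note, and assuming the suffix really is a plain decimal Unix timestamp, you can decode it on Linux with the coreutils date command:

date -u -d @1714499999
# decodes to 2024-04-30 17:59:59 UTC for this example value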

Why is there absolutely no reaction to or solution for this problem? The problem has not gone away: the bloom filter file is still deleted automatically when the node is stopped or restarted on Windows 10/11. My 46 nodes can’t all be a fluke; the error appears on every one of them! I provided logs and a video showing how the error manifests. What else can I do? Or are the developers okay with this bug?
I have pointed out many times that the problem only appears on Windows nodes! But for some reason everyone checks it on Linux nodes, where there is NO problem! Is it really impossible for a developer to check it on a Windows node? It has been a long time since I posted about the problem, but nothing happens.
All nodes are on 104.1 for now.


No, something is not right on Linux either.
I just had 3 bloom filters deleted instead of at least 2 of them being processed.


@pdeline06 - we care, we’re just confused and so far unable to reproduce. It’s not obvious what to fix because we don’t understand what is wrong. Thank you for all of the information you’ve provided so far. We’ll keep trying to reproduce, but again, all of the debug logging and information you can provide is helpful, as is everyone else in this thread trying to understand what circumstances trigger this.

One thing that might be interesting - if you notice a bloom filter exists, can you mark it read only in your filesystem, and then see what process fails or what new logs show up when something tries to delete it?
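For example, from an elevated command prompt on Windows (a sketch only; the path and file name below are placeholders for the actual bloom filter in your retain folder):

rem set the plain read-only attribute on the file
attrib +R "X:\storagenode\retain\BLOOMFILTERFILE"
rem and/or explicitly deny the Delete permission for everyone
icacls "X:\storagenode\retain\BLOOMFILTERFILE" /deny Everyone:(DE)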


Thanks for the answer.
As soon as the nodes receive a bloom filter file, I will do this.


I found a previously saved bloom filter file on one node and tried to play with it.
I copied it to the /retain folder.
Next, in the file’s security settings, I changed the permissions:

  • DENIED - ALL - DELETE
  • ALLOW - ALL - READ

To check, I tried to delete the file myself; it didn’t work, it is forbidden.
Then I started the storagenode service.
The node started. A line appeared in the log:

INFO retain Prepared to run a Retain request. {“cachePath”: “X:\!STORJ913\WORKDIR/retain”, “Created Before”: “2024-04-30T20:59:59+03:00”, “Filter Size”: 3300354, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”}

The node began collecting garbage.
After 5 minutes I stopped the storagenode service.
As soon as the storagenode service was stopped, the bloom filter file was deleted from the /retain folder.

Despite all the security attributes previously set on the file, it was deleted! I’m shocked!

The following lines appeared in the log:

INFO Stop/Shutdown request received.

ERROR filewalker failed to reset progress in database {“error”: “gc_filewalker_progress_db: context canceled”, “errorVerbose”: “gc_filewalker_progress_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*gcFilewalkerProgressDB).Reset:58\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash.func1:171\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:244\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:565\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}

ERROR retain retain pieces failed {“cachePath”: “X:\!STORJ913\WORKDIR/retain”, “error”: “retain: filewalker: context canceled”, “errorVerbose”: “retain: filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash:178\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash:565\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:369\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:258\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}

I do not know what else to say. I don’t see any more entries in the log.
What deletes the bloom filter file? Obviously the storagenode service when it is stopped; nothing else touches this file. But how was it deleted if the file’s security attributes prohibit it??? Perhaps the storagenode service changes the attributes and then deletes it? I don’t understand. There is no logical explanation!

Please note the last error in the log:

ERROR retain retain pieces failed {“cachePath”: “X:\!STORJ913\WORKDIR/retain”…

Could the non-default path to the /retain folder be the reason? None of my paths are the defaults, but they are all set in Config.yml.

ERROR filewalker failed to reset progress in database {“error”: "gc_filewalker_progress_db…

What does filewalker have to do with it? My settings:
storage2.piece-scan-on-startup: false
pieces.enable-lazy-filewalker: false

My head is exploding because I can’t work out the cause and effect!
I’m ready to provide remote access to the node; maybe that will help.

Does your node delete any file from this retain folder?
Have you tried putting, say, three random text files in there, then starting and stopping the node to see what happens?
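For example, on Windows (a sketch only; the path is a placeholder for your actual retain folder):

rem drop three dummy files into the retain folder
echo test> "X:\storagenode\retain\test1.txt"
echo test> "X:\storagenode\retain\test2.txt"
echo test> "X:\storagenode\retain\test3.txt"
rem restart the service, then check whether the dummy files survived
net stop storagenode
net start storagenode
dir "X:\storagenode\retain"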