If you disabled lazy mode, the GC filewalker still doesn’t show log entries; only the startup piecescan does.
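For reference, lazy mode is disabled with this line in config.yaml (assuming the option name hasn’t changed in 1.108.x):
pieces.enable-lazy-filewalker: false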
Small update: a node that was running with the badger cache processed two SLC GC rounds while the used-space filewalker was running on US1 (I had to restart once; the used-space filewalker never started on any satellite other than US1).
Round 1: Trashed 8M pieces in 29 hours. Walked 30M pieces.
2024-07-09T12:18:13-07:00 INFO retain Prepared to run a Retain request. {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-06-30T14:28:54Z", "Filter Size": 15467040, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-10T18:00:30-07:00 INFO retain Moved pieces to trash during retain {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 7987696, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 30533629, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "29h42m16.736705678s", "Retain Status": "enabled"}
Round 2: Trashed 800k pieces in 7 hours. Walked 22M pieces.
2024-07-11T12:03:27-07:00 INFO retain Prepared to run a Retain request. {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-06-30T14:28:54Z", "Filter Size": 15457309, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-11T19:20:41-07:00 INFO retain Moved pieces to trash during retain {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 815139, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 22364200, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "7h17m13.536466067s", "Retain Status": "enabled"}
As the number of trashed pieces is different, I’m not sure if these figures are comparable.
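A rough back-of-the-envelope comparison from the durations above: round 1 walked 30,533,629 pieces in about 29.7 hours (≈1.0M pieces/hour), while round 2 walked 22,364,200 pieces in about 7.3 hours (≈3.1M pieces/hour), so the second pass was roughly 3x faster per walked piece. How much of that comes from the badger cache versus the much smaller amount of trashing, I can’t say.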
This node is about 12 TB on XFS, and so far the filestatcache directory is 1.2 GB, with the used-space filewalker still to finish.
Will update again once it’s done.
Side concern: the filestatcache is generating a lot of .sst files; is this expected?
Unfortunately it’s not a 1.107.x problem; it has been like this for a long time. The reason is the rewritten TTL data: the pieces get collected by a GC before their TTL expires, and now the TTL collector is complaining.
Perhaps we need to update the TTL database when pieces with a TTL are sent to the trash.
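To illustrate what I mean, here is a minimal, purely hypothetical sketch in Go. It assumes a SQLite piece_expiration.db with a piece_expirations table keyed by satellite_id and piece_id; those table and column names are my assumption for illustration, not necessarily the real storagenode schema.

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

// dropTrashedTTL removes the TTL record for a piece that garbage collection
// has already moved to trash, so the TTL collector no longer tries to delete it.
// Table and column names are assumptions for illustration only.
func dropTrashedTTL(db *sql.DB, satelliteID, pieceID []byte) error {
	_, err := db.Exec(
		`DELETE FROM piece_expirations WHERE satellite_id = ? AND piece_id = ?`,
		satelliteID, pieceID,
	)
	return err
}

func main() {
	// Hypothetical standalone example; in the real storagenode this would be
	// called from the retain/GC path right after a TTL piece is trashed.
	db, err := sql.Open("sqlite3", "piece_expiration.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := dropTrashedTTL(db, []byte("example-satellite-id"), []byte("example-piece-id")); err != nil {
		log.Fatal(err)
	}
}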
Even if I agree with you, this would mean implementing a workaround in the code specifically for SLC. That sounds about as bad as it gets.
= the disk is slow.
The current workaround is to move the databases to an SSD.
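For example (the path here is just a placeholder; if I remember correctly, storage2.database-dir is the option that relocates the databases):
storage2.database-dir: /mnt/ssd/storagenode/dbs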
It’s unlikely to be related to any particular version; more likely your HDDs are now busy enough that they don’t have time for the database updates.
It also likely depends on when the TTL data should be removed. Or do you mean that you downgraded the same storagenode (not recommended, since 1.107.x and later include database migrations, so it’s unlikely you can downgrade at all) and do not see the same error?
What’s fixed is not a filewalker but the database updates, though not for all filewalkers (the TTL collector didn’t update the databases on versions before 1.108.x, and the trash filewalker didn’t update them on versions before 1.107.x)…
It could be a reason for the slow filewalker process; XFS is somehow slow at that, at least judging from reports of other SNOs: Topics tagged xfs.
perhaps yes?
So today we already have the possibility of not sending to trash at all; it just needs to be made configurable with options.
This is another feature that can destroy the entire network. We have had to call restore from trash more than once in the past. So the trash folder is important for the resilience of the network. We can’t just give up on that. Not even on SLC. There are way too many code changes in that area that need the protection of restore from trash. And we also want to collect data about the amount of repair. So no, we can’t take a shortcut on SLC.
The databases are already on NVMe, so they should not be slow.
I don’t downgrade versions. The first time I saw this error was with v1.108.
So this error should not lead to disqualification?
"error": "pieceexpirationdb: context canceled"
Same here, all databases on SSD.
I agree with your arguments, but why does this feature exist?
pieces.delete-to-trash: false
Try it out. It makes no difference anymore.
I think it was deprecated intentionally, as it is a dangerous feature.
Yeah, kind of. It was deprecated for other reasons. The full story is here: Design draft: fixing deletes and server side copy
Let’s say this flag only works if the satellite sends you delete messages. On the satellite side that is just a feature flag and can be turned on and off, but we would have to disable/enable server-side copy accordingly, since it doesn’t work in combination with delete messages. Which basically means there is no way to go back in time on the current satellites. It can still be useful for Storj private cloud setups or any other brand-new satellite that never had server-side copy enabled.
Can the badger cache save and resume progress? I mean, during the first cold used-space scan, will I lose progress if I restart the node?
Caching is separate from saving filewalker progress, although if you restart the filewalker and lose progress, the cache should speed up the rerun. Saving used-space progress is not implemented yet.
While trying to set up a new node on the QA satellite (test network), I accidentally upgraded a regular node to v1.108.3, so I figured I’d try out this new badger cache on it. I added the following line to config.yaml and restarted it:
pieces.file-stat-cache: badger
I don’t know if that’s the correct procedure for using this new feature, but it seemed to be working OK at first. After it finished the first filewalker, I restarted the node. Before long it went into a loop of errors and restarts. The error in the log file is:
2024-07-12T21:00:43Z ERROR failure during run {"Process": "storagenode", "error": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
2024-07-12T21:00:43Z FATAL Unrecoverable error {"Process": "storagenode", "error": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
I commented out the badger line in my config file, restarted the node, and it’s working normally again. What caused this? User error? A software bug? Hardware problems? I don’t know, but I’m going to leave this feature turned off until I learn more about it. Just sharing my experience with the community. Thanks.
It seems it is not a stable feature (yet):
No, but the pieces may be deleted by a garbage collector instead of the TTL collector.
Is it possible that you have another error somewhere earlier, like a FATAL error? Maybe you have Unrecoverable errors in your logs?