If you disabled lazy mode, the GC filewalker still doesn’t show log entries; only the startup piecescan does.
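For reference, lazy mode is disabled with this line in config.yaml (assuming the option name hasn’t changed in 1.108.x):
pieces.enable-lazy-filewalker: false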
Small update: a node that was running with the badger cache processed two SLC GC rounds while the used-space filewalker was running on US1 (I had to restart once; the used-space filewalker never started on any satellite other than US1).
Round 1: Trashed 8M pieces in 29 hours. Walked 30M pieces.
2024-07-09T12:18:13-07:00 INFO retain Prepared to run a Retain request. {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-06-30T14:28:54Z", "Filter Size": 15467040, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-10T18:00:30-07:00 INFO retain Moved pieces to trash during retain {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 7987696, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 30533629, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "29h42m16.736705678s", "Retain Status": "enabled"}
Round 2: Trashed 800k pieces in 7 hours. Walked 22M pieces.
2024-07-11T12:03:27-07:00 INFO retain Prepared to run a Retain request. {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-06-30T14:28:54Z", "Filter Size": 15457309, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-11T19:20:41-07:00 INFO retain Moved pieces to trash during retain {"Process": "storagenode", "cachePath": "config/retain", "Deleted pieces": 815139, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 22364200, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "7h17m13.536466067s", "Retain Status": "enabled"}
As the number of trashed pieces is different, I’m not sure if these figures are comparable.
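A rough back-of-the-envelope comparison from the durations above: round 1 walked 30,533,629 pieces in about 29.7 hours (≈1.0M pieces/hour), while round 2 walked 22,364,200 pieces in about 7.3 hours (≈3.1M pieces/hour), so the second pass was roughly 3x faster per walked piece. How much of that comes from the badger cache versus the much smaller amount of trashing, I can’t say.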
This node is about 12 TB on XFS, and so far the filestatcache directory is 1.2 GB, with the used-space filewalker still to finish.
Will update again once it’s done.
Side concern: the filestatcache is generating a lot of .sst files; is this expected?
Unfortunately it’s not a 1.107.x problem; it has been like this for a long time. The reason is the rewritten TTL data: the pieces get collected by a GC before their TTL expires, and now the TTL collector is complaining.
Perhaps we need to update the TTL database when pieces with a TTL are sent to the trash.
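To illustrate what I mean, here is a minimal, purely hypothetical sketch in Go. It assumes a SQLite piece_expiration.db with a piece_expirations table keyed by satellite_id and piece_id; those table and column names are my assumption for illustration, not necessarily the real storagenode schema.

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

// dropTrashedTTL removes the TTL record for a piece that garbage collection
// has already moved to trash, so the TTL collector no longer tries to delete it.
// Table and column names are assumptions for illustration only.
func dropTrashedTTL(db *sql.DB, satelliteID, pieceID []byte) error {
	_, err := db.Exec(
		`DELETE FROM piece_expirations WHERE satellite_id = ? AND piece_id = ?`,
		satelliteID, pieceID,
	)
	return err
}

func main() {
	// Hypothetical standalone example; in the real storagenode this would be
	// called from the retain/GC path right after a TTL piece is trashed.
	db, err := sql.Open("sqlite3", "piece_expiration.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := dropTrashedTTL(db, []byte("example-satellite-id"), []byte("example-piece-id")); err != nil {
		log.Fatal(err)
	}
}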
Even if I agree with you, this would mean implementing a workaround in the code specifically for SLC. That sounds about as bad as it gets.
= the disk is slow.
The current workaround is to move the databases to an SSD.
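For example (the path here is just a placeholder; if I remember correctly, storage2.database-dir is the option that relocates the databases):
storage2.database-dir: /mnt/ssd/storagenode/dbs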
It’s unlikely to be related to any particular version; more likely your HDDs are now busy enough that they don’t have time for the database updates.
It also likely depends on when the TTL data should be removed. Or do you mean that you downgraded the same storagenode (not recommended, since 1.107.x and later include database migrations, so it’s unlikely you can downgrade at all) and do not see the same error?
What’s fixed is not a filewalker but the database updates, though not for all filewalkers (the TTL collector didn’t update the databases on versions before 1.108.x, and the trash filewalker didn’t update them on versions before 1.107.x)…
It could be a reason for the slow filewalker process; XFS is somehow slow at that, at least judging from reports of other SNOs: Topics tagged xfs.
perhaps yes?
So today we already have the possibility of not sending to trash at all; it just needs to be made configurable with options.
This is another feature that can destroy the entire network. We have had to call restore from trash more than once in the past. So the trash folder is important for the resilience of the network. We can’t just give up on that. Not even on SLC. There are way too many code changes in that area that need the protection of restore from trash. And we also want to collect data about the amount of repair. So no, we can’t take a shortcut on SLC.
The databases are already on NVMe, so they should not be slow.
I don’t downgrade versions. The first time I saw this error was with v1.108.
So this error should not lead to disqualification?
"error": "pieceexpirationdb: context canceled"
Same here, all databases on SSD.
I agree with your arguments, but why does this feature exist?
pieces.delete-to-trash: false
Try it out. It makes no difference anymore.
I think it was deprecated intentionally, as it is a dangerous feature.
Yeah, kind of. It was deprecated for other reasons. The full story is here: Design draft: fixing deletes and server side copy
Let’s say this flag only works if the satellite sends you delete messages. On the satellite side that is just a feature flag and can be turned on and off, but we would have to disable/enable server-side copy accordingly, since it doesn’t work in combination with delete messages. Which basically means there is no way to go back in time on the current satellites. It can still be useful for Storj private cloud setups or any other brand-new satellite that never had server-side copy enabled.
Can the badger cache save and resume progress? I mean, during the first cold used-space scan, will I lose progress if I restart the node?
Caching is separate from saving filewalker progress, although if you restart the filewalker and lose progress, the cache should speed up the rerun. Saving used-space progress is not implemented yet.
While trying to set up a new node on the QA satellite (test network), I accidentally upgraded a regular node to v1.108.3, so I figured I’d try out this new badger cache on it. I added the following line to config.yaml and restarted it:
pieces.file-stat-cache: badger
I don’t know if that’s the correct procedure for using this new feature, but it seemed to be working OK at first. After it finished the first filewalker, I restarted the node. Before long it went into a loop of errors and restarts. The error in the log file is:
2024-07-12T21:00:43Z ERROR failure during run {"Process": "storagenode", "error": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
2024-07-12T21:00:43Z FATAL Unrecoverable error {"Process": "storagenode", "error": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Error opening database on storagenode: while opening memtables error: while opening fid: 2 error: while updating skiplist error: mremap size mismatch: requested: 20 got: 134217728\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
I commented out the badger line in my config file, restarted the node, and it’s working normally again. What caused this? User error? A software bug? Hardware problems? I don’t know, but I’m going to leave this feature turned off until I learn more about it. Just sharing my experience with the community. Thanks.
It seems it is not a stable feature (yet):
No, but the pieces may be deleted by a garbage collector instead of the TTL collector.
Is it possible that you have another error somewhere earlier, like a FATAL error? Maybe you have Unrecoverable errors in your logs?