Badger cache filewalker test results

It could log at startup that it is activated; I did not see anything about that.

When I briefly had my log level at DEBUG, I think I did see messages about the file stat cache. But I don't usually run at debug.

I just check whether the size of the badger DB is growing, and I can look for activity on the separate SSD the cache lives on (mounted separately via Docker).
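
A minimal sketch of how I check, assuming Docker and hypothetical paths (adjust the container name and paths to your own setup):

# Look for file-stat-cache activity in the node log
# (the logger is named "filestatcache"; most of its messages show up at DEBUG level):
docker logs storagenode 2>&1 | grep -i filestatcache | tail -n 20

# Watch the badger cache directory grow; here it is assumed to live on a separately mounted SSD:
du -sh /mnt/ssd/filestatcache

# The extra bind mount added to the usual "docker run" command to put the cache on that SSD
# would look something like this (container-side path is an assumption; check where your
# filestatcache dir actually lives):
#   --mount type=bind,source=/mnt/ssd/filestatcache,destination=/app/config/storage/filestatcache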

So just this afternoon, two different nodes that were using the badger cache threw errors that prevented the whole node from working.

Not sure what was special, except that the nodes had been doing a lot of garbage collection.


2024-08-28T13:07:09-07:00       ERROR   db.filestatcache        Received err: file does not exist for table 350. Cleaning up...      {"Process": "storagenode"}
2024-08-28T13:07:09-07:00       ERROR   failure during run      {"Process": "storagenode", "error": "Error opening database on storagenode: file does not exist for table 350\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Error opening database on storagenode: file does not exist for table 350\n\tstorj.io/storj/storagenode/storagenodedb.cachedBlobstore:231\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:250\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
Error: Error opening database on storagenode: file does not exist for table 350
        storj.io/storj/storagenode/storagenodedb.cachedBlobstore:231
        storj.io/storj/storagenode/storagenodedb.OpenExisting:250
        main.cmdRun:67
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:392
        storj.io/common/process.cleanup.func1:410
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:271
2024-08-28 13:07:09,793 INFO exited: storagenode (exit status 1; not expected)
2024-08-28 13:07:11,797 INFO spawned: 'storagenode' with pid 47

I was able to work around it by moving all the badger cache files to a different folder and effectively starting over from scratch.
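
Roughly, the steps were (a sketch; the container name and paths are placeholders):

docker stop -t 300 storagenode
# move the existing badger cache aside so the node rebuilds it from scratch
mv /mnt/ssd/filestatcache /mnt/ssd/filestatcache.broken
mkdir -p /mnt/ssd/filestatcache
docker start storagenode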

Was it the update to 1114? Are your 2 nodes Windows?

Another set of results, with the cache on an SSD. The totals differ between runs, but the improvements are so obvious that this does not matter much.

Without cache:

2024-08-28T11:55:23Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-08-31T06:40:12Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "Total Pieces Size": 3612477414725, "Total Pieces Content Size": 3598692620101}

Cache enabled:

2024-08-31T15:14:08Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-08-31T15:19:54Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "Total Pieces Size": 1038449295202, "Total Pieces Content Size": 1035869582690}

Without cache:

2024-08-31T06:40:12Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-08-31T07:39:53Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": false, "Total Pieces Size": 414286876928, "Total Pieces Content Size": 413831412480}

Cache enabled:

2024-08-31T15:19:54Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-08-31T15:20:19Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": false, "Total Pieces Size": 412080359680, "Total Pieces Content Size": 411629355776}

Without cache:

2024-08-31T07:39:53Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-31T09:18:26Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 190795233536, "Total Pieces Content Size": 190500340992}

Cache enabled:

2024-08-31T15:20:19Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-31T15:20:20Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 37880249344, "Total Pieces Content Size": 37864313344}

Without cache:

2024-08-31T09:18:26Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-08-31T09:37:48Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "Total Pieces Size": 68187273728, "Total Pieces Content Size": 68070129664}

Cache enabled:

2024-08-31T15:20:20Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-08-31T15:20:22Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "Total Pieces Size": 60985184256, "Total Pieces Content Size": 60885697024}
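
If anyone wants to pull these numbers out of their own logs, here is a small sketch that pairs the started/completed lines per satellite and prints the elapsed minutes. It assumes GNU awk (for mktime), UTC "Z" timestamps like the ones above, and an example log path:

grep 'used-space-filewalker' /mnt/storj/node.log | gawk '
  {
    if (match($0, /"Satellite ID": "[^"]+"/) == 0) next
    sat = substr($0, RSTART + 17, RLENGTH - 18)   # the ID between the quotes
    ts = $1
    sub(/Z$/, "", ts)                             # drop trailing Z (assumes UTC logs as above)
    gsub(/[-:T]/, " ", ts)                        # "2024-08-31T15:14:08" -> "2024 08 31 15 14 08"
    t = mktime(ts)                                # local-time epoch; differences are still correct
  }
  /used-space-filewalker started/   { start[sat] = t }
  /used-space-filewalker completed/ { if (sat in start) printf "%s: %.1f minutes\n", sat, (t - start[sat]) / 60 }
'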

Now you see why I keep saying the cache location should be configurable, so that we can put it on an SSD.

I think that's not what @jammerdan tested. Those are results without the badger cache versus with the badger cache, not with the cache on HDD versus on SSD.
@jammerdan please correct me if I'm wrong.

Correct.
"Without cache": without any badger cache.
"Cache enabled": with the badger cache. But the badger cache is on an SSD, so the results would likely be quite a bit slower with the badger cache on an HDD.


You may be correct, because I noticed recently that on the node with the badger cache enabled (it's on the same HDD as the data), the US1 satellite took a lot of time to scan (not days like before, but longer than Saltlake):

2024-08-29T18:15:15Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"}
2024-08-29T18:15:18Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Lazy File Walker": false, "Total Pieces Size": 110854656, "Total Pieces Content Size": 110829568}
2024-08-29T18:15:18Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-29T20:03:23Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 1766839000320, "Total Pieces Content Size": 1763175659264}
2024-08-29T20:03:23Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-08-29T20:23:57Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "Total Pieces Size": 137937089536, "Total Pieces Content Size": 137714464256}
2024-08-29T20:23:57Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-08-30T10:50:51Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "Total Pieces Size": 2360423466410, "Total Pieces Content Size": 2354016213418}
2024-08-30T10:50:51Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-08-30T12:53:46Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": false, "Total Pieces Size": 661815707136, "Total Pieces Content Size": 660938693632}

The node is still busy with the GC.

Did you reboot your server after each test? If not, these results might not be very meaningful because after the first run of the used-space-filewalker without Badger, the metadata would already be in RAM. This could boost the performance of subsequent runs of the used-space-filewalker even without Badger.
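
For future comparisons, one way to rule that out on Linux without a full reboot is to drop the kernel's page, dentry, and inode caches between runs; a minimal sketch (needs root, and it will temporarily slow everything down while the caches refill):

sync
echo 3 > /proc/sys/vm/drop_caches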

They have a low-power device with little RAM, so it is unable to cache the metadata fully.
But you are of course correct, which is why I compare results taken several days apart.

That is an excellent thought.
I must admit I did not reboot or manually clear the cache before the filewalker run with the cache enabled. However, the log timestamps show that there are up to almost 9 hours between finishing without the cache and starting with the cache enabled. I doubt that much metadata would still be left in the RAM cache after that time, but I don't know for sure.

But as coincidence would have it, I have new data for at least one satellite on that node, which happened to run the filewalker again:

2024-09-02T21:51:56Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-09-03T04:58:05Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "Total Pieces Size": 1028755594490, "Total Pieces Content Size": 1026224779002}

Yes, it took more time, but it was still much, much faster than the initial filewalker run without any caching. And the number of pieces has changed, so if it had to read new data from the slow HDD into the cache, some extra time would be expected. So I think the results are still good (on a heavily used HDD).

Edit:
Additional results are flowing in:

2024-09-03T04:58:05Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-09-03T05:24:26Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": false, "Total Pieces Size": 425165638400, "Total Pieces Content Size": 424707436800}
2024-09-03T05:24:26Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-09-03T05:41:53Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 33298512640, "Total Pieces Content Size": 33283112704}
2024-09-03T05:41:53Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-09-03T05:42:45Z    INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "Total Pieces Size": 59897545472, "Total Pieces Content Size": 59799487232}

You are likely correct. If you don't have a special device for metadata, it has likely been evicted after that much time.

I agree with that.

My first test results…
Synology DS216+ 8GB RAM, 2 nodes on 2 Ironwolves 8TB, ext4, noatime, no SSD, no USB drive.
Logs, databases, badger cache on storage drives.
Tested on only one drive; the other has FW disabled. Hot start, no DSM reboot.
First FW run, badger cache in the making, lazy off, 4.4TB data - 18.5h.
Second FW run - 1h and 7min.
This is impressive. The problem is the instability of this cache if the UPS fails to shut down the node properly during an outage.
The other problem is that in the filestatcache dir, files are created each day, taking up space;
after 3 days I have 262 files, taking 3.25GB in total. They are numbered 000xxx.sst, where xxx is a constantly increasing number; it's now at 806. Each file is from a few KB to a few MB, even 15MB, in size. Is this normal? Are they cleaned up at some point?

It's normal. About 4-5 GB with 10-12 TB nodes.
It can decrease if the node's data decreases too, of course.
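
If you want to keep an eye on it, something like this shows the file count and total size (the path is just an example for a Synology volume; point it at your actual filestatcache dir):

find /volume1/storj/storage/filestatcache -name '*.sst' | wc -l
du -sh /volume1/storj/storage/filestatcache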
