It seems that the new feature “save-state-resume GC filewalker” isn’t functioning as expected

Version: v1.101.3
Operating System: Ubuntu
Environment: Docker

It seems that the new feature “save-state-resume GC filewalker” isn’t functioning as expected. Both new databases (used_space_per_prefix.db and garbage_collection_filewalker_progress.db) always remain empty (no records with satellite ID and prefix), even after the node has been stopped. The DB files’ size stays at 24576 bytes, and their modification time remains the moment the files were originally created after updating the node. Is this the expected behavior? I thought that in this version, when the node is stopped, the filewalker should save the satellite ID and prefix (the last scanned directory) to the DB.
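For reference, this is roughly how I check whether the two databases contain any rows. The path and the row-count trick are assumptions on my side (sqlite3 installed on the host, databases in the default storage location); adjust them to your setup:

# Sketch: inspect the two new filewalker databases while the node is stopped.
# DB_DIR is a hypothetical path; point it at the directory holding your node's *.db files.
DB_DIR=/mnt/storagenode/storage
for db in used_space_per_prefix.db garbage_collection_filewalker_progress.db; do
  echo "== $db =="
  ls -l "$DB_DIR/$db"                                    # size and modification time
  sqlite3 "$DB_DIR/$db" ".tables"                        # list whatever tables the node created
  sqlite3 "$DB_DIR/$db" ".dump" | grep -c "INSERT INTO"  # rough row count (0 means empty)
done

In my case every table comes back empty, and the file size never moves from 24576 bytes.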

Steps to reproduce:

  1. Execute the command: sudo docker start storagenode.
  2. Wait a few minutes/hours.
  3. Execute the command: sudo docker stop -t 300 storagenode.

The last records in the log file are as follows:

2024-04-16T17:25:29Z INFO Got a signal from the OS: “terminated” {“Process”: “storagenode”}
2024-04-16T17:25:29Z INFO lazyfilewalker.used-space-filewalker subprocess exited with status {“Process”: “storagenode”, “satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “status”: -1, “error”: “signal: killed”}
2024-04-16T17:25:29Z ERROR pieces failed to lazywalk space used by satellite {“Process”: “storagenode”, “error”: “lazyfilewalker: signal: killed”, “errorVerbose”: “lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:738\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
2024-04-16T17:25:29Z ERROR piecestore:cache error getting current used space: {“Process”: “storagenode”, “error”: “filewalker: context canceled”, “errorVerbose”: “filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:747\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}
2024-04-16T17:25:29Z DEBUG tracing collector stopped {“Process”: “storagenode”}

After the node was stopped, the files used_space_per_prefix.db-wal and used_space_per_prefix.db-shm still exist in the database directory (I believe both should be removed by the node when it shuts down cleanly).
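For context, the -wal and -shm files are SQLite’s write-ahead-log sidecars; on a clean close SQLite normally checkpoints the WAL back into the main file and removes both. A quick way to check them by hand while the node is stopped (my own sketch, not official guidance, and the path is hypothetical):

# Sketch: open the leftover WAL with sqlite3 and checkpoint it manually.
# Run only while the storagenode container is stopped; adjust the path.
DB=/mnt/storagenode/storage/used_space_per_prefix.db
sqlite3 "$DB" "PRAGMA wal_checkpoint(TRUNCATE);"
ls -l "$DB"*   # after sqlite3 exits cleanly, the -wal/-shm sidecars are normally removed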

Is this a bug, or am I misunderstanding something?

1 Like

You seem to be reporting info for the used-space filewalker, while the change is only for the garbage-collection filewalker. You will have to test again when there is an active gc-filewalker process.

The “context canceled” log entries are the result of you shutting down the node, which causes the used-space filewalker to cancel what it is doing; that is expected.

1 Like

I’m not sure about this. As far as I can see in the source code on GitHub, the developers made similar changes for both filewalkers. It would be nice to get clarification from the team.

2 Likes

@Alexey

As I have an issue with the used-space filewalker not finishing, I was wondering the same: a resume feature for the used-space filewalker might help it finish.
Can you confirm that the resume feature applies to the used-space filewalker as well, from 1.101.3 on?

And how can one tell from the logs whether the filewalker is in fact resuming and not restarting from scratch?
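In case it helps, this is roughly how I would try to check it myself. “used-space-filewalker” appears in the logs quoted above; the wording it prints when resuming is only a guess on my part, so adjust the second pattern to whatever your version actually logs:

# Sketch: look at used-space-filewalker lines right after a restart.
sudo docker logs --since 1h storagenode 2>&1 | grep -i "used-space-filewalker" | grep -iE "resum|prefix"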

1 Like

As I was informed, it should store a BF (bloom filter) on the disk; if the current GC run is not interrupted, it will finish its work, and if it is interrupted, it will start from scratch with a new one.
So, I do not know what to say. If it is not interrupted, it should finish; otherwise it will use a new bloom filter if one exists, and if not, it is expected to run with the saved one.

No, the question was about the used-space filewalker:
whether it already has a working stop-resume implementation, whether one is planned, or anything like that.

Because we see this: storagenode/pieces: save-state-resume feature for used space filewalker

1 Like

It should too, but I do not see it in 1.101.3.

Not good. That feature is really needed for cases where the used-space filewalker does not finish for whatever reason.

I think so, but it’s not merged yet, so… Please do not restart your node before the filewalker has completed.

My problem is that the node finishes the used-space filewalker for AP1 and Saltlake, but not for US1 and EU1, and then starts from scratch.

Is there a way to see some kind of ETA, or for which version it is planned?
I think I have to turn off the used-space filewalker completely until this feature is available.

1 Like

We don’t restart, but the darn auto-updater in Docker does it for us. I haven’t touched my nodes for 6 days since the update to 1.101.3, but the used-space-filewalker still hasn’t completed on many nodes. And now, some of them were restarted by the storagenode-updater because version 1.102.3 started to roll out.

3 Likes

Maybe @clement can tell us more about when we can expect the used-space filewalker to be able to resume as well, as he seems to be assigned to that.

I have now turned off the used-space filewalker on the nodes where it does not finish. That cannot be the solution, and I hope the resume feature for the used-space filewalker will fix it.

It might also be an idea to make it configurable how often the space-calculation filewalker should run. Is it required on every restart? No. Maybe it is enough to run it once a week; there is probably no need to run it multiple times a day.
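For completeness, the way I turned the startup scan off was a one-line config change; the option name below is what I have in my config.yaml, but please double-check it against your version before relying on it:

# Sketch: disable the used-space scan on startup by adding one line to config.yaml,
# then restart the container. If the key already exists (even commented out), edit it
# in place instead of appending. The config path here is hypothetical.
echo "storage2.piece-scan-on-startup: false" | sudo tee -a /mnt/storagenode/config.yaml
sudo docker restart storagenode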

2 Likes

I don’t know exactly how to read this, but it looks like progress is happening.

https://review.dev.storj.io/c/storj/storj/+/12806

build verify is finished successfully: storj-gerrit-verify #12921 [Jenkins]

Whatever that means.

3 Likes

Quick intro to the development process:

  • We merge all commits after +2 approval from two independent reviewers + a green build
  • storj-gerrit-verify is the first pass of the build; we have another round after the two reviews (pre-merge)
  • usually we cut release branches bi-weekly, unless there are urgent changes

Looks like the initial build has passed, so this patch seems to be ready for review. Now we need two reviewers / all the suggested modifications done before a release cut. Because preparation for the next release has already started, the realistic expectation is that this one will be included one release later (104) – if we can get all the required changes done by that time.

But no promises. It depends on other factors as well, for example fixing / debugging other important issues / emergencies.

Also: if you are a developer, I encourage you to review the code, and feel free to add your comments in Gerrit. Every comment can be helpful…

5 Likes

At least I thought I did:

I find the lack of “We test it this way, and then we test it that way, and then we finally test it that way” lines disturbing…

Fair point, but this is just because it was written by somebody with developer glasses (=me). Shame on me, I ignored these details:

The release cut usually happens early or mid-week. After that, the release candidate is deployed to QA (you can see it if you join the QA satellites or use the QA versioning server). Final deployment typically happens the following Monday (or later, if QA/devs are not fully confident).

BTW, you can see a huge number of integration tests in the code itself (search for testplanet), which are automatically executed for every commit. They also implement full e2e test scenarios (but not on separate deployments).

4 Likes

I thought it was resolved?

Seems there: