Node keeps restarting with 100% disk utilization

Hello,

I have 1 node running on a Synology NAS (RS422+ / DSM7). It has its own dedicated volume (with 1 disk) with nothing else on it. The DB is currently on the same volume as the storage itself.

It had been running fine for a couple of weeks, with heavy disk utilisation and occasional periods at 100%, but rare shutdowns (one every couple of days). Since June 18th (last update), it has been constantly pushing the disk to 100% with heavy reads (I assume these are the filewalkers), and the node restarts every couple of hours (10 times so far today, 9 times yesterday).

htop confirms that storj is the only process significantly consuming I/O resources.

Here is the history of the volume IOPS (the storage is on volume 1):

[screenshot: volume IOPS history]

and utilisation:

[screenshot: volume utilisation history]

Even in the period from the 13th to the 16th, utilisation was high but not quite 100%, and the system was responsive.

I have grepped the last 5k lines of the log for “ERROR”; here is an excerpt (full file attached):

2025-06-19T08:47:40Z    ERROR   db      failed to stat blob in trash    {"Process": "storagenode", "namespace": "ootPBOELroXWf0xsuCv41MDw9HqOpyYnUk3rbsAAAAA=", "key": "MEMBkZFQMFYjDH91cdfUaPUJdqoLbs7miqz8NAVHxUo=", "error": "lstat config/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/2025-06-11/gb/bqdemrkayfmiymp52xdv6und2qs5vkbnxm5zukvt6dibkhyvfa.sj1: no such file or directory"}
2025-06-19T09:11:36Z    ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "forgetsatellite:chore", "error": "database is locked"}
2025-06-19T09:11:36Z    ERROR   piecestore:cache        error getting current used space for trash:     {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled", "errorVerbose": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:302\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:104\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T09:11:36Z    ERROR   pieces:trash    emptying trash failed   {"Process": "storagenode", "error": "pieces error: lazyfilewalker: signal: killed", "errorVerbose": "pieces error: lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkCleanupTrash:196\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:486\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:86\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2025-06-19T09:11:39Z    ERROR   failure during run      {"Process": "storagenode", "error": "database is locked"}
2025-06-19T09:11:44Z    ERROR   failure during run      {"Process": "storagenode", "error": "Error opening database on storagenode: database: pricing opening file \"config/storage/pricing.db\" failed: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:302\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:402\n\tstorj.io/common/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:116\n\tmain.main:34\n\truntime.main:283", "errorVerbose": "Error opening database on storagenode: database: pricing opening file \"config/storage/pricing.db\" failed: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabaseWithStat:390\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:367\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:362\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:337\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:302\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:402\n\tstorj.io/common/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:116\n\tmain.main:34\n\truntime.main:283\n\tmain.cmdRun:69\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:402\n\tstorj.io/common/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:985\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1117\n\tgithub.com/spf13/cobra.(*Command).Execute:1041\n\tstorj.io/common/process.ExecWithCustomOptions:116\n\tmain.main:34\n\truntime.main:283"}
2025-06-19T11:57:09Z    ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "forgetsatellite:chore", "error": "database is locked"}
2025-06-19T11:57:09Z    ERROR   gracefulexit:chore      error retrieving satellites.    {"Process": "storagenode", "error": "satellitesdb: database is locked", "errorVerbose": "satellitesdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:198\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T11:57:09Z    ERROR   pieces:trash    emptying trash failed   {"Process": "storagenode", "error": "pieces error: lazyfilewalker: signal: killed", "errorVerbose": "pieces error: lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkCleanupTrash:196\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:486\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:86\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2025-06-19T11:57:09Z    ERROR   piecestore:cache        error getting current used space for trash:     {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled", "errorVerbose": "filestore error: failed to walk trash namespace a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:302\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:104\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T11:57:13Z    ERROR   failure during run      {"Process": "storagenode", "error": "database is locked"}
2025-06-19T12:17:49Z    ERROR   piecestore:cache        error getting current used space for trash:     {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000: context canceled", "errorVerbose": "filestore error: failed to walk trash namespace af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:302\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:104\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T12:17:49Z    ERROR   pieces:trash    emptying trash failed   {"Process": "storagenode", "error": "pieces error: lazyfilewalker: signal: killed", "errorVerbose": "pieces error: lazyfilewalker: signal: killed\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:85\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkCleanupTrash:196\n\tstorj.io/storj/storagenode/pieces.(*Store).EmptyTrash:486\n\tstorj.io/storj/storagenode/pieces.(*TrashChore).Run.func1.1:86\n\tstorj.io/common/sync2.(*Workplace).Start.func1:89"}
2025-06-19T13:21:41Z    ERROR   gracefulexit:chore      error retrieving satellites.    {"Process": "storagenode", "error": "satellitesdb: database is locked", "errorVerbose": "satellitesdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:198\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T13:21:41Z    ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "forgetsatellite:chore", "error": "database is locked"}
2025-06-19T13:21:41Z    ERROR   piecestore:cache        error getting current used space for trash:     {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000: context canceled", "errorVerbose": "filestore error: failed to walk trash namespace af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:302\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:104\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T13:21:43Z    ERROR   piecestore:cache        error persisting cache totals to the database:  {"Process": "storagenode", "error": "piece space used: context canceled", "errorVerbose": "piece space used: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).UpdatePieceTotalsForSatellite:188\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).UpdatePieceTotalsForAllSatellites:159\n\tstorj.io/storj/storagenode/pieces.(*CacheService).PersistCacheTotals:144\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func2.1:127\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func2:122\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Volume 2 utilisation is high when Hyper Backup and the scheduled Plex tasks run daily, but this has not caused any issues so far.

Appreciate any help!
Cheers.
Any clue what could be happening, or how I can diagnose this further?

The core issue seems to be the “database is locked” problem. The node restarts before the filewalker completes… which launches the filewalker again, so you’re always at 100% I/O.

But I don’t know what causes the locked-database issues; someone smarter will need to read this…
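
You can at least confirm the restart loop from the host with something like this (a sketch; replace storjnode with your container name):

    # How many times Docker's restart policy has restarted the container,
    # and when it last started:
    docker inspect -f '{{.RestartCount}} restarts, last started {{.State.StartedAt}}' storjnode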

Thanks. From reading other threads, I was thinking more that the DB message is a consequence of the disk being unavailable due to high I/O.

Hello @the_french_SNO,
Welcome to the forum!

You are likely correct: while the disk is under load, the database can be locked for new writes, and this usually suggests a slow disk.

However, your node has managed to lose some data, according to the “failed to stat blob in trash … no such file or directory” error in your excerpt.

So I would suggest stopping the node and checking/fixing errors on the disk first. If the filewalker still fails after that, you may consider enabling the badger cache or migrating to hashstore.
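
On DSM you can do this from Storage Manager (Data Scrubbing), or over SSH with something like the following (a sketch; it assumes your Storj volume is mounted at /volume1 and the container is named storjnode):

    docker stop -t 300 storjnode           # let the node shut down cleanly first
    sudo btrfs scrub start -B /volume1     # -B stays in the foreground until done
    sudo btrfs scrub status /volume1       # show any errors found / corrected

Keep in mind that on a single-disk volume a scrub can repair metadata (which btrfs duplicates by default), but can only detect, not repair, damaged data blocks.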

Hello @Alexey
Thanks for your answer. I stopped the node yesterday via Docker and restarted it, and since then the issue has disappeared… Pretty weird. I have not done any operations on the disk…
I will look into the 2 posts you referenced for information.

Thanks.

Perhaps you have a scrub job doing housekeeping on your storage, and it has fixed the issue?

I am on a Synology with DSM7, and the disks are on Btrfs. I have not scheduled anything myself. I checked some parameters but could not see anything related to this… Automatic data cleanup is turned off. All SMART tests are green. Not sure…
Anyway, unless someone has a brilliant idea for diagnosing this further forensically, I think we can close this as no longer happening.

As part of the stop and restart, did you rm the container? That could explain it.

Nope. The container was created via the command line over SSH, but I stopped and restarted it using the DSM Container Manager GUI.

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp \
    -p 28967:28967/udp \
    -p 14002:14002 \
    -e WALLET="0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" \
    -e EMAIL="xxxxx@xxxxxx.com" \
    -e ADDRESS="XX.XX.XX.XX:28967" \
    -e STORAGE="7TB" \
    --user 1037:65541 \
    --mount type=bind,source="/volume1/storj/Identity",destination=/app/identity \
    --mount type=bind,source="/volume1/storj/storage",destination=/app/config \
    --log-opt max-size=20m \
    --log-opt max-file=5 \
    --name storjnode storjlabs/storagenode:latest \
    --log.level=info \
    --log.custom-level=piecestore=FATAL,collector=WARN \
    --operator.wallet-features=zksync-era

Synology’s implementation of Docker is not always perfect and won’t remove a container on stop. Removal is needed for some configuration changes to take effect.

I’d highly suggest moving your docker run command into a docker-compose file and starting/stopping it with docker compose up/down over SSH to the box directly; that solved all my issues on Synology.
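
For reference, here is an untested sketch of your run command translated to a compose file (values copied from your post; adjust as needed):

    services:
      storagenode:
        image: storjlabs/storagenode:latest
        container_name: storjnode
        restart: unless-stopped
        stop_grace_period: 300s
        user: "1037:65541"
        ports:
          - "28967:28967/tcp"
          - "28967:28967/udp"
          - "14002:14002"
        environment:
          WALLET: "0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
          EMAIL: "xxxxx@xxxxxx.com"
          ADDRESS: "XX.XX.XX.XX:28967"
          STORAGE: "7TB"
        volumes:
          - /volume1/storj/Identity:/app/identity
          - /volume1/storj/storage:/app/config
        logging:
          options:
            max-size: "20m"
            max-file: "5"
        command: --log.level=info --log.custom-level=piecestore=FATAL,collector=WARN --operator.wallet-features=zksync-era

Save it as docker-compose.yml; docker compose up -d then recreates the container whenever the file changes, so configuration changes always take effect.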


Thanks, I know I need to do docker rm to change the config. Moving to compose is on the todo list 😉
So, just after restarting the container (no rm), the problem seems to have disappeared, though this may just be a coincidence…

Yes, it’s a coincidence. The root of the problem is that your node generates absolutely unnecessary I/O that is hitting a disk subsystem which cannot handle it in its present configuration.

Synology supports metadata caching. I strongly suggest utilizing it, as well as optimizing the subvolumes where the storagenode data lives for access rather than consistency: turn off sync, disable btrfs redundancy, turn off atime updates, and move the databases entirely onto an SSD if possible; at minimum, disable sync on the subvolume holding the databases, or even move them to a ramdisk. The databases are 100% useless and disposable.
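
A sketch of two of these, assuming SSH access and the paths from your docker run command (the /volume1/storj/dbs path is just an example):

    # Stop recording access times on every read (saves a metadata write
    # per file touched by the filewalkers); reapply after reboots:
    sudo mount -o remount,noatime /volume1

    # Move the SQLite databases out of the blob directory with the node's
    # storage2.database-dir option; the new location must be bind-mounted:
    #   --mount type=bind,source=/volume1/storj/dbs,destination=/app/dbs
    #   ... storjlabs/storagenode:latest --storage2.database-dir=/app/dbs ...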

Ideally the bulk of the I/O should hit the accelerator device, not the HDD. You should see no more than 20-30 IOPS from the node.

This (keeping the databases on the same volume as the storage) is a non-starter.

Thanks for looking into this.

For context, I use the NAS for other things for which I need the redundancy, etc., so I cannot tune system-wide settings just for Storj. It’s a choice I made, and I’d rather stop being an SNO if it becomes unmanageable.

Regarding your suggestions: I am not a noob on DSM, but not an expert either, so here are a few questions:

  • “turn off sync” → What do you mean by “sync”, and how do I turn it off? I did not see any option in DSM. Is it accessible via SSH?
  • “disable btrfs redundancy” → I am not sure what you mean. Are you talking about RAID or similar? My Storj volume is on an SHR volume with 1 disk; my other volume is also SHR with 3 disks. Both are Btrfs.
  • What are “atime updates”?

Unfortunately, the NAS I have does not have an SSD slot, and I have 4 disks with no expansion capability. It also has only 2GB of RAM, which is not expandable, so I am not sure a ramdisk would be possible, unfortunately.

  • “disable sync on the subvolume with the databases” → That would require setting up a third volume, which would take at least 1 disk, right? If so, I am not sure I am willing to do that: I need the 3 disks for size and redundancy on the other volume. I guess once I know what you mean by “sync”, I can dig into this further.

Here are the last few days in IOPS and utilisation:

[screenshots: volume IOPS and utilisation history]

Thanks a lot anyway.

I can elaborate more on the other questions, including btrfs pools vs. subvolumes, mount parameters, and CoW, but this (the 2GB of non-expandable RAM) makes the point moot:

That appliance is not suitable for Storj, period.

The storagenode will keep dying unless metadata access is accelerated. You want the metadata to fit in RAM or, as the next best thing, on an SSD.

Furthermore, the performance of your other pool will plummet, because its filesystem caches will keep being evicted to partially fit those of the storagenode.

Don’t run nodes on that appliance.

OK, well, it runs OK for now… If I change hardware, I’ll make sure I put more RAM and an SSD somewhere.
Thanks.

Topic title: “Node keeps restarting with 100% disk utilization”

🤷

That was not a fix, but kicking the can down the road. The fundamental issue is still there. Once traffic increases, caches fill, and usage picks up, the node will die again. Then you will restart it again? What’s the point of running a node at 10% of its capabilities, all while decimating the performance of your NAS?

Hello @arrogantrabbit

As you can check in my second post, from the day after I created the topic, I stated that the issue had disappeared after a simple restart.

And further along, I also stated that if I change hardware, I will make sure to add more RAM and an SSD.

So I don’t really get what’s wrong.

Then, while you seem quite confident and decisive in your conclusions (and you might be right), I still think there might be something else going on, as the pattern you describe does not seem to hold.

The node ran for several weeks before the incident, and now for a week after it, and while the disk is certainly busy, probably for all the reasons you mentioned, it is still running fine and filling up, with no “dying”. And the performance I am getting from the NAS for the rest of my applications is absolutely fine as well.
So: 1) as stated previously, I don’t have an issue right now, though I will not be surprised if I get more later.
2) I am keen to learn, so if you want to share with me (and the community) some more knowledge on “btrfs pools vs subvolumes, mount parameters and CoW”, that would be great.
If not, that’s fine as well, and thanks for your help.

Best regards.

I believe @arrogantrabbit, with their inescapable charm, is just trying to say that the setup is not really sustainable, even if it might work for a while, and that it might indeed be a good idea to kill the node if migrating to better hardware is out of the question.

I’ll add though that there are periodic high-I/O chores the node needs to perform, without which you will be hitting other problems. For example, trash collection is I/O intensive, and constant restarting will prevent it from working, leaving trash on your disk. I suspect the graphs you have shared with us may actually be showing these chores, as opposed to just higher than usual traffic. While there are workarounds that reduce the impact of restarts, in the end, your node will just not bring enough revenue to justify its operation.

In the end, the decision is yours. If I may suggest something, though: you could also limit the size of the node’s storage to stop further uploads. That’s an easy way to reduce the I/O necessary for the node to function while waiting for an opportunity to upgrade the hardware in the future.
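
A sketch of what that looks like with the docker run command above (4TB is just an example; pick a value at or below what the node already stores):

    docker stop -t 300 storjnode
    docker rm storjnode
    # ...then rerun the original docker run command, changing only:
    #   -e STORAGE="4TB"

Once the allocation is at or below the used space, the node reports itself as full and the satellites stop sending new uploads.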


Thanks. Indeed, I had matched the high-IOPS periods against the filewalker entries in the logs, and they lined up.
I understand you both feel the node is unsustainable and that it “barely” works. What I am trying to say is that it actually works well, or well enough: the system is reachable without delay, apps work (surveillance, FTP, photos, Plex), and from a Storj perspective, the node keeps growing, stays online, and does not crash.
Yes, it is pretty busy, and my Storj volume is often at 99% utilisation, but that’s OK and does not impact the rest. I just had one “glitch” in about 4 weeks where the system was completely unusable for no reason I could understand, and it is this “glitch” that I wanted a root cause for.

By the way, I am getting 50GB of ingress per day, with total traffic at 80GB/day. Is this considered “a lot” on any scale you have?

I think it could be one of these chores, or both, since they can run in parallel: deleting expired pieces with a TTL and deleting pieces from the trash. They are very IOPS-intensive, especially on Btrfs without any acceleration from extra RAM or an SSD.
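
You can see in the logs when these chores run, e.g. (a sketch, using the container name from above):

    # Recent TTL-collector, trash, and retain (garbage collection) activity:
    docker logs --tail 5000 storjnode 2>&1 | grep -E "collector|pieces:trash|retain"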