Can I recover from this crash?

Came down today to a bunch of emails saying one of my nodes was offline. On checking, it appears my TrueNAS SCALE server had cr@pped out. After restarting the server and checking Storj, it looks like the node is in a restart loop every 4 or 5 seconds. All I see in the Storj log for each attempt is:

2025-09-06T23:21:32Z    INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2025-09-06T23:21:32Z    INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2025-09-06T23:21:32Z    INFO    Operator email  {"Process": "storagenode", "Address": "xxx@yyy.com"}
2025-09-06T23:21:32Z    INFO    Operator wallet {"Process": "storagenode", "Address": "0xc43DBD0E344E3a75AA64D917962D06feFB5443Fc"}

Looking at the docker logs in TrueNAS shows:

2025-09-06 23:21:32.348412+00:002025-09-06 23:21:32,347 INFO spawned: 'storagenode' with pid 97
2025-09-06 23:21:32.572390+00:00unexpected fault address 0x7f1f6e7e000c
2025-09-06 23:21:32.572431+00:00fatal error: fault
2025-09-06 23:21:32.574356+00:00[signal SIGBUS: bus error code=0x2 addr=0x7f1f6e7e000c pc=0x484003]
2025-09-06 23:21:32.574389+00:002025-09-06T23:21:32.574389416Z
2025-09-06 23:21:32.574399+00:00goroutine 1 gp=0xc000002380 m=10 mp=0xc001180008 [running]:
2025-09-06 23:21:32.574405+00:00runtime.throw({0x25f2008?, 0xc000f572e8?})
2025-09-06 23:21:32.574411+00:00/usr/local/go/src/runtime/panic.go:1101 +0x48 fp=0xc000f572a0 sp=0xc000f57270 pc=0x47aaa8
2025-09-06 23:21:32.574417+00:00runtime.sigpanic()
2025-09-06 23:21:32.574422+00:00/usr/local/go/src/runtime/signal_unix.go:922 +0x10a fp=0xc000f57300 sp=0xc000f572a0 pc=0x47ca0a
2025-09-06 23:21:32.574428+00:00runtime.memmove()
2025-09-06 23:21:32.574434+00:00/usr/local/go/src/runtime/memmove_amd64.s:117 +0xc3 fp=0xc000f57308 sp=0xc000f57300 pc=0x484003
2025-09-06 23:21:32.574440+00:00github.com/dgraph-io/ristretto/v2/z.(*mmapReader).Read(0xc001060660, {0xc000e9c000?, 0x1000, 0x800010000f573a0?})
2025-09-06 23:21:32.574483+00:00/go/pkg/mod/github.com/dgraph-io/ristretto/v2@v2.0.0/z/file.go:100 +0x76 fp=0xc000f57338 sp=0xc000f57308 pc=0x1bc5ad6
2025-09-06 23:21:32.574506+00:00bufio.(*Reader).Read(0xc000900ae0, {0xc00ae99106, 0x16, 0x4783b9?})
2025-09-06 23:21:32.574510+00:00/usr/local/go/src/bufio/bufio.go:245 +0x197 fp=0xc000f57370 sp=0xc000f57338 pc=0x5aa7d7
2025-09-06 23:21:32.574513+00:00github.com/dgraph-io/badger/v4.(*hashReader).Read(0xc00aea81b0, {0xc00ae99106, 0xc000f573d8?, 0x16})
2025-09-06 23:21:32.574516+00:00/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.5.0/value.go:98 +0x2a fp=0xc000f573a0 sp=0xc000f57370 pc=0x1c2daaa
2025-09-06 23:21:32.574519+00:00io.ReadAtLeast({0x2b33b60, 0xc00aea81b0}, {0xc00ae99100, 0x1c, 0x1c}, 0x1c)
2025-09-06 23:21:32.574522+00:00/usr/local/go/src/io/io.go:335 +0x91 fp=0xc000f573e8 sp=0xc000f573a0 pc=0x4ba8b1
2025-09-06 23:21:32.574525+00:00io.ReadFull(...)
2025-09-06 23:21:32.574529+00:00/usr/local/go/src/io/io.go:354

-- Lots and lots of lines that look like backtraces

2025-09-06 23:21:32.577187+00:00runtime.goexit({})
2025-09-06 23:21:32.577193+00:00/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000088fe8 sp=0xc000088fe0 pc=0x483161
2025-09-06 23:21:32.577199+00:00created by github.com/dgraph-io/badger/v4.Open in goroutine 1
2025-09-06 23:21:32.577216+00:00/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.5.0/db.go:315 +0xc4d
2025-09-06 23:21:32.577233+00:002025-09-06T23:21:32.577233478Z
2025-09-06 23:21:32.577250+00:00goroutine 69 gp=0xc000702e00 m=nil [select]:
2025-09-06 23:21:32.577266+00:00runtime.gopark(0xc000089760?, 0x2?, 0x80?, 0x9b?, 0xc00008974c?)
2025-09-06 23:21:32.577286+00:00/usr/local/go/src/runtime/proc.go:435 +0xce fp=0xc0000895d8 sp=0xc0000895b8 pc=0x47abce
2025-09-06 23:21:32.577307+00:00runtime.selectgo(0xc000089760, 0xc000089748, 0x0?, 0x0, 0x0?, 0x1)
2025-09-06 23:21:32.577324+00:00/usr/local/go/src/runtime/select.go:351 +0x837 fp=0xc000089710 sp=0xc0000895d8 pc=0x457cb7
2025-09-06 23:21:32.577340+00:00github.com/dgraph-io/badger/v4.(*DB).updateSize(0xc000db1b08, 0xc000c62930)
2025-09-06 23:21:32.577357+00:00/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.5.0/db.go:1205 +0x13e fp=0xc0000897c0 sp=0xc000089710 pc=0x1bf4f1e
2025-09-06 23:21:32.577373+00:00github.com/dgraph-io/badger/v4.Open.gowrap2()
2025-09-06 23:21:32.577391+00:00/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.5.0/db.go:335 +0x25 fp=0xc0000897e0 sp=0xc0000897c0 pc=0x1bf0005
2025-09-06 23:21:32.577408+00:00runtime.goexit({})
2025-09-06 23:21:32.577425+00:00/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc0000897e8 sp=0xc0000897e0 pc=0x483161
2025-09-06 23:21:32.577442+00:00created by github.com/dgraph-io/badger/v4.Open in goroutine 1
2025-09-06 23:21:32.577472+00:00/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.5.0/db.go:335 +0xe56
2025-09-06 23:21:32.578617+00:002025-09-06 23:21:32,578 INFO exited: storagenode (exit status 2; not expected)

Digging deeper, each time it tries to restart, syslog throws:

Sep  6 16:21:32 truenas zed[125735]: eid=1662 class=data pool='TheBigPool' priority=0 err=52 flags=0x1008081 bookmark=27159:45869806:0:143
Sep  6 16:21:32 truenas zed[125739]: eid=1663 class=data pool='TheBigPool' priority=0 err=52 flags=0x1008081 bookmark=27159:45869806:0:143
Sep  6 16:21:32 truenas zed[125743]: eid=1664 class=data pool='TheBigPool' priority=0 err=52 flags=0x1008081 bookmark=27159:45869806:0:143

A zpool status reveals:

root@truenas[~]# zpool status -v
  pool: TheBigPool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 21.8M in 00:00:00 with 0 errors on Sat Sep  6 09:26:00 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        TheBigPool                                ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            935698bc-2eaf-4e70-b4f0-6a7ac29af87a  ONLINE       0     0 3.04K
            c6a601db-976d-4bd6-a40e-169329479a24  ONLINE       0     0 3.04K
            f52208e0-32ca-47e0-9cdf-bc94211e8323  ONLINE       0     0 3.04K
            cdd5a75a-7ac3-4c5f-a1d0-42f2b86066f4  ONLINE       0     0 3.04K
            58ead5a4-3fe8-4ae5-ba9a-9d6419cdf8c4  ONLINE       0     0 3.04K
            4577f924-97e5-42e1-a88c-884c4b690a1a  ONLINE       0     0 3.04K
        special
          mirror-3                                ONLINE       0     0     0
            21a10d62-c26f-4471-932e-946537ba2e7d  ONLINE       0     0     0
            149770a7-f6cd-434d-98e2-1e85d4f293f8  ONLINE       0     0     0
        logs
          mirror-2                                ONLINE       0     0     0
            14f4713f-7a11-4416-ae0d-e52441511975  ONLINE       0     0     0
            a141fc7d-5822-449d-bf6f-cc4f9b14814c  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/u5
        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/ws
        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/pf
        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/in
        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/7x
        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/d4
        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/tl

-- Loads more blobs directories and the occasional file within a blobs directory

        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/5c
        /mnt/TheBigPool/storej-node/data/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/tj
        TheBigPool/storagenode:<0x1>

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:31 with 0 errors on Mon Sep  1 03:45:32 2025
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme4n1p3  ONLINE       0     0     0

errors: No known data errors
root@truenas[~]#

Is there any chance I can save the node? Would deleting all the blobs directories listed in the zpool output help? I'm assuming I'd take some hit for that (hopefully recoverable).

Are there any other actions I can try at node startup that might show more information?

Cheers.

It may survive, but don't delete the blobs dir. It contains millions of files, and it's unlikely all of them are damaged.

First I would be looking at why all the disks of the raidz pool have checksum errors: are they all on the same controller? Power cable? Interface cable? Memory?

Faulty disks usually show read/write errors as well.

If it's only Storj on the pool, I would run "zpool scrub TheBigPool", followed by the brutal "zpool clear TheBigPool"; this will clear the error status for the corrupted files. Then try restarting the node.
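
Roughly, that sequence would look like the following; this is just a sketch using the pool name from your zpool output, and the scrub should be allowed to finish before you clear:

zpool scrub TheBigPool        # re-read and verify every block, repairing what the raidz2 redundancy allows
zpool status -v TheBigPool    # check scrub progress and review any remaining errors
zpool clear TheBigPool        # afterwards, reset the error counters / damaged status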

If you are using TrueNAS Apps to run Storj, you may have to reinitialise the Apps section by removing each app, then removing the "apps" data store, and then rebooting. Remember to note all the settings for your apps first.

3 Likes

I was only thinking of the ones listed by the “status” command as having errors.

My guess is that because the pool is spread across the 6 disks, when a file is reported as corrupt it counts against all of them. Note the error count is identical across all 6.

I haven't gone as far as removing all the apps and the store (yet, as that would be a real pain to rebuild some stuff), but I did remove and re-install Storj, with no difference.

I was guessing that a “scrub” might be useful, but was going to check on the TrueNAS board first.

Cheers.

The "TheBigPool err=52" error indicates ZFS checksum errors or an I/O error, meaning one or more disks in the pool are experiencing issues. First, power down the server, check and reseat all data cables, and then run a long SMART self-test on the drives that make up TheBigPool.
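
For example, something along these lines for each drive (the device names here are placeholders; map the partuuids from your zpool output to real devices with lsblk first):

smartctl -t long /dev/sdX    # start the long self-test; it runs in the background and can take hours
smartctl -a /dev/sdX         # once it completes, review the self-test log and error counters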

I suspect one or more drives may have offline uncorrectable errors, perhaps due to a cable, overheating, or another hardware issue.

If the long tests pass on the drives, then you may have to reinstall the Storj app in TrueNAS. Just point the new install of Storj at the old Storj app paths and don't destroy or remove any data from the old install. I think Storj may deal with any bad files itself.

2 Likes

This indicates (from my experience) a problem with the controller (overheating?), a dodgy cable, or insufficient power (overheating?). Not a single disk.

3 Likes

Cool.

As you can't restore them from backup, you will need to use "zpool clear" to remove them from the list. Just deleting the file will still leave it in the "zpool status" list, just without a name.

The "zpool clear" doesn't remove the damaged file; it merely clears the damaged status.

1 Like

It will disqualify your node for losing customers' data. So, no, it's not an option.

This usually happens if there is a hardware issue.
But you may try to stop the application, delete the bin subfolder from the storage location, and start the application again.

Wouldn't deleting the application in TrueNAS do that? Or would I have to resort to:

Cheers.

I also use ZFS, but my Storj disks have no redundancy. Since your array does, I would also try a scrub.

How many files list errors? If it's just a few, your node may survive with an acceptable number of missing files.

But if the audit percentage goes under 96%, the node is disqualified and it's game over.

1 Like

No. It shouldn't delete the user's data when the app is deleted. You need to do that yourself.

What controller are you using? What backplane? What cables? Maybe they wiggled out of the connectors. What type of memory? What motherboard? Run memtest for a few hours. Did you update anything recently? TrueNAS? Firmware?

What controller-related messages are in /var/log/messages? What is in dmesg?

You have raidz2, so you have plenty of redundancy. Fix the hardware issues, then run a scrub. As others pointed out, it's unlikely all six disks crapped out, so check everything, including the power supply.

It's very possible your EIO (52) error is somewhere between the HBA and the backplane. If it was FreeBSD I could give you specific search terms for messages and dmesg, but with Linux you'll have to look through the logs.
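
As a rough starting point (the exact driver names depend on your HBA, so treat these grep patterns as examples only):

dmesg -T | grep -iE 'ata|scsi|sas|reset|i/o error'
grep -iE 'zio|checksum|i/o error' /var/log/messages | tail -n 100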

Once you fix the issues, a scrub should recover all the data. I don't expect anything to be lost, except some unfinished transactions that happened since the issue manifested.

ZFS ALWAYS reports checksum errors per disk. So if half a dozen otherwise healthy disks suddenly have the same number of errors, something is EXTREMELY wrong with the way data gets from the disks to the CPU.

Check the cables aren't loose, then run a memtest, then scrub again. If it still happens, switch the drive controller/HBA and replace the cables, then scrub again. If the drives are running through a backplane, connect them directly to the mainboard/HBA.

Even if files are spread across disks, if a read error occurs on drive A, or a request from drive B returns bad data, it's ALWAYS reported on just the drive that misbehaved and is directly responsible for the error.

It’s incredibly unlikely 6 disks went bad at once, so the error must be somewhere else in the signal path.

2 Likes

Not sure I’m following. There’s nothing named bin in the data or identity directories.

Or, if you're talking about the bin directories under the docker/overlay2 hierarchy, how do I identify the correct one(s) out of 67? None contain any files with the substring "stor" in the name.

Cheers.

It’s the Pro version of this: ZimaCube Pro Personal Cloud – Zima Store Online

The backplane, and the connectivity from the backplane to the motherboard, are unique to these NAS units.

The system had been running for 127 days with no changes, other than the docker apps updating as needed.

The only messages I saw in the logs prior to the issue were a variation on the ones I’m seeing now:

Sep  6 01:17:24 truenas kernel: zio pool=TheBigPool vdev=/dev/disk/by-partuuid/14f4713f-7a11-4416-ae0d-e52441511975 error=5 type=2 offset=17205694464 size=131072 flags=3145856
Sep  6 01:17:24 truenas kernel: zio pool=TheBigPool vdev=/dev/disk/by-partuuid/14f4713f-7a11-4416-ae0d-e52441511975 error=5 type=2 offset=17205825536 size=131072 flags=3145856
Sep  6 01:17:24 truenas kernel: zio pool=TheBigPool vdev=/dev/disk/by-partuuid/14f4713f-7a11-4416-ae0d-e52441511975 error=5 type=2 offset=17205956608 size=131072 flags=3145856
Sep  6 01:17:24 truenas kernel: zio pool=TheBigPool vdev=/dev/disk/by-partuuid/14f4713f-7a11-4416-ae0d-e52441511975 error=5 type=2 offset=17206087680 size=131072 flags=3145856

Cheers.

zpool status reports 141 data errors, all appearing to be in the blobs hierarchy and mostly directories. Counting the directories under blobs gives 4101, so I make that a lot bigger percentage than 4.
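
For reference, one way to reproduce those counts (paths taken from the zpool output earlier in the thread):

zpool status -v TheBigPool | grep -c '/storage/blobs/'                 # entries flagged with permanent errors
find /mnt/TheBigPool/storej-node/data/storage/blobs -type d | wc -l    # directories in the blobs hierarchy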

Cheers.

It's only when a piece is accessed that you get penalised for it being corrupt/missing. If you're lucky, pieces get deleted and never accessed - no harm, no foul.

Twelve months after a hardware emergency, I still get 3 - 6 audit failures a month on a node - at this rate they mean nothing.

It should be in the data location folder, which you mounted to the container as config. For example, if you have a pool named data with a dataset storagenode, containing two further datasets, config and identity, then the bin folder will be in the config dataset, i.e. /mnt/data/storagenode/config/bin.
Please note, the app should be stopped when you remove that folder.
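
With the paths from earlier in this thread, that would be roughly the following; this assumes the folder actually exists, and the app must be stopped first:

# stop the Storj app in the TrueNAS UI, then:
rm -rf /mnt/TheBigPool/storej-node/data/bin
# start the app again afterwards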

Here’s my data/config directory:

root@truenas[/mnt/TheBigPool/storej-node/data]# ls -l
total 603
-rwxrwx--- 1 apps root  14046 Sep  6 09:44 config.yaml
-rwxrwx--- 1 apps root      0 Sep  7 00:00 node.log
-rwxrwx--- 1 apps root 102819 Jun  8 00:00 node.log-20250608.gz
-rwxrwx--- 1 apps root  17733 Jun 15 00:00 node.log-20250615.gz
-rwxrwx--- 1 apps root  16153 Jun 22 00:00 node.log-20250622.gz
-rwxrwx--- 1 apps root  17978 Jun 29 00:00 node.log-20250629.gz
-rwxrwx--- 1 apps root  16296 Jul  6 00:00 node.log-20250706.gz
-rwxrwx--- 1 apps root  19087 Jul 13 00:00 node.log-20250713.gz
-rwxrwx--- 1 apps root  16676 Jul 20 00:00 node.log-20250720.gz
-rwxrwx--- 1 apps root  21415 Jul 27 00:00 node.log-20250727.gz
-rwxrwx--- 1 apps root  18297 Aug  3 00:00 node.log-20250803.gz
-rwxrwx--- 1 apps root  16465 Aug 10 00:00 node.log-20250810.gz
-rwxrwx--- 1 apps root  18021 Aug 17 00:00 node.log-20250817.gz
-rwxrwx--- 1 apps root  21011 Aug 24 00:00 node.log-20250824.gz
-rwxrwx--- 1 apps root  20576 Aug 31 00:00 node.log-20250831.gz
-rwxrwx--- 1 apps root 712494 Sep  7 00:00 node.log-20250907
drwxrwx--- 4 apps root      4 Jan  4  2025 orders
drwxrwx--- 2 apps root      2 Sep  4 21:01 retain
-rwxrwx--- 1 apps root  32768 Sep  2 15:55 revocations.db
-rwxrwx--- 1 apps root      0 Sep  6 16:20 setup.done
drwxrwx--- 8 apps root     57 Sep  6 01:15 storage
-rwxrwx--- 1 apps root    936 Sep  5 22:34 trust-cache.json
root@truenas[/mnt/TheBigPool/storej-node/data]#

Cheers.

Interesting. Does it download a binary every time you start the app?

I really don't know how the docker apps work under the covers in TrueNAS SCALE.
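
For reference, the container's mounts and entrypoint can be inspected with standard docker commands (the container name is whatever TrueNAS assigned; docker ps will show it):

docker ps --format '{{.Names}}\t{{.Image}}'    # find the storagenode container
docker inspect <container-name> | less         # check the Entrypoint, Cmd and Mounts sections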

Cheers.