ZFS pool didn’t re-mount automatically

OK, new problem. I tried restarting the VM. Once it finished booting, the ZFS pool didn’t re-mount automatically as it usually does. After manually remounting and checking the storage node container, it’s stuck in a restarting loop.

    STATUS
    Restarting (1) 19 seconds ago

Any ideas? Looks like a restart took my bad situation and made it worse.

Check if the paths to identity and storage are correct.

Would you happen to know what those directories are for Ubuntu Server? I haven’t needed to pay attention to those since I set the node up over 4 months ago.

Please look at your logs. It should state why it is restarting.

I see these errors looping in the log:

Error: piecestore monitor: disk space requirement not met

    2020-05-08T13:51:11.528Z    WARN	piecestore:monitor	Disk space is less than requested. Allocating space	{"bytes": 22814736384}
    2020-05-08T13:51:11.528Z	ERROR	piecestore:monitor	Total disk space less than required minimum	{"bytes": 500000000000}
    2020-05-08T13:51:11.528Z	ERROR	piecestore:cache	error persisting cache totals to the database: 	{"error": "piece space used error: context canceled", "errorVerbose": "piece space used error: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceSpaceUsedDB).UpdatePieceTotals:174\n\tstorj.io/storj/storagenode/pieces.(*CacheService).PersistCacheTotals:100\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:80\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:56\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
    2020-05-08T13:51:11.529Z	ERROR	gracefulexit:chore	error retrieving satellites.	{"error": "satellitesdb error: context canceled", "errorVerbose": "satellitesdb error: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:100\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:54\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:56\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
    2020-05-08T13:51:11.529Z	ERROR	nodestats:cache	Get pricing-model/join date failed	{"error": "context canceled"}
    2020-05-08T13:51:11.529Z	ERROR	collector	error during collecting pieces: 	{"error": "piece expiration error: context canceled", "errorVerbose": "piece expiration error: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:39\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:483\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:88\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:55\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/storj/storagenode/collector.(*Service).Run:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:56\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

I’m not sure how to troubleshoot this. I remounted the pool, but it now looks like the pool no longer has a mount directory. I’ll troubleshoot that later. In the meantime, your input is appreciated.

Is the array mounted to the OS (not at the Docker stage yet), and are the files there? If not, that will need to be sorted out first. Run lsblk and make sure the disks are all there; if they are, then see if the zpool is available.
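For example, something along these lines (device names are examples and will differ on your system):

    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT   # are all member disks visible with the right sizes?
    sudo zpool status                    # is the pool imported and healthy?
    sudo zpool import                    # if nothing is imported, is the pool available to import?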

Your node is seeing only about 22GB of free space according to the logs. Most likely the storage isn’t mounted, so the node is trying to write to the small system disk instead.
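You can confirm that from the OS side. Assuming the node’s data path is /storj (substitute your own), check which device actually backs it:

    df -h /storj      # if this reports the size of the boot disk, the pool isn't mounted there
    findmnt /storj    # prints nothing if /storj is not currently a mountpoint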

You should be glad the node isn’t starting, because if it had started without the data disk/array mounted, it would have been disqualified in no time.

Double-check where the array is mounted and make sure it’s a static mount. Also check the data location in the docker run command. If everything is correct, restart the node.
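For reference, the parts of the run command to verify are the two bind mounts. A stripped-down sketch (paths are placeholders and other flags are omitted, so match it against your existing command):

    docker run -d --name storagenode \
        --mount type=bind,source=/path/to/identity,destination=/app/identity \
        --mount type=bind,source=/storj,destination=/app/config \
        storjlabs/storagenode:latest

If the source= path for /app/config isn’t the directory the pool actually mounts to, the node will happily write somewhere else.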


It should be written in the docs as a recommendation not to use the root of a mounted disk as storage, but rather to create a subdirectory. Maybe that’s what the OP did, and that’s why they avoided getting disqualified.


For some reason ZFS isn’t mounting the pool at start-up. I do believe the data still exists on the array, yes. lsblk does display all the drives needed for the pool with their correct capacities. If I manually import the pool I get this error:

    no pools found: some devices require root privileges
    cannot import 'storj': a pool with that name already exists
    use the form 'zpool import <pool | id> newpool' to give it a new name

But then when I re-run zpool status, the pool is suddenly there when it didn’t appear before.

Going backwards now: during system startup it displayed:

    [FAILED] Failed to start Mount ZFS filesystems.
    See 'systemctl status zfs-mount.service' for details.

Running that command showed me this:

    ● zfs-mount.service - Mount ZFS filesystems
       Loaded: loaded (/lib/systemd/system/zfs-mount.service; enabled; vendor preset: enabled)
       Active: failed (Result: exit-code) since Sat 2020-05-09 01:14:50 UTC; 49s ago
         Docs: man:zfs(8)
      Process: 1693 ExecStart=/sbin/zfs mount -a (code=exited, status=1/FAILURE)
     Main PID: 1693 (code=exited, status=1/FAILURE)

    May 09 01:14:50 storj-node systemd[1]: Starting Mount ZFS filesystems...
    May 09 01:14:50 storj-node zfs[1693]: cannot mount '/storj': directory is not empty
    May 09 01:14:50 storj-node systemd[1]: zfs-mount.service: Main process exited, code=exited, status=1/FAILURE
    May 09 01:14:50 storj-node systemd[1]: zfs-mount.service: Failed with result 'exit-code'.
    May 09 01:14:50 storj-node systemd[1]: Failed to start Mount ZFS filesystems.

So it sounds like the root of the problem is that my zpool configuration failed in some manner. I’m not sure how to troubleshoot this from here, or what caused it to happen in the first place.

It looks like this is a deeper ZFS problem that may not be directly related to the Storj software. If we can get ZFS behaving again, I assume the storagenode should pick it up and be happy again.

And I hope I don’t get disqualified. Starting from scratch really sucks.

I did mount it to the root directory of the storage volume. Is there a particular reason as to why you shouldn’t?

And you make it sound like if I had used a sub-directory it would have resulted in my disqualification. Not sure how directories would have made a difference here. :smiley:

Try unmounting the ZFS pool and ensuring that the directory is empty before remounting. From this error it looks like either there is already data in that directory that is causing the mount to fail, or something else is already mounted there.
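Roughly, and only as a sketch (the pool name and path are assumed from your posts, so double-check before running anything):

    sudo zpool export storj     # cleanly un-import the pool
    ls -la /storj               # anything still here lives on the boot disk, not the pool
    sudo mv /storj /storj.stray # set stray contents aside rather than deleting them
    sudo zpool import storj     # re-import; ZFS should now mount into a clean directory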

Assuming I understand what you mean by ensuring the directory is empty: it is not. This was a previously working Storj setup with about 6TB of data, and suddenly now it isn’t. I don’t want to wipe out my whole node just because the pool refuses to re-mount properly for some other unknown reason.

Nope, that’s definitely not what I want to have happen. At some point your node ran without the ZFS volume mounted, so your /storj directory likely has data in it even when the pool is not mounted. That’s what I think that error is from.

Linux systems mount devices onto directories, so if there is already data in that directory before mounting, it looks like ZFS doesn’t like that. Your other option is to create a new directory, have ZFS mount your zpool to that new mount point, and modify your storage node run command to point to the new directory.
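If you go that route, a minimal sketch might be (the new path is just an example):

    sudo mkdir -p /mnt/storj
    sudo zfs set mountpoint=/mnt/storj storj   # mount the pool's root dataset there from now on
    sudo zfs mount storj
    # then change the storage bind mount in the docker run command to source=/mnt/storj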

Just spitballing ideas based on the errors that you have posted from the ZFS logs.

Ah, OK. You made me discover something. Even though the pool isn’t mounted, the directory is still there for some reason, and df -h reports it as the size of the boot disk.

I renamed this mystery folder and imported the pool again, this time with no error messages. Running df -h again, I now see my zpool with its full capacity, both total and used.

Do you think it’s safe to restart the storagenode? With my luck, this glitch caused data loss somewhere and the moment I start it up I’m going to get disqualified.

I would venture to say you’re probably OK. Although if your node was running without the array mounted and taking in data for any decent period of time, then when it gets queried for those pieces it won’t be able to provide them. That’s not a deal breaker for customer data, since the network should accommodate it (although it’s not ideal from a data resiliency perspective), but if you get audited against any missing data then you may run into trouble.

Also make sure your fstab, or however your zpools are mounted at boot, points to the correct directory.
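On Ubuntu, zpools usually aren’t in fstab at all; they’re re-imported and mounted by the ZFS systemd units at boot. Something like this should show whether that machinery is in place (hedging a bit here, since packaging varies between releases):

    systemctl status zfs-import-cache.service zfs-mount.service zfs.target
    zpool get cachefile storj           # "-" (the default) or /etc/zfs/zpool.cache is expected
    zfs get mountpoint,canmount storj   # confirm where the pool will land at boot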

Then it’s concerning to report that this mystery directory does contain Storj files/folders, so it may have some customer data. Drat.

On another note: it seems that when restarting the server, the pool fails to remount, and I have no idea what’s causing that. This is concerning, because it seems reasonable to believe this failure to mount is what caused the initial problem, which means a manual mount will probably just break again down the road.

Unfortunately I am not super well-versed in ZFS on Linux (or ZFS in general), having only used it a couple of times before and never having run into an issue like this. I’m wondering if this has something to do with passing through the HBA; I’ve had mixed experiences with that in the past, but if you’ve been running the node like this for quite a while then I’m skeptical that’s the issue.

I’m wondering if the order of operations during boot is that Docker, and then the storagenode, start before the disk is mounted, causing data to be added to the directory before the mount and making the mount fail.

What are the contents of /lib/systemd/system/zfs-mount.service and /lib/systemd/system/docker.service?

I started up the storagenode again. I’ll find out by tomorrow-ish if I get kicked from the network. I do have a method I can try for running it on the same server rather than in a VM with HBA pass-through; if I get kicked, that will be my next go-to. I also have to wonder: after about 4 months, why would it up and quit now?

Though I did replace a disk due to write errors about 3 weeks ago… maybe related, maybe not.

zfs-mount.service:

    [Unit]
    Description=Mount ZFS filesystems
    Documentation=man:zfs(8)
    DefaultDependencies=no
    After=systemd-udev-settle.service
    After=zfs-import.target
    After=systemd-remount-fs.service
    Before=local-fs.target
    Before=systemd-random-seed.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/sbin/zfs mount -a

    [Install]
    WantedBy=zfs.target

docker.service:

    [Unit]
    Description=Docker Application Container Engine
    Documentation=https://docs.docker.com
    BindsTo=containerd.service
    After=network-online.target firewalld.service containerd.service
    Wants=network-online.target
    Requires=docker.socket

    [Service]
    Type=notify
    # the default is not to use systemd for cgroups because the delegate issues still
    # exists and systemd currently does not support the cgroup feature set required
    # for containers run by docker
    ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
    ExecReload=/bin/kill -s HUP $MAINPID
    TimeoutSec=0
    RestartSec=2
    Restart=always

    # Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
    # Both the old, and new location are accepted by systemd 229 and up, so using the old location
    # to make them work for either version of systemd.
    StartLimitBurst=3

    # Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
    # Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
    # this option work for either version of systemd.
    StartLimitInterval=60s

    # Having non-zero Limit*s causes performance problems due to accounting overhead
    # in the kernel. We recommend using cgroups to do container-local accounting.
    LimitNOFILE=infinity
    LimitNPROC=infinity
    LimitCORE=infinity

    # Comment TasksMax if your systemd version does not support it.
    # Only systemd 226 and above support this option.
    TasksMax=infinity

    # set delegate yes so that systemd does not reset the cgroups of docker containers
    Delegate=yes

    # kill only the docker process, not all processes in the cgroup
    KillMode=process

    [Install]
    WantedBy=multi-user.target

Your node has been suspended on 118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW 1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE 121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs 12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB.

I assume that if it does start working again, getting un-suspended is just a matter of being patient and waiting, yeah?

Correct, at least from what I have read here. If you are just suspended and not disqualified, then your node has to pass a certain audit percentage to get fully vetted again.

I would be curious about adding zfs-mount.service to the After= line in the docker.service file, alongside the other services there, so that Docker waits until the zpool is mounted before attempting to start. While some brief digging hasn’t turned this up as a known issue, it may be a good workaround.

This would then look like:

    After=network-online.target firewalld.service containerd.service zfs-mount.service

After modifying this you would have to run "systemctl daemon-reload" to apply the new configuration, but I am wondering if that will help you with this issue. No idea why it became an issue now rather than a long time ago.
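One caveat: a package update can overwrite the unit under /lib/systemd/system, so if you do try this, a drop-in override is probably the safer way to carry the change (just a sketch):

    # created via "sudo systemctl edit docker.service", which writes
    # /etc/systemd/system/docker.service.d/override.conf
    [Unit]
    After=zfs-mount.service

systemd merges the drop-in’s After= with the one already in docker.service, so you don’t have to repeat the existing entries.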

I’m trying to avoid presenting really extreme options so as not to screw things up for you, since you’ve already had a decent bit of node availability trouble with this.

My gut recommendation is to get it mounted in a stable manner and then try not to reboot the VM for at least a month, to give yourself more audit room. What I would do is not apply the changes I recommended right away, although you can if you really want to. If you give it a little while before implementing them and then test with a reboot, then hopefully, if it doesn’t work, reverting the configuration and getting your node back online won’t end in catastrophe.


Whatever you end up doing, please update this thread as I’m curious how this works out for you as well as for anyone else running into something like this.

Your wording could be misunderstood. You mean you used the root of the disk as your storagenode base directory? The reason you shouldn’t is the following: the mountpoint always exists, whether your disk is mounted or not. If the disk is not mounted and you start the storagenode, it thinks "hey, there is an empty directory, let’s create a new storagenode". As soon as you receive the first data, your storagenode is doomed to be disqualified: either because the old data is missing, or, once you correct the problem and mount the hard disk, because the new data is missing.
If you use a subdirectory on the other hand, that directory only exists if the disk is mounted. Without it the node will not start, and therefore will not create a new node with all the implications that brings.

The opposite is true.
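To make that concrete, a minimal sketch of the subdirectory approach (names are just examples):

    # with the pool mounted at /storj, create a subdirectory on the pool itself
    sudo mkdir /storj/storagenode

    # point the storage bind mount of the run command at the subdirectory:
    #   --mount type=bind,source=/storj/storagenode,destination=/app/config

If /storj is ever not mounted, /storj/storagenode won’t exist, --mount refuses to bind a missing source path, the container doesn’t start, and no empty replacement node gets created.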