Node keeps restarting after update to 1.4.2

My node has just updated itself to 1.4.2, and since then I have been seeing two errors in the log:

2020-05-20T13:23:18.010Z ERROR piecestore:cache error getting current space used calculation: {“error”: “lstat config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/2b/udecucjz2d3oyxs6zii2xre3e2moya4grui5h6vnizqgm36p3q.sj1: structure needs cleaning; lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/qb/63ntgmmw2xbdfyas6nhllfcr3a23buejbygiyzs4j4vt2rp6pa.sj1: structure needs cleaning; lstat config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/3t/q75gcp6uccnydjeh6ux63kfwp37d2zdsdqpw4qvcspjtbmytba.sj1: structure needs cleaning”, “errorVerbose”: “group:\n— lstat config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/2b/udecucjz2d3oyxs6zii2xre3e2moya4grui5h6vnizqgm36p3q.sj1: structure needs cleaning\n— lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/qb/63ntgmmw2xbdfyas6nhllfcr3a23buejbygiyzs4j4vt2rp6pa.sj1: structure needs cleaning\n— lstat config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/3t/q75gcp6uccnydjeh6ux63kfwp37d2zdsdqpw4qvcspjtbmytba.sj1: structure needs cleaning”}

Error: lstat config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/2b/udecucjz2d3oyxs6zii2xre3e2moya4grui5h6vnizqgm36p3q.sj1: structure needs cleaning; lstat config/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/qb/63ntgmmw2xbdfyas6nhllfcr3a23buejbygiyzs4j4vt2rp6pa.sj1: structure needs cleaning; lstat config/storage/blobs/6r2fgwqz3manwt4aogq343bfkh2n5vvg4ohqqgggrrunaaaaaaaa/3t/q75gcp6uccnydjeh6ux63kfwp37d2zdsdqpw4qvcspjtbmytba.sj1: structure needs cleaning

The node then keeps restarting itself: it serves some data, then hits the space-usage error, which causes it to restart again.

Have you tried turning it off and then back on again? :smiley:
A full shutdown of the machine can sometimes get rid of weird errors… but yeah, no clue here really…

Is it Windows or Linux ?

Linux, in Docker. It has been running perfectly fine and has around 7 TB of data stored on the node.

Is your partition table GPT or MBR ?

Did you use fdisk or gdisk to partition the disk?

It's a bit more complicated than that :slight_smile:

I have LVM configured: it's a 15 TB LVM volume using XFS, sitting on top of a hardware RAID setup. The hardware RAID controller is not showing any errors or failed disks.

The partition table question still applies…

The filesystem is secondary to the partition table. MBR with 512-byte sectors is limited to 2 TB (its 32-bit LBA field allows 2^32 sectors × 512 bytes ≈ 2.2 TB). If the partition is larger than 2 TB, GPT should be used; if MBR is used anyway, the sector size would need to be increased to 4096 bytes.

sudo gdisk -l /dev/your_disk

That should report the partition table type.


Here is the output of that command:
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
MBR: not present
BSD: not present
APM: not present
GPT: not present

Creating new GPT entries.
Disk /dev/sdo: 175786426368 sectors, 81.9 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): A5AF73A4-C3AE-490B-B4A1-6986692F2758
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 175786426334
Partitions will be aligned on 2048-sector boundaries
Total free space is 175786426301 sectors (81.9 TiB)

Number Start (sector) End (sector) Size Code Name

As you can see, there is no partition information: LVM takes over the whole disk, so there is no partition on it in the traditional sense.

LVM can do that… but sometimes it's useful to have a partition table anyway…

The problem you listed may or may not have something to do with the partition and filesystem… it’s difficult to determine.

In any case, you can check if there are any warnings about the PV headers:

sudo pvck --dump headers /dev/sdo

But, if you are using an 82 TB RAID array, my guess is you already know how to manage it.

This is almost certainly a file system issue. Try running xfs_repair and see if that fixes it.
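Something along these lines; a rough sketch, assuming the XFS logical volume is /dev/mapper/vg0-storj mounted at /mnt/storj and the container is called storagenode (substitute your own device, mount point, and container name):

```
# stop the node so nothing writes to the volume while it is being repaired
docker stop -t 300 storagenode

# xfs_repair refuses to run on a mounted filesystem, so unmount it first
sudo umount /mnt/storj

# dry run: -n reports problems without changing anything on disk
sudo xfs_repair -n /dev/mapper/vg0-storj

# if that looks reasonable, run the real repair
sudo xfs_repair /dev/mapper/vg0-storj

# remount (assumes the mount point is defined in /etc/fstab) and restart the node
sudo mount /mnt/storj
docker start storagenode
```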

So it seems that the metadata on the XFS volume had become corrupted, and xfs_repair couldn't fix it.

I'm working through the entire volume to see if there are any bad blocks, but xfs_repair is returning so many errors that I doubt it will be able to recover the volume; I don't have much faith in it at this point.

Maybe worth checking cables and whatnot… sometimes a bad SATA cable can really mess with your day…
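If you want a quick look without opening the case, SMART data can hint at link or cable trouble; a climbing UDMA_CRC_Error_Count usually points at cabling rather than the disk itself. Roughly like this, assuming the member disks are visible to the OS; behind a hardware RAID controller you may need the vendor passthrough option (e.g. -d megaraid,N for LSI/MegaRAID):

```
# overall health plus the SMART attribute table for one disk
sudo smartctl -H -A /dev/sda

# behind a MegaRAID-style controller, address each physical disk through the controller device
sudo smartctl -H -A -d megaraid,0 /dev/sda
```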


So it seems that the metadata on the XFS volume had become corrupted

I realize it doesn’t help the current situation… but ZFS helps with that problem.

Adding for other readers: perhaps it would be useful to add a recommendation to the documentation (if it hasn't already been done; I haven't checked) for Linux Docker nodes to run ZFS for large storage systems…

There are so many different configurations to choose from… but the more I see and read on the forum, the more sure I am that ZFS is the way to go for a Storj node.

I like ZFS, but I also kind of hate that one gets locked into the pool sizes a bit too easily…
But it's rock solid; I cannot see a ZFS pool dying without neglect…
Though I have only been using it a short time, I've been so mean to it that it's kind of ridiculous…

Pulling the power, pulling drives from my raidz while running, even down below the redundant drives… it's not happy about it… but it managed to not lose a byte.
It's not so easy working at smaller scales though… you really want to add drives in sets of at least 4, and it quickly adds up…

I’ve been able to restore my storagenode, but want to pass this info along in case it helps anyone.

My neighborhood lost power for about 15 minutes this morning, taking my Linux host down with it. When power returned, the host booted itself back up, and Docker restarted the storagenode container (thanks to --restart unless-stopped). Unfortunately, the container boot-looped every 30-45 seconds. The final error message from the container was:

Error: lstat config/storage/blobs/v4weeab67sbgvnbwd5z7tweqsqqun7qox2agpbxy44mqqaaaaaaa/v4/n47zlhp7aqjmgwdb2jxumxeg3yqrrzgbramrctxte66gjomaeq.sj1: structure needs cleaning

I found this forum thread and tried the xfs_repair recommendation.

```
$ xfs_repair -v /dev/sdd
xfs_repair: /dev/sdd contains a mounted filesystem
xfs_repair: /dev/sdd contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library
```
I wasn't confident after seeing that fatal error, so I took a chance on rebooting and letting the system report any filesystem problems. Fortunately, rebooting did the trick (others here may be able to explain why, relative to fsck or whatever else runs during boot).

My system is now successfully running:

  • v1.5.2 (storagenode:latest as of this morning)
  • hosted by Docker version 19.03.8, build afacb8b7f0
  • on Ubuntu 20.04 LTS (Focal Fossa)
  • Storj data hosted on single drive, Ext4, with 21.2% free space

Hope this helps someone else.

The information above was premature. It turns out that the storagenode runs for 20-30 minutes, then fails with structure needs cleaning. When Docker restarts the container, it fails within the first minute, and then it's back to the boot-looping problem. After rebooting the host, the storagenode will again run for longer periods. As I have time today, I'll try to further diagnose and resolve.

xfs_repair doesn’t work on a mounted file system. Unmount first, then run it.
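For the case above that would be roughly the following; just a sketch, assuming the data disk is /dev/sdd mounted somewhere like /mnt/storj and the container is named storagenode (adjust to your own device, mount point, and names; and since the drive was listed as Ext4 earlier, e2fsck would be the matching tool rather than xfs_repair):

```
docker stop -t 300 storagenode   # stop the node so the filesystem can be unmounted
sudo umount /mnt/storj           # unmount the data filesystem
sudo xfs_repair -v /dev/sdd      # for an Ext4 partition use instead: sudo e2fsck -f <partition>
sudo mount /mnt/storj            # remount (assumes an /etc/fstab entry)
docker start storagenode
```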