Node suspended on us2 and europe-north-1

Got this message today on the dashboard

This has been running all day, no failure messages

sudo docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -i failed

I have Uptime Kuma running in GCP hitting both IP:port and DNS:port, and it has not failed a single ping.

I don’t know what’s wrong.

RAM seems to be spiraling out of control

htop

sudo docker stats

I don’t know what’s wrong, but my gut reaction would be to fully shut down and restart the node.

You could grep the logs for ERROR lines first. Audits seem fine as they’re at 100%.
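
For example, something along these lines (assuming the container is still named storagenode, as in the command above; --tail just keeps the output manageable):

```
# Show only ERROR-level lines from the most recent part of the log
sudo docker logs --tail 1000 storagenode 2>&1 | grep ERROR

# Or narrow it to failed audit/repair traffic specifically
sudo docker logs --tail 1000 storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -i failed
```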


Yeah, that’s been the temporary fix the last two times. The problem is that docker stop storagenode does nothing; it just doesn’t respond. Sometimes I need to hard-boot the box, which has its own dangers.
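
For what it’s worth, a stuck container can usually still be force-killed without rebooting the whole NAS; a rough sketch below (the 300-second timeout is just an example value):

```
# Give the node up to 300 seconds to shut down cleanly before Docker sends SIGKILL
sudo docker stop -t 300 storagenode

# If it still doesn't respond, kill just the container instead of hard-booting the box
sudo docker kill storagenode
```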

@brandon could this be related to the audits being turned on for us2 and europe-north-1?

@mi5key could you post as much of the docker logs as you can so we can see what’s going on?

```
Post logs between three backticks
```

The log killed the browser tab; it’s 65MB.
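
(For reference, a smaller slice of the log can be pulled instead of the whole 65MB file; this is just a sketch, and the output path is an example, so point it anywhere on the volume.)

```
# Only the most recent entries, filtered to errors
sudo docker logs --tail 500 storagenode 2>&1 | grep -i error

# Or only the last hour, written to a file instead of the terminal/browser
sudo docker logs --since 1h storagenode > /volume1/docker/storj/last-hour.log 2>&1
```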

Memory is at 1.6GB and rising after a full restart.

sudo docker logs storagenode - no longer responds.

Anything from anyone at storj?

Certainly, if you are using SMR drives, that may be the issue: their slow writes cause the write cache to fill up, which would explain your high memory condition.

Of course, there are many different configurations of nodes, so it’s difficult to guess what might be causing your issue. Have you made any changes to it lately?


These are non-SMR drives in a Synology NAS. No issues for years until recently. Docker is current with no recent updates, there have been no DSM updates within the last 3 months, and I’m not sure when the last Storj update was.

Seriously looking at downgrading to an earlier version of storj if that’s even possible.

Well I know @BrightSilence runs at least some of his nodes on Synology NAS hardware. So, he might be able to help you troubleshoot this.

I don’t recognize this issue; I’m not seeing it on my nodes. But then, not all Synology NASes are the same.

@mi5key are you on DSM7 yet? If so, you can now see IO wait in the resource monitor on the CPU tab. I’m guessing it’s high. (this is usually the cause of high memory usage)
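
If you want to double-check outside the GUI, the same numbers can be read from an SSH shell on the NAS (a quick sketch; iostat is only available if the sysstat tools are installed, so the plain top output may be all you have):

```
# iowait shows up as the "wa" value in the CPU line
top -b -n 1 | head -n 5

# Per-disk utilisation and wait times, if iostat is available on your DSM
iostat -x 5
```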

Either way, I’d need to know some more info to see where the bottleneck may be.

  • Which model NAS do you have?
  • How much memory does it have now (have you upgraded it)?
  • How many HDDs and what kind of array?
  • Are you certain none of them are SMR? (Unfortunately, HDD manufacturers used to hide this for a long time, and it’s still frequently only shown on spec sheets.)
  • Do you use SSD cache, and if so, is it read/write?
  • How many nodes do you run on the system?

It would speed up the drop in score, but I doubt it has an impact on performance. The increased RAM usage suggests this is probably an unrelated issue. I’m not seeing this on any of my nodes, so I’m pretty certain the answer is no.


Looks like your Synology has problems with high RAM usage, or your disk is dying.

This likely will not help. If your node is just hanging, there are hardware issues, either with the storage or with the RAM.

This is a DS1821+ with eight 8TB drives, WD and Seagate IronWolf, no SMR; I’m 99% sure, as I did a lot of research before picking these drives. It has 32GB of RAM (not Synology approved, but it’s been installed for 3 years now).

I’m on DSM7 (DSM 7.1.1-42962 Update 3, up to date), IO wait is currently averaging 1-2%, and the load average is 0.86.

Docker stats shows the process RAM at 2.9GB; it was ~5GB earlier, so it has gone down after the restart. I’m not sure what it normally was, as I’ve never watched it this long to know the ‘normal’ baseline.

docker logs storagenode still will not respond.

SSD r/w cache.

All disks, including the SSDs, are reporting good health.

This is too high for the storagenode; I suspect that your storage is pretty slow. This could be an indication of problems with at least one disk.
I would recommend checking the disks via S.M.A.R.T.
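
Synology shows this in Storage Manager, but if smartctl is present on your DSM the same check can be run from a shell (the device names below are examples; adjust for your system):

```
# List the drives the system sees (newer DSM uses /dev/sata*, older uses /dev/sd*)
ls /dev/sata* /dev/sd? 2>/dev/null

# Health summary plus the full attribute table for one drive
sudo smartctl -H -A /dev/sata1

# Start an extended self-test; read the result later with: smartctl -l selftest /dev/sata1
sudo smartctl -t long /dev/sata1
```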

Just for reference

CONTAINER ID   NAME           CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O   PIDS
1118c546300b   storagenode2   4.30%     195.6MiB / 24.81GiB   0.77%     81.9GB / 114GB   0B / 0B     76
01b285fe9949   storagenode5   0.16%     69.98MiB / 24.81GiB   0.28%     733MB / 2.01GB   0B / 0B     36

But these nodes work on separate disks with no array, and this is Docker Desktop on Windows, so it also runs inside a Linux VM.

The Storj container is the only one that is hanging. Everything else is operating normally; the other 4 containers work great and respond immediately to log requests.

I can exec a bash shell into the storagenode container just fine; it responds immediately.

mike@mi5keyNAS:~$ sudo docker exec -ti storagenode bash
root@mi5keyNAS:/app#

Yeah, plenty of space available, 20T free

mike@mi5keyNAS:/volume1/docker/storj/config/storage$ df -h
Filesystem              Size  Used Avail Use% Mounted on
/dev/md0                2.3G  1.8G  403M  82% /
devtmpfs                 16G     0   16G   0% /dev
tmpfs                    16G  248K   16G   1% /dev/shm
tmpfs                    16G   21M   16G   1% /run
tmpfs                    16G     0   16G   0% /sys/fs/cgroup
tmpfs                    16G  2.2M   16G   1% /tmp
tmpfs                   3.2G     0  3.2G   0% /run/user/196791
/dev/mapper/cachedev_0   42T   23T   20T  53% /volume1
/dev/usb1p1              13T  880G   12T   7% /volumeUSB1/usbshare
tmpfs                   1.0T  1.0G 1023G   1% /dev/virtualization

Because the storagenode uses your storage unlike anything else. The container itself is not the reason. If your storage is slow, then the storagenode will use more RAM to buffer uploads to your node, because the disk is not able to keep up.

And again, I’m not talking about how much free space is on your disk; the disk is slow to respond and store data. Perhaps reads are slow too, because that leads to a low suspension score, so you definitely have issues with audits. You may search for the exact errors:
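
Something like this, reusing the container name from earlier in the thread (the patterns are the usual audit/repair markers in the node log):

```
# Failed audit and repair requests in the node log
sudo docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -i failed | tail -n 50
```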

So please check your disks first. I mean with fsck (or its UI analogue on your Synology; here I’m not sure how to do it on Synology), then S.M.A.R.T. with the tools available for Synology.

This is the same setup that I have been running for about 2 years. No recent changes. No bigger demands on the box.

Just ran SMART on the WDs and IronWolf Health on the Seagates. All came back fine.

IO wait is low
[screenshot: Resource Monitor, 2023-01-20, showing low IO wait]

I’m going to look into offloading the docker logs to the array and see if anything shows up in there.
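
In case it helps, the node can also write its log straight to a file on the volume instead of Docker’s json log; a sketch based on the log.output option in config.yaml (the file name is just an example):

```
# 1) In config.yaml (the config dir mounted into the container), add or uncomment:
#      log.output: "/app/config/node.log"
#    then restart the container so it picks up the change
sudo docker restart storagenode

# 2) Follow the log from the host; /app/config maps to the config folder on the array
tail -f /volume1/docker/storj/config/node.log
```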


I would stop and rm all of them, just to clear the logs. Then I’d start them with the Filewalker off (see the topic Tuning the Filewalker) to see if the FW is the culprit. You will see high I/O from the garbage collector, but only for a short period. Also, I would set the log level to error to keep the logs at a manageable size.
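
A sketch of both settings, assuming a node version recent enough to have the startup piece-scan option discussed in the Tuning the Filewalker topic (they can go in config.yaml or on the docker run line):

```
# In config.yaml:
#   storage2.piece-scan-on-startup: false   # skip the filewalker scan at startup
#   log.level: error                        # keep the log at a manageable size

# Or as flags appended to the existing docker run command (mounts/ports elided):
docker run -d --name storagenode ... storjlabs/storagenode:latest \
  --storage2.piece-scan-on-startup=false \
  --log.level=error
```
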
Other things to consider on Synology:

  • memory compression OFF
  • DDOS protection OFF
  • Spectre and Meltdown protection OFF (this one I’m not sure if it matters for the performance of the node; I didn’t see any difference on a 4 TB node).

Activate the Performance history to see how resources are used after a restart.
Restart the DiskStation, manually update the nodes, then start the nodes.
Also, I see you use SSDs for cache, maybe they are the problem? BrightSilence had a nasty surprise with them.
Also… Docker is reporting the wrong memory usage; I think it also shows the cache used, not just the memory actually in use.
You should see the true usage in Performance tab, after running the nodes for 12 hours.