Node suspended on us2 and europe-north-1

It’s really hard to do unless DSM has already detected a problem; then it will offer the option to run a check on reboot. Otherwise it requires manually stopping a lot of DSM services over SSH and unmounting the array. I had to do it a few years ago and had to search for a guide. I don’t recommend this as a first step, and it also takes ages on large arrays.
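For reference, the manual route looked roughly like this. This is only a sketch, assuming an ext4 volume mounted at /volume1 — the device path is a placeholder and yours will differ, so check /proc/mounts first:

```sh
# Sketch only: device and volume names are assumptions, verify yours before running anything
grep volume1 /proc/mounts            # note the /dev/... device backing /volume1

sudo syno_poweroff_task -d           # stop most DSM packages and services (DSM 6 style)
sudo umount /volume1                 # unmount the data volume
sudo fsck.ext4 -pvf /dev/vg1000/lv   # replace with the device noted above
sudo reboot
```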

To be honest, I think you’re on the wrong track with storage performance. They’re using an SSD read/write cache and IO wait is low.

@mi5key have you had any unclean shutdowns of your NAS? If so, fsck may be an option, and I can try to find the guide I used for this a while back.

Yeah, but SSDs tend to fail hard. Synology would have notified them or crashed, like in my case. (I’m still a little pissed about how badly the NAS handled that on a redundant cache. It looks like the cache works a little differently on DSM 7, so hopefully they fixed that. I was still on DSM 6 at the time.)
@mi5key you could try removing the cache and recreating it if it was created on an older version of DSM. Storage Manager will show in the cache info whether it is still running the old version of the cache system, so look for that. Of course, with a new cache the first few file walker runs will be slow. I wouldn’t turn the file walker off, though; with an SSD cache it barely has an impact. How large is the cache, btw?


Removing the SSD cache now. Will work on the rest of the stuff also. Got suspended from two more this morning. Got nothing to lose now.

The only unclean shutdowns have been when the storj docker wouldn’t stop and hung. I’m removing the SSD cache as we speak.

According to watchtower, storj was upgraded to 1.70.2 on 1-5-23, and my problems started directly after that. If nothing changes from the suggestions above, I’ll try the release candidate 1.71.0-rc. After that I’m out of options, and I’ll be kicked out of Storj before the issue can be fixed.

Prior to 1.70.2 I was at 100% on everything; now I’m not. Nothing else has changed on the array. No errors reported (granted, DSM may not be the best at that), and everything else I use the NAS for works 100%: Plex and Surveillance Station. Usage has not changed dramatically.

These disks have been together for 2 years or so, no issues.

Do you run different nodes on individual disks, or one big node on a RAID with all the disks? If you have more than one node, they are not upgraded at the same time; for my 2-node Diskstation, I see a few days’ difference between updates. All my nodes on 8 Diskstations are at 1.70.2 now, and they work perfectly.

It’s one big RAID with all the disks. Only running one node right now.

This is why the Storj team and old forum members recommend not using RAID. When it fails, you lose everything on all disks, and there is no income advantage to running one big node vs. multiple nodes.

All disks are 7200 RPM. The cache has been removed from the volume. Thanks for the suggestions; this is truly frustrating.

Ok, turned off most of the stuff mentioned earlier: Mem Comp, DDoS protection, Spectre mitigations. Limited the container to 800m, offloaded logs to the array, stopped the storagenode docker container, rm’d it to clear the logs, and rmi’d the image to re-pull it, just for completeness’ sake.
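Roughly what that looked like, for anyone following along. This is a sketch, not my exact command — the wallet, email, address, paths, and storage size are placeholders and need to match your own setup:

```sh
# Stop and remove the old container (this also clears its logs), then re-pull the image
docker stop -t 300 storagenode
docker rm storagenode
docker rmi storjlabs/storagenode:latest
docker pull storjlabs/storagenode:latest

# Re-create with a memory cap and the node log written to the array instead of the container
docker run -d --restart unless-stopped --name storagenode \
  --memory=800m \
  -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
  -e WALLET="0x0000000000000000000000000000000000000000" \
  -e EMAIL="you@example.com" \
  -e ADDRESS="your.ddns.example:28967" \
  -e STORAGE="8TB" \
  --mount type=bind,source=/volume1/storj/identity,destination=/app/identity \
  --mount type=bind,source=/volume1/storj/data,destination=/app/config \
  storjlabs/storagenode:latest \
  --log.output=/app/config/node.log
```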

Been up for about 45 minutes, docker stats shows better numbers. Hopefully this fixes something and numbers improve.

I’m also rsync’ing the storagenode to an external drive and prepping an RPi in case I need to move this node to it.
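If anyone wants to do the same, something like this works (the paths are just examples for my layout): a first pass while the node is running, then a final pass with the node stopped to catch whatever changed in the meantime:

```sh
# First pass while the node is still running (source/destination paths are examples)
rsync -aHh --info=progress2 /volume1/storj/data/ /volumeUSB1/usbshare/storj/data/

# Stop the node, then a final pass with --delete so the copy matches exactly
docker stop -t 300 storagenode
rsync -aHh --delete --info=progress2 /volume1/storj/data/ /volumeUSB1/usbshare/storj/data/
```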

Everything has returned to normal and the node is unsuspended on all satellites. My guess is that the logs were filling up and bloating the container. I’ve put log rotation into place. That’s what happens when Storj runs so well for so long that you forget about the little stuff.
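For reference, since I moved the log onto the array, a plain logrotate rule is enough. A sketch, assuming the node.log path from the run command above and that logrotate is available on your box — the size and count are arbitrary:

```sh
# Minimal logrotate rule for the node log (path, size, and count are assumptions)
cat > /etc/logrotate.d/storagenode <<'EOF'
/volume1/storj/data/node.log {
    size 100M
    rotate 5
    compress
    missingok
    copytruncate
}
EOF
```

copytruncate lets the node keep writing to the same file handle, so no restart is needed when the log rotates.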

Thanks all for your help.

You can set it to error; the info level should be used only for special occasions.
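For example, the log level can be passed as a parameter after the image name in the run command (or set as log.level in config.yaml); the rest of the command is omitted here:

```sh
# Append after the image name in the docker run command
docker run -d ... storjlabs/storagenode:latest --log.level=error
```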
