1.3.3 I/O maxed out on Debian stretch/sid

After an automated watchtower upgrade to 1.3.3, I/O wait is constantly very high across an 8-CPU dedicated node.

The dashboard is understandably slow in this state. It used to take 1-2 seconds to come up; now it takes over 5 minutes.

There’s about 6.5TB of data on this node.

It was taking consistent, high traffic before the upgrade, but now I’m guessing it’s too slow to respond and is handling fewer requests. Bandwidth usage is down.

Any suggestions on where to start figuring out what the problem is?

Maybe start by installing iotop and confirming whether the high I/O usage comes from Storj?
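Something like this should show where the I/O is coming from (a minimal sketch, assuming iotop is available from the Debian repos):

sudo apt install iotop
sudo iotop -o -a -d 5    # -o: only processes actually doing I/O, -a: accumulate totals, -d 5: refresh every 5s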

Check the fragmentation of the database files with filefrag. I had a similar problem and seemed to have solved it by defragmenting them.
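Something along these lines, as a sketch (this assumes the node runs in a Docker container named storagenode, the databases sit under storage/, and the filesystem is ext4 so e4defrag applies; stop the node before touching the db files):

filefrag storage/*.db                 # how many extents each database file has
docker stop -t 300 storagenode        # stop the node cleanly first
sudo e4defrag storage/*.db            # ext4 online defragmenter (part of e2fsprogs)
docker start storagenode

On other filesystems you would have to copy the files to a fresh location instead.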

Generally, iowait is caused by waiting for the disks to complete their tasks…

@donald.m.motsinger that’s interesting, maybe that’s why I’m seeing so much activity from Storj…
Running on 5 drives and still seeing them having to work for a living… though I’m on ZFS, so I’m not sure how big of an issue fragmentation is for me…

Also just copied my entire node… 6 TB and, I dunno, something like 3-4 million files… it wouldn’t surprise me if people end up having trouble with file systems at that scale eventually… at least on NTFS, though I’m not sure…

@freedata
I don’t think your particular bandwidth is down… I think that is a collective thing… I’m also at 30-50% less ingress than pre-1.3.3, but I doubt it’s the version’s fault… most likely just less data coming into the network atm…

Have you tried turning it off and on again??? xD
What setup are you running?
Most likely it’s your disk I/O slowing you down… you don’t have some antivirus or something set to scan the blobs folder, do you?

That will ruin your day; best to set it to exclude at least the blobs folder or maybe the entire storagenode folder… personally I exclude the whole storagenode folder, since it’s not active data for my system, just stored data blocks… so it should be safe. Viewer discretion is advised…

Scanning the blobs folder with a deep-scanning antivirus could take a week on a big and slow external HDD.


I ran filefrag on storage/orders.db*:
orders.db: 180 extents found
orders.db-shm: 3 extents found
orders.db-wal: 15 extents found

For context the sizes are:
582M May 3 23:12 orders.db
128K May 3 23:16 orders.db-shm
49M May 3 23:16 orders.db-wal

These don’t look so fragmented.

Yes, it is coming from Storj. This is a dedicated Storj node, and storjnode is the only process with CPU over 0.3%, hovering around 70% consistently. 6 cores are showing 32-82% iowait, and 2 cores are idle.

Yes.

This is an ext4 system.

The stats on the system until 58 hours and 32 minutes ago showed that system iowait was minimal and CPU usage was a fraction of what it is now. 58 hours and 29 minutes ago, the node was auto-upgraded by watchtower to 1.3.3.
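(If you want to compare the before/after on your own node and sysstat happens to be collecting, sar can pull the history; the day-file name below is just an example and the path varies by distro:)

sar -u                              # today's CPU / %iowait history
sar -u -f /var/log/sysstat/sa03     # an earlier day's file
sar -d -p                           # per-device I/O history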

The node went online with the first batch of beta users, many months ago.

True, the bandwidth is variable, but it had been in a clear aggregate pattern for weeks until 58h and 29m ago.

This is a dedicated Debian 9 system. No antivirus. VLAN’d in isolation, with external network security in place.

Thanks for the ideas.

I was thinking 1.3.3 might be doing some DB upgrade that’s consuming I/O. Not sure, but I/O and CPU are being hammered non-stop now.
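To check that theory I’ll grep the node logs for migration/database messages, something like this (assuming the Docker container is named storagenode; the keywords are just a guess at what would be relevant):

docker logs storagenode 2>&1 | grep -iE "migrat|database|error" | tail -n 50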

I’m on Buster I think, which is what Debian 10 is called if it’s 10, lol, the latest one from 2 months ago.
And though I see a lot of I/O to my drives (a RAID 5 with 5 drives), it doesn’t require any CPU or RAM… I think the docker storagenode was at 100 MB last I checked, though it had been rebooted that day, but I’ve never seen it above something like 300 MB, and CPU usage on my 16 threads is usually in the single digits, even though they’re a couple of the slowest 10-year-old Xeons.

And my node is also only slightly less than 2 months old, so fragmentation isn’t an issue. I’m up to 6 TB, and all I really see is high I/O all the time… and have for a long, long time, more I/O than a single HDD could keep up with…

Sure, there might be some new performance issues in the 1.3.3 code.

Have you tried setting max concurrent to something like 15 in the config.yaml? I’m at 20, but I have 4 drives to soak up the data stream and I/O, and 15 was too low for me recently.
That seemed to help me reduce my I/O… and it’s always disk I/O with Storj…
But think about it:

In the last 24 hours I’ve served about 100,000 uploads/downloads.
That’s a lot of roaming around the drive reading data… especially if the data is fragmented, which it quickly can become if one fills up an HDD. Seek time can be a killer for an HDD, because fragmented data can slow a disk to a crawl.

Otherwise, stuff like a write cache can help make the writes more sequential and thus lower I/O.
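One simple thing to check is whether the drive’s own write cache is even enabled; that’s only the disk’s volatile cache, not a proper SSD/RAM caching layer, and /dev/sdX is a placeholder:

sudo hdparm -W /dev/sdX     # show the current write-cache setting
sudo hdparm -W1 /dev/sdX    # turn the drive's write cache on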

Anyway, try changing the max concurrent setting in the config.yaml in your storagenode’s data folder.
I think it helps a lot in keeping the system and the internet connection running smoothly.
Without any limit, it’s a bit like the storagenode can just choke the HDDs or the internet connection.
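Something like this, as a sketch (if I remember the key right it’s storage2.max-concurrent-requests, with 0 meaning unlimited; the node has to be restarted to pick it up, and the container name storagenode is an assumption):

# in config.yaml — cap the number of simultaneous transfers the node accepts
storage2.max-concurrent-requests: 20

docker restart storagenode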