High load due to high disk usage

I’ll try that (I’ve seen it in the other thread). It seems that there’s a restart, which then runs filewalker, and then there’s a high load spike. When I restart my nodes (for example on a Docker bugfix update), I don’t have that spike.

64GB

So not really useful… But I’ll test it to see how much time it takes to load.

Yes, it’s limited to 1TB, so it’s about 100% of the expected space :slight_smile:

Thanks for pointing that out!

Yes, but with RAID set up, most of the I/O is attributed to the RAID device, so you can't easily see where the writes occur.

The main issue is that it only happens from time to time; 99.9% of the time everything's fine.

@Alexey is there a configuration for the number of concurrent reads/writes? I only see settings for concurrent connections.

Magnificent, this confirms the hypothesis!

So, you have enough RAM to fit the metadata, and that's why restarting just the node is harmless. But when you restart the machine, the cache is of course empty and the disks are overwhelmed by having to also service metadata fetches.

As a workaround, unless you can implement some persistent cache solution (no idea what's available on Linux), you may want to run stat on every file before the node starts (e.g. find /mnt/storagenode | xargs stat) to pre-heat the cache by forcing all metadata to be read.
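
For example, a slightly more robust variant of that one-liner (just a sketch; the path is the example from above, so adjust it to your mount point):

# Read every file's metadata once so it lands in the inode/dentry cache before
# the node starts; -print0/-0 only guards against unusual file names, and
# ionice -c3 keeps the warm-up from starving other I/O.
ionice -c3 find /mnt/storagenode -print0 | ionice -c3 xargs -0 stat > /dev/null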

On the other hand, this is exactly what the filewalker on start is going to do anyway. So it already works as expected.

If you disable the filewalker you will avoid this spike on start, but this means every file access will take longer and your node will be losing more races.

I see. Databases, however, do keep growing. But it does not matter. It seems the reboot, with the empty cache as a side effect, is the culprit. Your caching works as designed otherwise.

Maybe don’t reboot the server :slight_smile:

Maybe I wasn’t clear about it, but it’s not a reboot, it’s a restart of Storj because of the timeout.

I don’t have the issue if I restart the server, or if I restart the Storj node. Only if it’s restarted because of the timeout issue. At first I just thought it was a planned task, but the logs indicate that there’s a specific restart of the node.

Or perhaps it starts here.

Anyway, it’s clearly high disk I/O that creates the cascade of events.

May we hear about the disk type, model number, manufacturer, and size?

I do not think so:

> ./storagenode setup --help | sls "conc|write|read"

      --filestore.write-buffer-size memory.Size                  in-memory buffer for uploads (default 128.0 KiB)
      --graceful-exit.num-concurrent-transfers int               number of concurrent transfers per graceful exit worker (default 5)
      --pieces.write-prealloc-size memory.Size                   file preallocated for uploading (default 4.0 MiB)
      --retain.concurrency int                                   how many concurrent retain requests can be processed at the same time. (default 5)
      --storage2.max-concurrent-requests int                     how many concurrent requests are allowed, before uploads are rejected. 0 represents unlimited.
      --storage2.min-upload-speed-congestion-threshold float     if the portion defined by the total number of alive connection per MaxConcurrentRequest reaches this threshold, a slow upload client will no longer be monitored and flagged (default 0.8)
      --storage2.monitor.verify-dir-readable-interval duration   how frequently to verify the location and readability of the storage directory (default 1m0s)
      --storage2.monitor.verify-dir-readable-timeout duration    how long to wait for a storage directory readability verification to complete (default 1m0s)
      --debug.trace-out string           If set, a path to write a process trace SVG to

maybe only retain.concurrency?

You mean on non-permanent memory? Such as /run/shm/ (which on Linux is mounted directly in RAM)? That would recreate the databases at every restart, wouldn’t it?

As the issue wasn’t there « before », I don’t think it matters. Here are some of my disks:

Disk /dev/sda: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: HGST HUH721008AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: A1DAC874-C203-43C0-AD9E-F332B525B73A

Another one, on another server:

Disk /dev/sda: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: HGST HUS724020AL
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 258B49DD-2F12-46A2-9D47-006E1A56977D

As it refers to « requests », I don’t think it’s an I/O limitation. For example, I can serve 100 clients at the same time and be limited by my bandwidth rather than by the disk’s speed.

Perhaps there’s an issue with Linux 6.x kernels too, as a chown was blocking as well. I’ll wait for a new lock-up, now that I’ve restarted all my nodes using a storj user instead of the root user, which may have less access to I/O priority and may not lock the system down the way it has for the last weeks/months on some filewalker runs.

I’m running a chown storj:storj on all directories of the storage, one by one and in chunks of 100 files. There are 16,439,647 files in the storage. As it writes the new owner, I get the same high load (~16) as the one generated by the filewalker. So it seems there’s an issue with RAID (md arrays) or something along those lines.
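
For reference, it’s roughly this (a sketch; the path is an example for my mount point, and 100 is the chunk size mentioned above):

# Re-own everything in chunks of 100 entries at a time, so the job can be
# interrupted and resumed without losing much progress.
find /mnt/storagenode/storage -print0 | xargs -0 -n 100 chown storj:storj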

Question: why does the filewalker write to the disk? I mean a lot of writes. Perhaps it does the garbage cleanup, which may need to delete a lot of things.

I’ll strace the process next time I see it.
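
Probably something like this (just a sketch; the pgrep pattern depends on how the node is run):

# Attach to the running storagenode process and count syscalls;
# Ctrl-C after a minute or two prints the summary table.
strace -f -c -p "$(pgrep -f storagenode | head -n1)"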

The used-space file walker does not write anything (except maybe updating the numbers in the database at the end of the process).

The GC and trash file walkers perform writes for each file found to have been removed by customers. We have recently observed (one example here) that most data uploaded by customers is later removed, which implies a lot of file operations. It wasn’t so in the past; this is a change in (average) customer behavior.

Nice ingress today, though.

the quoted option is related to I/O:

And a “request” in this context means checking the expiration state of a piece and moving it to the trash.
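
If you want to experiment with lowering it, a minimal sketch (the value 1 is only an example; the option can be set either in config.yaml or as a flag):

# In config.yaml (restart the node to apply):
retain.concurrency: 1

# or as a command-line / docker run argument:
# --retain.concurrency 1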

I’ve run fsck which gave:

/dev/md10: 14476221/244047872 files (0.1% non-contiguous), 713326650/976165888 blocks

So it seems there’s not much fragmentation on this disk.
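
(For anyone wondering, that summary line comes from a forced, read-only check, roughly like the following; ideally run it with the node stopped, and /dev/md10 is my array:)

# -f forces the check, -n answers "no" to every repair prompt, i.e. read-only.
e2fsck -fn /dev/md10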

OK, sorry for the misunderstanding.

Thanks for this information. As we can see in my first screenshot, it’s indeed the used-space file walker, which is in « R » state (running), so not blocking. The node is in D state (blocked/waiting for resources), so it may be a combination of the two that leads to my high I/O, because with a lot of reads, the disk (which is not an SSD) cannot also handle a lot of writes.

I must add that running a big chown on 4M Storj files, as root because all files are root-owned, goes nuts too. Running it through ionice -c3 kept the load at a reasonable level (~3 instead of ~16). So it goes back to my hypothesis that running the Storj node as root may allow the node to exhaust I/O and make the whole thing really bad. AFAIR the option to use a dedicated user didn’t exist at first on Docker; it was added maybe a year back and I didn’t know about it. So I’m switching to a dedicated user and will see if it works better (I’m quite sure it will).
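
Concretely, it’s the same pipeline as before with the priority drop added (again just a sketch with example paths):

# Same chunked chown, but each chunk runs at idle I/O priority (class 3),
# so the node and the rest of the system get served first.
find /mnt/storagenode/storage -print0 | xargs -0 -n 100 ionice -c3 chown storj:storj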

nice, we can rule out fragmentation.

Could not find whether the disks are CMR or SMR… I guess they are CMR then.

yes, I believe there is a problem with RAID, since

Regular small write operations must not freeze the whole system.

Oh, one idea here. mdraid by default uses a large write-intent bitmap, which may even double the amount of writes in some circumstances. I have personally seen this happen a long time ago. I no longer use mdraid for anything, so I had almost forgotten about this problem…

I find it likely that a large chown could trigger this case for storage node data on a default-formatted ext4.

To test this hypothesis you can probably just remove the bitmap as a temporary measure; you can add it back again later (see the command sketch after the list below).

If this hypothesis is true, there are some ideas how to work around this problem:

  • Use an external bitmap.
  • Reduce the size of the bitmap (that is, use a large mdadm --bitmap-chunk), so that there’s a bigger chance that a single bitmap write will cover many intended changes. In case of failure this means larger rebuilds though, so this is a trade-off. Also, it does not always help.
  • Next time you format ext4, set mke2fs -g «blocks-per-group» to a large number—maybe even go for a single group for the whole file system. This improves chances that inodes modified are close together, but probably also makes regular operations a bit slower, so it’s a trade-off.
  • Switch to a (good) hardware RAID controller :rofl: this is actually one of the two benefits of hardware RAID controllers, as they usually have write-intent bitmap/log stored in their internal, battery-backed RAM.
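
A rough sketch of the mdadm commands for removing the bitmap and for the first two ideas above, assuming an array called /dev/md10 (a placeholder; adjust to yours). Note that while the bitmap is removed, an unclean shutdown means a full resync:

# Temporarily remove the internal write-intent bitmap:
mdadm --grow --bitmap=none /dev/md10

# Add it back later, optionally with a much larger chunk so a single
# bitmap update covers more in-flight changes:
mdadm --grow --bitmap=internal --bitmap-chunk=512M /dev/md10

# Or keep the bitmap as a file on a different (ideally faster) device:
# mdadm --grow --bitmap=/path/on/other/device/md10-bitmap /dev/md10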

Thanks for the info. I agree that the issue doesn’t seem to come from Storj, but Storj may have triggered it with more concurrent requests.

For the RAID, I see that I have one server with RAID10:

md10 : active raid10 sdc2[2] sdb2[4] sdd2[5] sda2[0]
      3904663552 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 8/30 pages [32KB], 65536KB chunk

Bitmap is active, also on other servers with RAID1:

md2 : active raid1 sdb2[0] sda2[1]
      7812843520 blocks super 1.2 [2/2] [UU]
      bitmap: 17/59 pages [68KB], 65536KB chunk

Oh, I didn’t read that properly: you mean disable it, of course. As it may take hours to complete, I’ll try this when possible.

Moving to hardware RAID instead of software RAID is of course a good option, but quite expensive in my case (rented servers, so it’s not a one-time purchase!).

Did you stop using mdraid because you use hardware solutions? Or do you not mind having to reinstall everything on a disk failure? Or perhaps you only use SSDs, which should last longer… if there are no intensive writes :slight_smile:

I’ve never used RAID, or any mirroring/parity solutions for storage nodes.

I used to sysadmin a bunch of servers at my previous company. It was a small, startup-like place, so we did not have a dedicated sysadmin, and I had to take on some of those responsibilities myself. They had money for availability, so it was all RAID and such. Some hardware RAID, some software, mostly because until recently there was no good solution for hardware RAID on NVMe drives. This is also where most of my experience with RAID comes from. I’m no longer there though, and now, working for a bigger company, I’m rather far away from the infrastructure stuff.

For home stuff and the storage nodes themselves, I don’t need availability. I do have good backups though. For example, if the boot/system drive in my storage node box failed, I could have it back in maybe 20 minutes, with all services running. And given this box is all very old hardware (HP Microserver gen7 from 2013 with a 3ware 9690 controller card from 2007) except for drives, I suspect failure of a motherboard, power supply or the controller is more likely than an almost new 20 TB drive. So I don’t really see the point of securing against specifically HDD failure.

I also have a bunch of various old drives for some personal projects. Drives work under btrfs raid1 and btrfs raid0 modes, built mostly because I wanted to consolidate them into a single file system for convenience while using small drives.

I think the opposite is true. Old, time-proven hardware that has worked for decades will likely continue working. New drives, on the other hand, may have factory defects.

I’m referring to the infant mortality slope of the bathtub curve.

Same. The TrueNAS config backup file is a script that configures the system from a blank slate to the current configuration. Restoring the backup essentially replays those commands on top of the new system. So if the boot drive fails, I just unplug the thumb drive, plug in another one, and restore the system state. Much less than 20 minutes.

I took the opposite approach. Absolutely everything is on a single high-performance pool. Everything. Storage nodes, backups, databases, services, media storage, … everything.

I took measures once to make it fast, and every single piece of software or service benefits from it. I’m also using old used drives, never new, so the probability of failure is even lower. In fact, since switching to exclusively buying used drives, I haven’t had a disk failure in the past 8 years. I did replace a handful because I needed more space, but I never had a failed disk. RMAing drives was a regular occurrence for me before that.