I would like to ask for advice regarding the disk load.
Under Docker, I have two nodes running on a Synology NAS (16 GB RAM) on Seagate Exos X16 16 TB disks.
Node 1 (disk 2) is 14 TB full, Node 2 (disk 5) is 8 TB full; the network is 1000/300 Mbps.
On the 8 TB node, the filewalker runs the disk at 100% load for a while after an update; that is normal.
The 14 TB node, on the other hand, runs at maximum load constantly, whether or not it is restarted.
Is it possible to optimize this somehow? I am afraid the disk will not withstand this load running continuously.
If you have some SSD space, then a ZFS L2ARC might work. Its size is limited by RAM: with your current setup of two drives, I would say at least 4 GB (better 8 GB) of RAM and a 200 GB L2ARC. I don’t have accurate numbers, because my Pi 5 is limited by RAM and I didn’t test how big the L2ARC would need to be.
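For example (a sketch only; the pool name `tank` and the device path are placeholders for your own pool and SSD):

```sh
# Add an SSD partition as L2ARC (read cache) to an existing ZFS pool.
# "tank" and the device path are assumptions; substitute your own values.
zpool add tank cache /dev/disk/by-id/ata-YOUR_SSD-part1

# Afterwards, watch how much of the read load the cache absorbs:
zpool iostat -v tank 5
```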
Unfortunately, both volumes are formatted as ext4, which has been the right choice so far. I have a spare Intel D3-S4610 480 GB SSD, although it might not be enough for this load.
Unfortunately, with a single SSD I can only create a read-only cache.
When I previously migrated this node from the Synology btrfs RAID to the recommended layout of a separate ext4 disk, the disk load dropped significantly. I don’t understand why there is now this constant 100% load.
I also noticed that even though there is 16 GB of memory in the NAS, total RAM usage is only 16%. I should somehow let Storj, running in Docker, use more RAM, to see if that would help.
The constant load can be caused by several processes:
used-space filewalkers, which can run for days in lazy mode;
the TTL collector (it runs every hour by default and removes TTL-expired pieces);
the trash filewalker (it runs every 24 hours to clean the trash);
the garbage collector + retain (it runs for each Bloom Filter received from the satellites), roughly one to three times per week.
All of them can run in parallel if lazy mode is enabled. The retain process used to have a concurrency of 5 (meaning it processed 5 satellites in parallel); the default has recently been changed to 1, but if you still have the old setting in your config, it will use that. The relevant options are shown below.
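For reference, these are the config.yaml options involved (the values shown are, to my knowledge, the current defaults; check `storagenode setup --help` for your version):

```yaml
# Run the filewalkers with low I/O priority (slower, but less disruptive):
pieces.enable-lazy-filewalker: true

# Number of satellites the retain (garbage collection) process handles in
# parallel. The default used to be 5 and was recently reduced to 1:
retain.concurrency: 1
```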
There are a few shortcomings of Synology DSM to bear in mind:
All disks carry a RAID1-mirrored partition that the OS runs from, no matter whether you set the pool up as RAID 5/6 or anything else. This is an implicit design decision of DSM for the sake of resilience.
All the disks have to spin for write-sync operations on that partition. Even if you have dedicated one disk to one container, that disk still has to handle OS-type I/O.
One lucky detail is that read operations on that RAID1 do not involve all disk members; only the nominated primary disk handles them. Your X16 is the poor busy guy.
You can verify this structure with the built-in Linux mdadm, as shown below.
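For example, over SSH (DSM’s arrays are standard Linux md devices; md0 is usually the system partition):

```sh
# List all md arrays; the RAID1 system partition spans every disk in the NAS.
cat /proc/mdstat

# Show the members of the system array (md0 on a typical DSM install):
mdadm --detail /dev/md0
```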
Btrfs is not the cause. To a certain extent, btrfs is even better than ext4 in the Synology world. I have a 14 TB WD nearly full of pieces on DSM 7.2, and its peak utilization during the garbage-collection process is very acceptable, even though the Bloom filter file is not small. Your graph is quite surprising to me.
Synology’s Docker writes container logs to the OS/app partition mentioned above. If your Storj log level is set to a verbose scope such as INFO, that disk’s performance can be greatly affected.
I suffered from this before. Now I only log the level I am interested in, to reduce the write frequency.
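If you want to do the same, these are the relevant config.yaml settings (the log file path is just an example and must be writable from inside the container):

```yaml
# Log only warnings and errors instead of every INFO line:
log.level: warn

# Optionally write the log to the data volume instead of Docker's JSON log
# on the system partition (example path; adjust to your container mounts):
log.output: "/app/config/node.log"
```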
By editing config.yaml, I also raised filestore.write-buffer-size, to increase the probability that the write head can complete more customer requests in one pass of the track. (Um… yes, I have plenty of RAM equipped…)
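A sketch of that change (the default is 128 KiB; 4 MiB is commonly suggested as large enough to hold a whole piece in RAM before flushing, but size it to your RAM budget):

```yaml
# Buffer each incoming piece in RAM and write it in one sequential pass.
# Default is 128.0 KiB; 4.0 MiB should cover the maximum piece size.
filestore.write-buffer-size: 4.0 MiB
```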
The most helpful approach is to add an SSD and map the database files to it via config.yaml. I think it is even more useful than setting up an SSD cache pool, and it helps you avoid the errors about locked database files. Of course, the cache-pool approach is very easy to manage in DSM; that is Synology’s advantage.
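A sketch of the relevant setting (the paths are examples; the SSD directory must exist and be mounted into the container before the node starts):

```yaml
# Move the node's SQLite databases off the data disk onto the SSD.
# Example container path; map it in with e.g. -v /volume2/storj-db:/app/dbs
storage2.database-dir: "/app/dbs"
```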
Please note: disabling the startup piece scan (storage2.piece-scan-on-startup: false) also disables the used-space filewalkers, and the trash usage would then be incorrect (until this bug is fixed in the next release). So it’s better to remove this option with the next release, to correct the usage in the databases (and on the dashboard).
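For clarity, this is the line in question; if it is in your config.yaml, removing it (or setting it to true) re-enables the scan on the next restart:

```yaml
# Skips the used-space filewalker at startup; remove for the next release:
storage2.piece-scan-on-startup: false
```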