So I moved the SQLite databases of my storage node to a separate SSD to improve performance, and after that I noticed the lifespan of my SSD is decreasing like crazy, around 1% every 4 days.
Using the `inotifywait` command I do see a huge amount of disk activity on the *.wal db files:

`inotifywait -m -e modify dbpath`
So my question is: is there any setting/config to mitigate/avoid this?
The wal files are indeed written to frequently, but these are not synchronous writes, so they should not be contributing that much to SSD wear. So you will see a lot of them in inotify, but the kernel should be smart enough to only send a write command after collecting several writes to these files. If this is not the case, then you indeed have a problem—but that problem is with kernel resources, not node settings.
As such, I’d try looking for other sources of writes.
I think this is the issue in my case. To test my assumption, I already moved the SQLite db files to a separate HDD, and from that point my SSD's lifespan does not decrease at that rate anymore.
From my understanding (so correct me if I'm wrong), *.wal files are normal physical db files which need to be written to disk, so with a storagenode holding hundreds of thousands of small segments, the amount of data it needs to write to disk is huge, I think.
No, I use a Crucial SSD.
.wal files are written sequentially, and with no forced synchronous writes, so in theory they shouldn’t be a problem. But if moving the database files elsewhere solved the problem, then this clearly was a problem for your case.
Frankly, I would really like to learn why it happened.
What file system are you using on that SSD? Do you have any RAID there, or any other volume management subsystems?
This looks like write amplification. Perhaps your SSD pretends to have 512 byte sector size.
This is about the wal file: Write-Ahead Logging. You may be able to control it somewhat.
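If you want to experiment with that, SQLite exposes a few WAL-related knobs via pragmas. A minimal sketch using the `sqlite3` CLI against a throwaway database (the path is made up, and the storagenode may reapply its own settings when it reopens its databases):

```shell
DB=/tmp/example.db   # throwaway database, not a real storagenode db

sqlite3 "$DB" <<'SQL'
PRAGMA journal_mode=WAL;        -- enable (or confirm) write-ahead logging
PRAGMA wal_autocheckpoint=4000; -- checkpoint every ~4000 pages instead of
                                -- the default 1000, batching more writes
PRAGMA synchronous=NORMAL;      -- in WAL mode, fsync only at checkpoints
SQL
```

Note that journal_mode persists in the database file, but wal_autocheckpoint and synchronous are per-connection settings, so the node's own connections won't inherit them.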
If I were you I would put databases to tmpfs.
Micron SSDs vary drastically in how they handle writes. They sell a separate class of SSDs optimized for small writes; whatever they use for the consumer Crucial brand is likely the bottom of the barrel.
That might be true as well…
Yeah, I also really want to know. I'm using ZFS and I set the record size to 4KB. Based on arrogantrabbit's suggestion, I think I should set it to 512?
That is a way to go, I think, but I don't see any setting in the config file that allows that; the storagenode config only has a setting for dbpath.
I would not.
By messing with sqlite you will introduce instability. Since the databases storj creates have negative utility value (zero pros, all cons), you can keep them in RAM on tmpfs instead.
It will be more stable and use zero SSD endurance.
This will radically solve the problem.
If for some reason you want them to persist, you can copy them to the ramdisk at start and copy them back to persistent storage on shutdown.
But realistically, how often do you shutdown…
> how often do you shutdown…

Usually every 1 or 2 months.
I mean I want to move all the *.wal files to RAM but still keep all the *.db files persisted on disk.
Moving wal to ram is entirely pointless. It’s a journal, to save the day when power is lost during transaction. Keeping it in ram defeats the purpose.
Read them into a ramdisk on boot and offload them back to persistent storage on shutdown. Two lines of script.
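A minimal sketch of that approach (all paths here are assumptions; point DB_DIR at your node's dbpath and RAM_DIR at a tmpfs mount, e.g. one created with `mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk`):

```shell
# Copy the node's databases into a ramdisk before start, and back to
# persistent storage after shutdown. Paths are placeholders.
DB_DIR=${DB_DIR:-/mnt/storj/dbs}   # persistent copy of the *.db files
RAM_DIR=${RAM_DIR:-/mnt/ramdisk}   # tmpfs mount the node's dbpath points at

load_dbs() {   # run this before starting the node
    cp -a "$DB_DIR"/*.db "$RAM_DIR"/
}

save_dbs() {   # run this after stopping the node
    cp -a "$RAM_DIR"/*.db "$DB_DIR"/
}
```

The dbpath setting mentioned earlier would then point at RAM_DIR; anything not copied back before a power loss is gone, which is the trade-off being discussed here.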
Ah, sorry, I can’t help with zfs. But there are some experts on zfs on the forum…
4 kb record size is too small. Don’t change defaults unless you have a very good reason to do so. Default on ZFS is 128k, and compression on.
I understand your SSD is in its own separate pool?
You would need to destroy it and create a new one with a 4096-byte sector size (`zpool create -o ashift=12 …`). You can look up the current pool's ashift with `zpool get ashift …`.
The *.wal files are deleted every time I stop the node, so I think I could change the recordsize and start the node again.
This is not about record size.
ashift hints ZFS about the physical sector size of the underlying storage device. It's a setting for storage devices, not a specific dataset.
Recordsize is the maximum size of a logical block used to store data. It's a setting for individual datasets.
You cannot change the ashift without removing the device first.
Here is an interesting article (I skimmed through it, did not read it thoroughly), but here is a relevant quote:

> ZFS admin is strongly advised to be aware of the actual sector size of his or her devices, and manually set ashift accordingly. If ashift is set too low, an astronomical read/write amplification penalty is incurred—writing 512 byte “sectors” to a 4KiB real sector means having to write the first “sector”, then read the 4KiB sector, modify it with the second 512 byte “sector”, write it back out to a new 4KiB sector, and so forth, for every single write.
>
> In real world terms, this amplification penalty hits a Samsung EVO SSD—which should have ashift=13, but lies about its sector size and therefore defaults to ashift=9 if not overridden by a savvy admin—hard enough to make it appear slower than a conventional rust disk.
Ah OK, I think I read and checked this before; I might need to check it again for my SSD.
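For reference, one way to see what sector sizes a drive reports to the Linux kernel (the device name is a placeholder; substitute your SSD, e.g. nvme0n1):

```shell
# Print the sector sizes the kernel sees for one drive. A drive that
# reports 512/512 here but uses larger sectors internally is the
# "lying" case from the quote above.
DEV=${DEV:-sda}   # placeholder device name
for f in logical_block_size physical_block_size; do
    path="/sys/block/$DEV/queue/$f"
    if [ -r "$path" ]; then
        echo "$f: $(cat "$path")"
    fi
done
```

`smartctl -i /dev/sda` also prints a "Sector Sizes" line, though for SSDs both numbers may still be the emulated values rather than the real flash page size.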