Achieving Hot/Cold Storage as a Storj node operator

cpare · August 5, 2022, 9:14pm

Is there any way for me to use my SSD for hot storage and my WD Reds for cold (inactive file) storage? I often wonder how much of the data I store is temp/cache data that’s deleted a few seconds or minutes later versus data that’s added and kept as a long-term backup. If I could provide 1TB of SSD to cache content until it “aged out”, it would reduce the load on the HDDs, extending their lives, and probably provide better service to customers.

Note: I am NOT advocating for in-memory ephemeral storage, just SSD -vs- HDD

Pac · August 5, 2022, 9:36pm

I don’t think the Storj network is aware of the “type” (cold/hot) of data that gets uploaded to nodes, at any level.
It’s up to customers whether they upload long-term data or not.

cpare · August 5, 2022, 11:32pm

I don’t want the submitter (or network) to decide. This is about giving the node operator options to increase speed by caching the content expected to be accessed in the near future. Certainly there are some basic rules we could use to create pseudo-logic:

New writes go to “hot” storage (SSD)
When “hot” is 80% full, move the oldest content to “cold” (HDD), holding utilization at 80% so there is headroom while operations complete.
When a file “read” request comes in, move all the blocks for that file to “hot” across all Storj nodes (network-wide), on the assumption that the entire file will eventually be read.
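
A minimal sketch of the three rules above as a toy Python model, tracking piece sizes only with no real disk I/O. The class name `TieredStore`, its methods, and the `high_water` parameter are hypothetical and not part of the storagenode software. Treating a read as refreshing a piece's age is an added assumption, and the network-wide promotion in the third rule is not modeled.

```python
from collections import OrderedDict


class TieredStore:
    """Toy hot/cold tiering model (hypothetical, sizes only, no real I/O)."""

    def __init__(self, hot_capacity, high_water=0.80):
        self.hot_capacity = hot_capacity  # size of the SSD tier
        self.high_water = high_water      # the 80% rule
        self.hot = OrderedDict()          # piece_id -> size, oldest first
        self.cold = {}                    # piece_id -> size (HDD tier)
        self.hot_used = 0

    def _evict(self):
        """Move the oldest pieces to cold until hot is back at the limit."""
        limit = self.hot_capacity * self.high_water
        # Keep at least one piece so the newest arrival is never evicted.
        while self.hot_used > limit and len(self.hot) > 1:
            piece_id, size = self.hot.popitem(last=False)
            self.cold[piece_id] = size
            self.hot_used -= size

    def write(self, piece_id, size):
        """Rule 1: new writes always land in hot storage."""
        # Drop any previous copy so usage is not counted twice.
        self.hot_used -= self.hot.pop(piece_id, 0)
        self.cold.pop(piece_id, None)
        self.hot[piece_id] = size
        self.hot_used += size
        self._evict()

    def read(self, piece_id):
        """Rule 3 (local part only): a read promotes a cold piece to hot.

        Returns the tier that served the read. Raises KeyError if the
        piece is not stored at all.
        """
        if piece_id in self.hot:
            self.hot.move_to_end(piece_id)  # assumption: a read refreshes age
            return "hot"
        size = self.cold.pop(piece_id)
        self.hot[piece_id] = size
        self.hot_used += size
        self._evict()
        return "cold"
```

With `hot_capacity=100`, writing three 30-unit pieces pushes usage to 90, past the 80% mark, so the oldest piece moves to cold. Reading that piece afterwards promotes it back and pushes out the next-oldest one.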

This model works especially well for the user with a small SSD (example: 1TB) and a large HDD (example: 20TB 5400rpm). Today those operators likely only offer space on the larger (slower) disk. Even if they pre-cached 10GB to SSD, it could result in a significant performance boost.

SGC · August 6, 2022, 12:35am

you are basically describing a cache; the exact point of a cache is to load often-used data and to buffer writes to reduce the load on the slower devices.

some storage solutions have cache options. ZFS, for example, uses memory (ARC) and optionally an SSD (L2ARC) for read cache. ofc writes are also stored in memory, but that’s in a different part and isn’t really part of the cache.

Synology NAS devices have the option of using an SSD for cache on RAID, not sure if that works for individual disks.

otherwise there are most likely options with LVM for adding a cache device to a slower storage medium.

and if there aren’t, other software exists for both Linux and Windows that gives the ability to run a cache.

that being said, cache is a fairly heavy workload, so you will most likely need to use enterprise grade SSDs or at least consumer SSDs with the highest level of wear endurance.
there is a reason memory is most often used for cache.

your hdd will also have a cache, and ofc more is better… a 20TB drive like the one in your example will usually have around 512MB of cache, if not more.

the latency improvement from running with a cache isn’t that amazing tho… sure it does speed things up by a lot from a local perspective…

but stuff is uploaded and downloaded over the internet, so that will add tens of milliseconds to the latency no matter what… while a 7200 RPM hdd has an average rotational latency of about 4.2 milliseconds, and a 5400 RPM disk about 5.6 milliseconds, plus several milliseconds of head seek time…
ofc as workloads increase this number can go up radically.

and an ssd is down in the tens of microseconds, so really from a customer perspective they won’t feel a huge difference.
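
As a rough check on this comparison, average rotational latency is half a revolution, and it can be set against an assumed network round trip. The function names and the 50 ms network figure below are illustrative assumptions, not measurements from a Storj node, and head seek time is left out.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector to come under the head: half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def disk_share_of_latency(disk_ms, network_ms):
    """Fraction of total request latency that is spent on the local disk."""
    return disk_ms / (disk_ms + network_ms)


NETWORK_MS = 50.0  # assumed internet round trip, illustrative only

for rpm in (5400, 7200):
    disk_ms = avg_rotational_latency_ms(rpm)
    share = disk_share_of_latency(disk_ms, NETWORK_MS)
    print(f"{rpm} RPM: {disk_ms:.2f} ms rotational, {share:.0%} of total")
```

Half a revolution works out to about 4.2 ms at 7200 RPM and 5.6 ms at 5400 RPM. Under the assumed network delay, the disk is a small slice of what the customer sees, which is the point being made here.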

ofc cache will save a ton of io load from the hdd, which is always nice… for tons of reasons.
but that is a storage solution thing, doesn’t really have anything to do with the storagenode software.

ZBS · August 6, 2022, 5:52am

Linux already does this using RAM and the swap partition on disk, and it is not limited to hard drive operations.

Like @SGC says, other filesystems and virtualization layers already have their own cache system, and all good drives also have a hardware cache inside (256MB / 512MB).

So in my opinion adding another layer of cache inside the node software is overkill. We would end up caching things that were already cached by other layers… a lot of IOPS for nothing.

SGC · August 7, 2022, 6:00am

don’t get me wrong… cache is amazing for Storj’s workload.
it significantly reduces the hdd workload.
many people use all kinds of different extra caches on SSDs to help mitigate Storj IO, with great results.