Running multiple nodes on one pool (RAID)

Originally I already had the pools set up on an existing server, so I just decided to give it a shot. Running multiple nodes was mainly for manageability later on, as I suspected the pool might not be the best option in the long run. That, and I would also have the ability to run them in different locations later on, as I have access to several internet sources.

As for moving the databases to an SSD, I may still do this in the future. Right now they seem to be running well enough on separate drives, but assuming Storj continues to grow I could see this becoming far more beneficial down the road. At the moment, Storj nodes don’t make anyone a ton of money, even for those willing to throw PBs of data at it, but if/when this changes I have every intention of optimizing my systems as needed in any way I can.

By the way, I probably also read this when I first set up the nodes, leading me to believe it was fine at the time.

Here it specifically mentions using RAID arrays. Although it says it’s not recommended, the only downsides mentioned are related to the nature of RAID itself; it says nothing about IO limitations or any other issues with running nodes on a RAID array. Just pointing this out, as it should probably be changed or taken out altogether given how poorly it performs when running multiple nodes.

It is still fine, just only useful when you use one node per HDD (like you reconfigured), because this way they work as a RAID at the network level: with one disk failure, only part of the common data is lost, not everything.
But when you set them all up on one disk (or pool), this makes no sense: if the disk (pool) gets corrupted, all nodes will suffer, and that is even before we consider competition for the same resource (disk/pool).

This is specifically what I’m referring to though. Using RAID is mentioned in the section about setting up multiple nodes, which you say is not good, and I completely agree, but since many people seem to be using RAID arrays I would think the issue of IO limitations, especially with multiple nodes, should be more strongly emphasized. I often see people talking about RAID in the sense that it’s not NEEDED for Storj, but they make no mention of the performance penalty when running multiple nodes, which I would imagine most people pooling their disks together are probably doing.

I’ve run many RAID arrays over the years and never lost a single one. I have however lost multiple hard drives along the way, so from a data integrity point of view it’s the obvious choice. Storj however introduces a whole new concept to the IT infrastructure, and telling someone in IT that “hey, you don’t need RAID. It’s totally cool if you lose our data.” well… it kinda takes a minute to get used to. So yes, I think it’s important to point these things out clearly, especially for those of us who use RAID as common practice.

That’s what I’m thinking too.
Now I have 2x16 TB in RAID1 with an SSD read/write cache in a Synology, and I use it mixed for Storj and NAS use. The storage capacity for Storj is 8 TB right now.
From a Storj point of view, I needlessly spin the second disk. However, if I disassembled the RAID and migrated to a single disk and that disk failed, it would take years to refill 16 TB.
I’m thinking of buying a 10 TB disk and moving the node to it, and when it’s full, I’ll buy another 8 TB to limit the money at risk. Large disk, large risk.

There is another side to it: you can improve node performance through caching, both read and write. On ZFS that would be ARC and L2ARC (for lookups and fetches) and a SLOG for sync writes (databases). These cache virtual devices are attached to a pool. Therefore, if you have multiple nodes on a single pool, you just need one set of devices attached to that pool to accelerate all of them. With nodes on separate pools, each pool will need its own cache devices. This multiplies the cost.

Along the same line, if I already have a fast and responsive pool for other reasons, running storj on a separate drive would be counterproductive — my pool does not benefit from extra storage and storagenode does not benefit from caching.
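To put a rough number on that “multiplies the cost” point, here is a trivial back-of-envelope sketch; the device prices are made-up assumptions, not quotes:

```python
# Cache hardware needed: one shared pool vs. one pool per node.
# Prices are illustrative assumptions only.
SLOG_COST = 80    # small power-loss-protected SSD for sync writes (assumed)
L2ARC_COST = 60   # read-cache SSD (assumed)

def cache_cost(num_pools: int) -> int:
    # Each pool needs its own set of cache devices attached to it.
    return num_pools * (SLOG_COST + L2ARC_COST)

nodes = 5
print("one shared pool:  ", cache_cost(1))      # 140
print("one pool per node:", cache_cost(nodes))  # 700
```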

Yes, but now you introduce another point of failure, specifically with a write cache. If it fails you lose some of the data on all nodes running on the pool, and especially if it’s database files, I can’t imagine that would be too good. Might be relatively rare, but still a possibility.

Also, and this probably won’t be much of an issue until Storj really takes off, but Storj data tends to come in and out at a pretty steady rate, which means there’s really no “down time” for the disks. A cache can help to smooth out those highs and lows, but eventually you’re still going to reach a point where the disks can’t keep up and the cache will fill up. Not to mention the unpredictability of read data can lead to the read cache being constantly rewritten, making it less effective as well as shortening the life of the cache drive.
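To illustrate the “cache eventually fills up” point, a toy model with made-up numbers (the rates and sizes are assumptions, not measurements):

```python
# Toy model: a write cache in front of disks that can't quite keep up.
CACHE_GB = 256        # SSD write cache size (assumed)
INGRESS_MB_S = 60     # sustained incoming writes, MB/s (assumed)
DRAIN_MB_S = 40       # what the backing disks can flush, MB/s (assumed)

backlog = INGRESS_MB_S - DRAIN_MB_S
if backlog <= 0:
    print("disks keep up; the cache never fills")
else:
    hours = CACHE_GB * 1024 / backlog / 3600
    print(f"cache is full after ~{hours:.1f} hours of sustained load")
```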


You bring up a few interesting points.

SLOG is a “separate log device”; it’s used exclusively by sync writes. If it fails, it’s harmless. The file system will see that the device is not accepting new writes, stop using it, and degrade the pool. You will suffer a period of worse performance while you are replacing the failed SLOG, but no data loss will occur. The only opportunity for data loss is if a power loss occurs and then the SLOG fails during boot. You might lose the last few transactions, but that’s extremely unlikely with a redundant PSU and a UPS. And yet, still harmless in the grand scheme of things.

This is far into the future, seeing how a 3 TB node receives about 300 kBps of average traffic today, but let’s say you do get 50 MBps of random traffic from Storj customers to your node. Then I’d argue the RAID array with caches will handle that workload much better than a single HDD: the random access performance of a single drive is horrific, so you pretty much need some sort of caching to absorb that. Which brings us back to: do you want to buy 5 sets of cache drives or one? (And if you have multiple nodes on multiple HDDs, they share traffic anyway, so it’s no different in terms of workload compared to an array.)
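Rough arithmetic behind that claim; the piece size, seeks per piece, and HDD IOPS are all assumptions on my part:

```python
# Can a single HDD serve a hypothetical 50 MB/s of random piece traffic?
TRAFFIC_MB_S = 50     # hypothetical future load from the paragraph above
AVG_PIECE_MB = 1.5    # assumed average piece size
SEEKS_PER_PIECE = 3   # assumed: metadata lookup + data read + housekeeping
HDD_IOPS = 150        # roughly what a 7200 rpm drive can seek per second (assumed)

pieces_per_s = TRAFFIC_MB_S / AVG_PIECE_MB
ios_per_s = pieces_per_s * SEEKS_PER_PIECE
print(f"~{pieces_per_s:.0f} pieces/s -> ~{ios_per_s:.0f} random IOs/s, "
      f"vs ~{HDD_IOPS} IOPS the disk can deliver")
```

Under those assumptions a single drive is already close to its seek budget, which is exactly where a cache that absorbs the metadata lookups starts to matter.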

Absolutely. But since newer data has a higher probability of being retrieved than old data, the cache will still be helping. Actually, it does not even have to be huge, just big enough to fit the metadata (lookup tables, directory structures, file bitmaps, what have you). There is probably not much benefit in caching the actual data.
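For a sense of scale, a metadata-only cache really can be modest; the per-piece overhead and average piece size here are assumptions:

```python
# Rough sizing of a metadata-only read cache for one node.
NODE_TB = 10
AVG_PIECE_MB = 1.5      # assumed average piece size
META_PER_PIECE_KB = 4   # assumed filesystem metadata per piece (dnode, indirect blocks)

pieces = NODE_TB * 1024 * 1024 / AVG_PIECE_MB
meta_gb = pieces * META_PER_PIECE_KB / 1024 / 1024
print(f"~{pieces / 1e6:.1f}M pieces -> ~{meta_gb:.0f} GB of metadata to keep hot")
```

A small SSD covers that comfortably.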

As for the life of a cache drive, I don’t care much about it; they are disposable. If winning races requires a cache, then I have to use a cache, regardless of whether the storage is an array or multiple nodes.

Tangentially, for a lot of users (if not all), hosting storage nodes makes sense if you have free space to share, to somewhat offset costs and feel good about contributing to the project; that was the original intent. So this pretty much rules out single drives: I have no use for single-drive volumes other than Storj, so an array is not really a choice.

If that is the case then cool, I was not aware that this could be done without any data loss. Not sure how that works though. I thought if there’s data waiting to be written to the pool and it fails, you would lose that cached data.

Edit:
So I looked this up, and apparently if you lose a SLOG you lose the intent log, which then starts being written directly to disk, and the data lost depends on the sync timeout / interval. Although this could be negligible, I’m not sure how this would affect node operation.

Maybe… I hope it’s not that far off though. Growth of tech usually tends to happen exponentially. My largest node is about 11 TB at the moment and usually sustains between 4 to 8 Mbps of combined IO all the time. I also have the advantage of having a few IPs on my data connection, so I was putting a little more load on the array. Although it did usually handle it ok, disk usage hovered around 40-50%, and the heavy IO times I mentioned earlier would often peg it at 100%. So sure, it would definitely benefit from cache drives, but for how long, especially if people have pools of 60+ TB? I still think it would probably be better to use pooled SSDs for the database files, since that moves all that IO completely off the arrays. Plus, that method can be used for either arrays or single disks.

From my understanding (not a database expert here), flat files don’t contain lookup tables or metadata; the whole file needs to be scanned, written to and then saved, so all the database files would need to be cached anyway. As for caching the actual data, no, that would be pointless. Sure, it might work if you only have a 3 TB node and, say, a 1 TB cache, but once you have a lot of data on it, that mostly randomly read data will chew right through a cache drive.


This should probably be moved to a new thread. I assume a mod would have to do that? I’d be curious to know if anyone has compared these two options on larger arrays. If anyone knows of any other discussions about this please point me in that direction!

Correct, the performance would degrade (if it would not, you would not have needed the SLOG in the first place). But since this degrades the pool, you will get notified and can fix it by replacing the failed device.

Yep, the idea being that this removes the need to actually write out the synchronous transfers right away, stashing them instead on the fast device, just in case, to satisfy the synchronicity promise. It’s never read from under normal circumstances, only after an abrupt power failure, which is a rare event in itself. Also, the amount of data is small (a transaction group’s worth, which is about how much data the array can absorb in 5 seconds), and using (often much) larger devices prolongs their endurance. Frequently small Optane drives are used, which have a ridiculous amount of write endurance. Ultimately, I would not worry about a failing SLOG.
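To show how little space that actually is, a sketch assuming a 1 Gbit link, the default 5-second transaction group interval, and a few groups of headroom:

```python
# Rough SLOG sizing: it only has to hold a few transaction groups of sync writes.
LINK_MB_S = 1000 / 8   # 1 Gbit/s link, worst case everything is a sync write
TXG_SECONDS = 5        # default ZFS transaction group interval
TXGS_HELD = 3          # headroom, assumed

slog_gb = LINK_MB_S * TXG_SECONDS * TXGS_HELD / 1024
print(f"~{slog_gb:.1f} GB of SLOG is plenty here")   # ~1.8 GB
```

Which is why even a small drive is massively oversized for the job, and that oversizing is part of what buys the endurance.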

This is interesting to know. I was wondering how it scales with node size and IP count.
On the other end of it, the use cases where people attach a USB drive to a Raspberry Pi will likely get ruled out pretty quickly; these setups simply won’t be able to keep up, losing both upload and download races.

Perhaps, yes. Anything that offloads IO from the data array would be worth doing. And these databases are not really critical; the downside of losing them is purely cosmetic. I wonder if it would be possible to completely turn them off in the first place (and keep whatever session-specific housekeeping in RAM). Or keep the databases on a ramdisk.

I agree, unless that small bias towards newer data being used more frequently has any real world impact.


This I was not aware of. Although I dabble in programming here and there, I am by no means a programmer and am not familiar with how the Storj software actually works on that level. I was under the impression that the databases are critical, as databases typically are in most cases. If losing any or even all of the database files doesn’t cripple the nodes and cause data loss, then I suppose that negates my concern about that.

This might actually be a really good alternative, and I like it, but I’m not sure it would give much advantage over the other options, since (for the sake of system reboots) the databases would still need to be stored on faster drives to load into and out of RAM, or it would still take some time to read them from disk after a reboot… unless you just don’t keep them at all. Again, I personally don’t know what effect this has on the nodes. I would be totally open to having an option for nodes to use more RAM in order to reduce disk IO if something like that were possible, but of course this would be something Storj would have to implement.

It pretty much seems to scale linearly with both node size and IPs. The few times I’ve calculated the egress, it was always around 10-13ish% of the full node size in egress every month, no matter the node size. Nodes on separate subnets will each get about the same data once they’re vetted. They’re not always equal at the same time, but they average out to be about the same overall, although I can’t speak to different geographic locations.
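Applying that observed 10-13% figure to a few node sizes (the percentages are just the range quoted above, nothing more):

```python
# Monthly egress estimate from the ~10-13% of stored data observation above.
EGRESS_FRACTION = (0.10, 0.13)

for node_tb in (3, 11, 20):
    low, high = (node_tb * f for f in EGRESS_FRACTION)
    print(f"{node_tb:>2} TB stored -> ~{low:.1f}-{high:.1f} TB egress per month")
```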

As for increased disk IO, I would imagine that scales the same way, and the disk reports seem to agree. One thing I’m curious about is whether a single node would have less disk IO for the database files than multiple nodes using the same amount of bandwidth.

The databases are only used for statistics; if you lose them, you may start without databases at all and they will be recreated. You will only lose your history.
But losing more than 4% of the data from blobs will lead to disqualification.


Good to know, thanks for clarifying that.


Then wouldn’t it make sense to make a database backup, say at the start of every month, to keep most of the history?
I think it was under consideration to make database backups on Storj DCS, if I remember correctly. Wouldn’t it be amazing to make database backups to Storj DCS before or after every node update?

I did not see such a feature request on our roadmap or on GitHub, so perhaps it was only an idea, with no real action items.
However, you may implement it right now with Duplicati (it has an integrated scheduler and web UI), or with restic or HashBackup (you will need to configure a cron job).
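For the restic route, a minimal sketch of what the cron job could invoke; the repository URL, bucket and database path are assumptions for illustration, and the usual restic / S3 credentials are expected in the environment:

```python
#!/usr/bin/env python3
# Back up the storage node database files with restic (sketch, adjust paths).
import subprocess

REPO = "s3:https://gateway.storjshare.io/my-node-backups"  # hypothetical bucket on an S3-compatible gateway
DB_DIR = "/mnt/storagenode/storage"                        # assumed location of the .db files

# Relies on RESTIC_PASSWORD, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY being set.
subprocess.run(
    ["restic", "-r", REPO, "backup", DB_DIR, "--tag", "storagenode-db"],
    check=True,
)
```

Since the databases are SQLite files, copying them while the node is writing can produce an inconsistent snapshot, so running this while the node is stopped (or against a filesystem snapshot) is the safer option.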

I think that would be amazing.


Since the database files are only for node statistics and are recreated by the node if lost, I don’t really see this solution being relevant to the stated issue it’s trying to solve.

My understanding is that the history remains lost once it’s gone, which means all the previous months, including the earnings data, etc.


Point being, backing up / restoring the database files won’t have an effect on node downtime or disqualification, so I doubt it’s something Storj is overly concerned with.