Distributed architecture for SNOs for high availability of storage nodes

Cmdrd · May 7, 2020, 8:24pm

I use Ceph for my Storj nodes, and while it isn’t quite as quick latency-wise as local storage on the same disks, I am sitting at approximately a 35% upload success rate and less than 0.01% audit failures. Thanks to that storage resiliency, the only time I take downtime is when I have to reboot after patching the VM hosting the storagenode. I have run through over a dozen cycles of patching hosts, and the Ceph storage stays available throughout, just with degraded resiliency while a host is rebooting. I agree with @Alexey that there are already solutions that cover these areas of resiliency.

This was a question I asked at a town hall last year, because it would be valuable for SNOs with larger amounts of infrastructure, but really only at the application layer, since all the other layers can be abstracted and made more resilient separately. Implementing stateful HA applications with a shared back end is not easy, and it would add a lot of complexity for Storj Labs to maintain in the code base.

I like the idea, but the vast majority of infrastructure resiliency can be achieved at the other infrastructure layers, without having to modify the storagenode.