Is a backup of a node (and its contents) required?

Cool, thank you for the clarification.

Noob question… Is Docker a requirement, or is there a way to install a node directly under Linux?

I’m running an ESXi host, and running Docker inside a VM seems like overkill (like nested virtualization, i.e., a VM inside a VM).

As far as I know, direct install is not supported or documented at this time. I’ve only heard of one user doing it, and I believe they had to build storj from source and write their own script for startup and updates. Docker is much more lightweight than a full VM, from what I hear.

Yes, I know that Docker isn’t a real VM; containers use the underlying OS and share libraries and such… but does it make sense to run this on a hypervisor? Wouldn’t it be better to go bare metal for optimal use of resources?

It seems silly to ask, but:

  1. Metal --> ESXi --> VM --> Dockered applications?

  2. Metal --> Linux --> Dockered applications?

  3. Metal --> Linux + applications?

Isn’t 3 the least complicated solution, especially if the operator should keep the system simple (KISS)?

For a noob and a dyed-in-the-wool point-and-click Windows desktop GUI user like me, it’s kind of hard to wrap my head around.

Sometimes Linux gives me headaches, since it seems to follow the concept of “why be easy when you can have all the fun of being overly complicated?”

Thank you

It does after pieces are deleted, which happens pretty regularly on one of my nearly-full nodes. Check your storagenode log; you’ll probably see at least one upload today.
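If you want a quick way to count them, something like this works. It’s a rough sketch: it assumes you’ve redirected the log to a file, and the path is just an example (on a default Docker setup you’d read from `docker logs storagenode` instead):

```python
# Rough sketch for counting today's successful uploads in the log.
# The log path is an example, not a standard location.
from datetime import date

today = date.today().isoformat()  # log timestamps start with YYYY-MM-DD

uploads = 0
with open("/var/log/storagenode.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if line.startswith(today) and "uploaded" in line:
            uploads += 1

print(f"uploads today: {uploads}")
```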

I do get some deletes, but my remaining disk space is currently negative 42 GB.

I will give the multiple-nodes option a go, but if I run into problems, RAID is my last resort. I had to figure out a way to get around the 26-drive-letter limit in Windows. I am always thinking ahead; I want unlimited expansion, and hitting a roadblock is never a good thing. 26 is not a big number at all, simply because I already have 13 HDDs in the box. Luckily, those are under a single drive letter for my own personal use. I could mount the drives under folder paths with Disk Management; hopefully that works. As for auto-updates, I will set up a script that restarts all nodes at midnight every day. All nodes will use the same exe from the first node install. The SrvMan app makes it easy to create more services.
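Something like the sketch below is all that restart script really needs to be. The service names are made up, and I’d let Task Scheduler handle the midnight trigger rather than have the script sleep on its own:

```python
# Minimal sketch of the midnight restart script described above.
# Service names are hypothetical; scheduling (run at 00:00) is assumed
# to be handled by Windows Task Scheduler, not by this script.
import subprocess
import time

SERVICES = ["storagenode1", "storagenode2", "storagenode3"]  # hypothetical names

for name in SERVICES:
    # "sc stop" / "sc start" are the stock Windows service-control commands
    subprocess.run(["sc", "stop", name], check=False)
    time.sleep(10)  # give the service a moment to shut down cleanly
    subprocess.run(["sc", "start", name], check=False)
```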

I wish we had a way to completely recover nodes from database corruption or minor disk failure. Maybe a process to restore the db from the satellites (why do we even need it on nodes?), hash-check all the stored pieces, and report the results to the satellite?
Then apply some non-lethal penalty proportional to the amount of data lost and be done with it.
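The local half of that would be straightforward to script. A rough sketch, assuming a simplified blobs layout; the reporting step is purely hypothetical, since no such satellite API exists today:

```python
# Rough sketch of the local half of this idea: walk the node's piece
# store and find pieces that can no longer be read back in full.
# The path and layout are simplified, and the "report to the satellite"
# step is purely hypothetical.
import hashlib
import pathlib

def unreadable_pieces(blobs_dir: str) -> list[pathlib.Path]:
    """Return piece files that fail to read back completely."""
    bad = []
    for piece in pathlib.Path(blobs_dir).rglob("*"):
        if not piece.is_file():
            continue
        digest = hashlib.sha256()
        try:
            with piece.open("rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)  # a read error here usually means a bad sector
        except OSError:
            bad.append(piece)
    return bad

# In the proposed scheme, this list (or the computed hashes) would be sent
# to the satellite, so it could repair only the affected pieces.
print(unreadable_pieces("/mnt/storagenode/storage/blobs"))
```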

I feel like everyone would benefit from such a process: satellites would not have to rebuild the data and pay for repair bandwidth, SNOs would not get their nodes disqualified after an unfortunate accident, and SNOs would have an incentive to report data loss right away, so the network is never late to reconstruct and redistribute the data.

I suggested that a few times. I guess implementing it would be too difficult or not worth it. Which makes the official recommendation of not using RAID even more strange.

You can use RAID; however, the simple setup of one storagenode per HDD is much easier for everyone. You do not need to learn anything new or spend money.
But if you want to, you can use RAID; we just can’t recommend it, because it’s more expensive to set up and support.

I guess I am used to different types of recommendations and requirements.

In this case, however, we have:

  1. No way to back up the data.
  2. Pretty much no tolerance for data loss.
  3. No “partial graceful exit”: one bad sector on a 10TB drive during the exit and you’re done, even though normally one failed audit would not disqualify the node.
  4. Hard drives that do occasionally develop bad sectors even if they do not just fail completely.

So, the combination of the recommendations and the requirements is setting up SNOs for failure. The single-drive setups will either be disqualified during audits, or, if they somehow manage to last 16 months or whatever the requirement is, fail during graceful exit.


This is a rather controversial statement I haven’t seen before, but the fact seems to be: it’s most profitable for Storj if we follow those recommendations. It also happens to be the easiest and cheapest setup for SNOs, so long as the network performs up to par. The network does the repairs, and Storj reaps the benefit of collecting escrow from prematurely failed nodes.

It’s a tough balance to deter cheaters and encourage good behavior. Though I don’t really know what a better answer would be.


Unfortunately, Storj does not benefit from failed nodes; the escrow is not enough to cover the cost of repair.

The reason for the suggestion to have one node per HDD is different: we want to have a lot of SNOs, and it’s easy for most users to just set up one storagenode per HDD.

And yet, if there is a problem and I lose 100MB out of 5TB, I cannot do a graceful exit and upload the remaining data back to the network. I have to just shut down the node and let the network repair the whole 5TB. If I don’t shut down the node, I may just be disqualified later with the same result: the network repairing the whole 5TB.

It is the same with backups. Let’s say I run a node on a single drive, but back it up daily, and have 5TB of data. The drive fails, and I cannot just upload the data from my day-old backup; instead, the network repairs the whole 5TB rather than only the pieces uploaded since the last backup.

I agree with wanting to have more SNOs, but it looks weird to me when the recommendation is for a simple setup while the requirements call for datacenter-level reliability, which the simple setup is pretty much guaranteed not to have.

This has been discussed many times already. An HDD can work without issues for about two to three years, and usually longer. The timeframe is 15 months; most HDDs should survive it.
However, it’s up to the SNO to decide: run one node per HDD, or take on the headache of RAID and HA setups.

We can help with both, but we can’t demand this from SNOs.

I guess that’s where my understanding of recommendations differs. The recommendation should include RAID, a UPS, etc., but if people want to cut some corners, they can. If they get unlucky and the drive fails, well, it’s on them.

Now if, say, I didn’t know any better and followed the official recommendations, then the drive failed (or some problem wiped out 100MB of files), I would feel like I was being punished for doing what I was told by the people who told me to do it.
Of course, anyone who knows any better would read the requirements and see that the recommended setup is not going to be enough, so they would prepare accordingly.

I tried to calculate the monetary consequences of early node disqualifications, but it depends on too many variables: time to DQ, vetting period, amount of data stored and added every month, egress traffic activity, the way other nodes holding the erasure-coded pieces left the network, etc. It varies a lot between extremely positive and negative numbers, so in the end I would say the network neither gains nor loses significant amounts of money.
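Just to show the shape of that calculation, here is a stripped-down version with made-up placeholder numbers (none of them are Storj’s actual rates):

```python
# Back-of-envelope version of the escrow-vs-repair comparison.
# Every number below is an illustrative placeholder, not Storj's actual rate.
data_stored_tb = 5.0           # data on the node when it is disqualified
escrow_held_usd = 30.0         # amount withheld from the SNO (placeholder)
repair_cost_per_tb_usd = 10.0  # repair traffic + compute (placeholder)

repair_cost = data_stored_tb * repair_cost_per_tb_usd
network_net = escrow_held_usd - repair_cost

print(f"repair cost:        ${repair_cost:.2f}")
print(f"network net result: ${network_net:.2f}")
# With these placeholders the network loses money on the failure, which is
# consistent with the claim that escrow does not cover repair costs.
```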

But if the network does not benefit from taking escrow and wants as many users as possible, why is there no simple way to protect nodes from disqualification?

One could run ZFS or Ceph: something that verifies data integrity on read and has redundancy available to correct errors when needed. But this is way above a user with a Raspberry Pi in a closet, especially if people are not encouraged to use RAID in order to keep things simple.

This applies to uptime DQ as well: if you’re required to keep downtime below 5 h/month, a single prolonged outage can kill your node. And if storage nodes are expected to run on residential connections in non-datacenter environments, managed by untrained people rather than a team of professionals, such downtimes will inevitably happen.
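For perspective, 5 h/month is a strict target once you do the arithmetic:

```python
# 5 hours of allowed downtime in a ~730-hour month:
hours_per_month = 365 * 24 / 12   # ~730
allowed_downtime_h = 5
required_uptime = 100 * (1 - allowed_downtime_h / hours_per_month)
print(f"required uptime: {required_uptime:.2f}%")  # -> 99.32%
```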

Disqualification for downtime is currently disabled: Design draft: New way to measure SN uptimes
You can run ZFS, Ceph, a cluster, or RAID, if you know how and accept the costs of support.
We can’t suggest such setups to everyone. Even RAID is not a simple setup.

DQ for uptime is currently disabled, but it’s in the specs, so I expect it to be enabled once the monitoring works properly. And 5 hours is a bit on the low side if people go to work and leave the node unattended for 9 or so hours (not every workplace lets you take time off to fix a node), or for days when going on vacation.

The fact that graceful exit fails on a single damaged piece (so, even stricter than regular audits) does not square with the statement that repairing pieces costs more than the escrow. In that case, wouldn’t you want to recover as much data as possible from the damaged node?

All in all, the gap between the recommended setup and the reliability expected from that setup is the main point here. With the recommended setup, the expectations should be lower.

Yes, expecting RAID and other advanced solutions from new non-professionals is way too much to ask.

But let’s be reasonable: most of us are somewhat professionally involved in tech… This project isn’t aimed at non-pros, is it?

I don’t think an RPi with a USB-attached drive is a good idea anyway…

Has anyone ever tried to back up a node and restore it?
I mean an image-based, agentless backup for a small node. A restore should not take that long, and if the machine comes back online, will the network accept its contents?

There needs to be a reintegration routine… Is that planned for the final release?