Is a backup of a node (and its contents) required?

It’s not enforced yet, but it’s in the requirements. It depends on the window size, but still, 5 hours a month sounds terrible and basically guarantees that a storage node will be lost during any prolonged outage. 60 hours a year is a lot better, but it might not be enough if the idea is to keep nodes running as long as possible.
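To put those windows in perspective, here’s a quick back-of-envelope calculation (just a sketch using the figures above):

```python
# Back-of-envelope: what uptime percentage do those downtime windows imply?
HOURS_PER_MONTH = 30 * 24    # ~720 h
HOURS_PER_YEAR = 365 * 24    # 8760 h

def required_uptime(allowed_down_h: float, window_h: float) -> float:
    """Minimum uptime fraction implied by a downtime allowance."""
    return 1 - allowed_down_h / window_h

print(f"5 h/month -> {required_uptime(5, HOURS_PER_MONTH):.3%} uptime")
print(f"60 h/year -> {required_uptime(60, HOURS_PER_YEAR):.3%} uptime")
# 5 h/month -> 99.306% uptime
# 60 h/year -> 99.315% uptime
```

The required uptime percentage is almost identical either way; the practical difference is that a yearly window can absorb one multi-day outage, which a monthly window cannot.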

But the main issue lies with data storage. Graceful exit is impossible if even a single bit of data is corrupted. A node failing audits will be disqualified. So a node is essentially doomed if the HDD storing its data develops even a few bad blocks. The SNO can’t do anything about it, even if there’s a spare drive.

The Windows NT kernel supports mounting volumes on folder paths, so drive letters are not an issue.
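For example, here’s a minimal sketch of doing that programmatically (the folder path and volume GUID are placeholders; the same thing can be done interactively in Disk Management):

```python
import ctypes
import os

# Placeholders: pick your own empty NTFS folder and take the volume GUID from
# `mountvol` (run without arguments) or Disk Management. Needs an elevated prompt.
MOUNT_POINT = "C:\\mounts\\storj\\"   # must end with a backslash
VOLUME_NAME = "\\\\?\\Volume{01234567-89ab-cdef-0123-456789abcdef}\\"

os.makedirs(MOUNT_POINT, exist_ok=True)   # the target folder must exist and be empty

# SetVolumeMountPointW attaches the volume to that folder, so the node's data
# path stays stable regardless of drive-letter assignments.
if not ctypes.windll.kernel32.SetVolumeMountPointW(MOUNT_POINT, VOLUME_NAME):
    raise ctypes.WinError()
```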

However, if you want unlimited expansion, I would suggest looking into the ODROID-HC2, which is what I’m using. It’s effectively a ~$70 single-bay NAS (cheaper per bay than any NAS I can find), and it’s capable of running the storagenode software directly. It expands as far as your power and Internet connection will allow – just buy more units and attach them to your network.

Yes, someone tried to restore from a snapshot; the node was disqualified after a few minutes.

OK, those checks/rules seem to be very harsh towards the nodes and the data they contain…

What happens if I take the node down for an hour and then restart it? How does Storj react after the node comes back online?

There are no penalties for downtime.

Wouldn’t the recovery of a node also be considered downtime? Let’s assume I have a node that goes kaput, but I have a backup in place that automatically detects said failure and virtually boots the node from the backup files (Veeam, StorageCraft and others can do this, and it’s fairly quick).

The node would be back online in, let’s say, 3 hours (rebuild time?). In this scenario nothing would have happened to the files on the node, and it would rejoin the network?

Downtime = no penalty at the moment! But in production, the node operator gets punished when the node is offline for more than 5 hours a month?
Data corruption = the node operator gets punished… restart from scratch…

Then again… hard drive failures are more common than other hardware or VM malfunctions :stuck_out_tongue:

Consider this scenario:

  1. Node is working correctly.
  2. Node is backed up.
  3. Some files are uploaded to the node.
  4. Node blows up.
  5. Backup is restored and the node is restarted.

The restored node will not have files uploaded in step 3 and will be disqualified for failing audits.

Yep, both requirements imply datacenter-level reliability (with no “scheduled downtime” or “our server blew up, here are last night’s backups” options either).

Who else should get “punished” in your opinion? Storj is not a charity. Customers expect their data to be safe and accessible at any time.

One more point to consider. When you lose your escrow, Storj does not benefit from it. The money is paid to other SNOs for repair traffic. So over the lifetime of your storage node you will also have received some of the escrow money from other SNOs. That should offset some of your loss.


You got me wrong, I don’t have anything against penalties for SNOs. I’m just asking questions so I fully understand what I’m signing up for if I start a node.

I’m coming from Bitcoin. With Bitcoin mining you have no stake in the coin itself. You have hashing power, and in case of failure you lose that hashing power for the time you’re not mining. Your already-earned coins are safe, and no one but you is harmed…

Also, wouldn’t it be a nice feature to have some form of reintegration routine that checks the data on a restored node and repairs the missing/damaged parts by “asking” other SNOs to re-share said data? Saving time, saving download volume (quotas), making the network more resilient?

This could make it easier to keep nodes running as intended.

With Storj we have a stake in the data we store for our customers (or the network).
If something goes kaput, I need to start anew and redownload everything from scratch.

I want to keep my node and data as safe as possible :stuck_out_tongue:

Not really. The big difference is that datacenters need to ensure no data is ever lost, because if they don’t, important data is gone. For a node operator it’s essentially just a temporarily reduced payout. The risk is much lower and easier to overcome. If your node survives 15 months, which most well-managed nodes will easily do, you already get half your escrow back. I personally consider the other half lost anyway; I won’t get it until I stop, which I see no reason to do right now or any time soon.

Sure, a new node would take a while to get vetted and back up to the same kind of income, but I could definitely live with that. Add to that that if you’re running multiple nodes on multiple HDDs, only one of them would see reduced income, and things would even out even more.

In short, you’re just not dealing with the same kind of losses that datacenters do, so it’s ok to take more risk.
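To make the “15 months, half the escrow back” point concrete, here’s a rough sketch of the held-amount math. The percentages are my reading of the published withholding schedule (75%/50%/25%/0% over months 1–3/4–6/7–9/10+), so treat them as assumptions:

```python
# Rough held-amount ("escrow") model. Assumed schedule: 75% of earnings held in
# months 1-3, 50% in months 4-6, 25% in months 7-9, nothing from month 10 on;
# half of the accumulated held amount is returned after month 15.
def held_fraction(month: int) -> float:
    if month <= 3:
        return 0.75
    if month <= 6:
        return 0.50
    if month <= 9:
        return 0.25
    return 0.0

def escrow_after_15_months(monthly_earnings: float) -> tuple[float, float]:
    total_held = sum(monthly_earnings * held_fraction(m) for m in range(1, 16))
    returned = total_held * 0.5          # paid back after month 15
    still_held = total_held - returned   # paid back on graceful exit
    return returned, still_held

returned, still_held = escrow_after_15_months(10.0)  # e.g. a flat $10/month node
print(f"returned: ${returned:.2f}, still held until exit: ${still_held:.2f}")
# With a flat $10/month: $45 held in total -> $22.50 back, $22.50 held until exit.
```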


[User-error anecdote time]

I recently did some downtime work re-organizing some cabling near the home server. Upon rebooting, my ZFS pool did not remount the storj data folder that’s bind-mounted into the container. So my node started with an empty storj folder and was disqualified within a few hours. This is a node that had been running since April 2019 :frowning:. So I fully understand the responsibility that comes with the additional complexity of RAID/HA, and although I have snapshots of previous states of the data folder, I understand that they’re irrelevant now that the node is DQ. I also understand that this is a user error and that it was my choice to use ZFS RAID for my node. As far as Storj is concerned, the outcome of this scenario is akin to losing an entire drive all of a sudden.
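In hindsight, one cheap safeguard against exactly this failure mode is to refuse to start the node when the data directory isn’t a mounted, populated filesystem. A minimal sketch (the paths, marker-file convention, and container name are placeholders for my setup):

```python
import os
import subprocess
import sys

DATA_DIR = "/mnt/storj/storagenode"          # placeholder: the bind-mount source
MARKER = os.path.join(DATA_DIR, ".on-pool")  # placeholder: file created once on the real dataset

def mount_looks_sane(path: str, marker: str) -> bool:
    # ismount is only true when the dataset/volume is actually mounted here;
    # the marker file guards against an empty directory with the same name.
    return os.path.ismount(path) and os.path.isfile(marker)

if not mount_looks_sane(DATA_DIR, MARKER):
    sys.exit(f"refusing to start storagenode: {DATA_DIR} is not mounted or is empty")

subprocess.run(["docker", "start", "storagenode"], check=True)  # container name is a placeholder
```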

Having said that, I definitely support a partial graceful exit. It would be nice if we could recover to a previously backed-up state and have the network scrub the node and repair the missing data. This repair traffic could be charged to the node, but at least there would be a chance to recover the node rather than write it off completely.

The biggest drawback is that SNOs with a blown up node have to start from scratch with a new identity and go through vetting with all satellites again. There’s no credit for reputation built up from the previous node on the same machine.

So given the above, I think the SNO strategy is to run multiple nodes with minimum-sized drives (i.e. <2 TB each) to reach the target total space. This would allow the risk to be spread out across all the nodes, given the all-or-nothing stance on data integrity.
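To put rough numbers on that risk spreading (a simplistic sketch that assumes equal-sized nodes and a single drive failure):

```python
# Same total capacity split across N drives/nodes; assume one drive dies.
TOTAL_TB = 8.0

def loss_fraction(num_nodes: int) -> float:
    return (TOTAL_TB / num_nodes) / TOTAL_TB   # simply 1 / num_nodes

for n in (1, 2, 4, 8):
    print(f"{n} node(s): one failed drive loses {loss_fraction(n):.1%} "
          "of the stored data and its unreturned held amount")
# 1 node: 100.0%   2 nodes: 50.0%   4 nodes: 25.0%   8 nodes: 12.5%
```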

Thoughts?

I ran 14 nodes in a /24 network, with each HDD being one node. When egress was tested, I had up to 150 Mbit on average every day. So it is very much OK to run a node on each HDD.

What I have a bit of a problem with is that it feels like the SNO is being punished for following the recommendations.
In some ways the Storj requirement is even stricter than for a data center. Depending on the server, it is possible to lose some files and the client would only be mildly annoyed, but here losing a few files may disqualify the node.
Even better, if I notice some bad sectors on the hard drive, I am prevented from doing a graceful exit. While the node may survive failing a few audits in normal operation, it cannot fail a single audit during graceful exit.

So, I take it that IP filtering does not actually work then? Or does it only aggregate nodes with the same IP?

It aggregates per file on ingress, but there is no need to aggregate on egress.

So, how much data do your nodes have?

It took a long time. The whole size is about 25 TB, and they hold about 1/3 of that.

I don’t think this part is actually true. There was some ambiguous language in the graceful exit design doc, but I’m pretty sure @littleskunk mentioned in another topic that you can fail a few, just not too many. Of course if a disk that has already started failing needs to finish the whole graceful exit and survive long enough, that may still be a problem.

I understand the recommendations. But it depends on the hardware you have. If you have HDDs up to 4 TB, you’d literally be making more money by running two separate nodes, and a RAID setup would cut your potential profit in half. Only when you get up to more and larger disks does a redundant setup make sense. Otherwise it’s better to make more money and take a little more risk. That’s not being punished for following recommendations; it’s actually the better option. Now, if you have 3+ HDDs of 8+ TB, that equation may change. But I don’t think that’s the most common situation.
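As a concrete illustration of that trade-off (the per-TB payout is a placeholder; only the ratio matters):

```python
# Two 4 TB drives: RAID1 mirror vs. two independent nodes (illustrative only).
DRIVE_TB = 4.0
PAYOUT_PER_TB_MONTH = 1.5   # placeholder rate; only the ratio matters

raid1_tb = DRIVE_TB             # mirrored: half the raw space is sellable
separate_tb = 2 * DRIVE_TB      # two nodes: all raw space is sellable

print(f"RAID1 node: {raid1_tb:.0f} TB -> ~${raid1_tb * PAYOUT_PER_TB_MONTH:.2f}/month when full")
print(f"Two nodes:  {separate_tb:.0f} TB -> ~${separate_tb * PAYOUT_PER_TB_MONTH:.2f}/month when full")
# The mirror earns half as much at capacity; its upside is that one drive
# failure costs nothing, versus losing one of the two nodes outright.
```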


Interesting, as my node only has 4.6 TB with very few failed uploads. It looks like the aggregation does not always work, and running multiple nodes in the same /24 does get you more data.
I wonder if the same holds if the nodes share the same IP.


When there is one file, aggregation works; but if there is more than one file being uploaded, my nodes still only get one piece per file, so more files means more pieces overall. In the future, when there are a lot of clients and a lot of files, aggregation will not be a problem at all.
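A simplified illustration of that /24 rule (a toy sketch of the idea, not Storj’s actual selection code, and the addresses are example values):

```python
import ipaddress
import random

# Toy node list; the addresses are documentation/example ranges.
nodes = {
    "node-a": "203.0.113.10",   # same /24 as node-b
    "node-b": "203.0.113.11",
    "node-c": "198.51.100.7",   # different /24
}

def select_for_segment(nodes: dict[str, str], wanted: int) -> list[str]:
    """Pick at most one node per /24 subnet for one segment's pieces."""
    by_subnet: dict[ipaddress.IPv4Network, list[str]] = {}
    for name, ip in nodes.items():
        subnet = ipaddress.ip_network(f"{ip}/24", strict=False)
        by_subnet.setdefault(subnet, []).append(name)
    picks = [random.choice(group) for group in by_subnet.values()]
    random.shuffle(picks)
    return picks[:wanted]

print(select_for_segment(nodes, wanted=2))  # node-c plus one of node-a / node-b
```

With only one segment in flight, node-a and node-b together see no more ingress than a single node would; with many segments being uploaded, each segment rolls the dice again, so all nodes behind the subnet can fill up over time.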

It seems that you get more data (in total) than others, which is something Storj said would not happen (multiple nodes in the same /24 would get the same amount of data as a single node).