Node disqualified?

One word: suspension.
Don't disqualify it for a few hours of there being an issue. We are all single-person operators; we can't babysit these things 24/7. It's the fact that there is no path to fix the issue. And I have had reliable nodes since V1 of the network and from day 1 of the v3 implementation. So what's the point of a reputation if it means jack squat?

There are really a lot of posts saying that NFS is a bad idea, and people have had a lot of problems with it.

Well, that's the problem with one-word answers: they skip over the details. What would you suggest the conditions should be to exit that suspension, now that the satellite knows the node has lost data? Only a fraction of data is ever audited; auditing all data would be more expensive than disqualifying these unreliable nodes. And additionally, the feeling you're having right now is making damn sure you won't make this same mistake again. Yes, it might also cause some SNOs not to try again at all, but the supply side has never been a problem for Storj, so for every SNO that leaves, several others will join and take over.

So how would you go about ensuring that a node actually has all the data back?

That is why I left my old node intact, so I could revert back to it. I was not expecting that being able to roll back would not be an option, you know, like it is with everything else. I am also loving how the first thing this community does is go straight for "well, you should have read the forums." I followed the official instructions from Storj, and that should be enough. So yes, it is on them. And as for not reading the forums: "ignorance is no excuse for the law" is a terrible policy. Perhaps I should also have searched the forums for every version of software and every piece of hardware in my entire network. I had no reason to believe that NFS would cause this type of issue with this specific piece of software. And again, I am not blaming Storj for my NFS not working. It's the fact that there is no way to fix it. "Oops, looks like you have a flat tire, go buy a new car" is not acceptable.

Again and again, I think this project is a great idea, but this implementation is flawed in ways that should never have made it out of alpha. But nope, all of this is my fault because I did not know that a node could be disqualified for something so small. Again, a suspension (aka no new files until data is restored and verified) is not unreasonable. I can see how this issue could happen even outside of my failed NFS experiment. Another option, you ask? Well, if the node software detects that files are missing or corrupted, it shuts itself offline until you fix the issue? Wow, who would have thought of that feature in a piece of software that requires data integrity? Yes, I am completely wrong for expecting that level of fault tolerance.

You can’t really do that… because the data stored on the network changes often.

That is actually an easy one: checksums. It's how ZFS checks data integrity within its RAID. And the node operator would not receive any compensation for files that have not been verified after such a disaster recovery. And like I suggested earlier, some fault tolerance in the software to realize that something is wrong and turn the node off until it is fixed; then the satellites would only detect that the node had been down for a few hours. And if I am not mistaken, it takes a week of downtime before a node is dropped. Hell, when I first saw that there was an issue like that, I was fully prepared to write a script to check for that issue with NFS and add that exact functionality. But learning that this is a permanent issue, and that I couldn't start my old node and let it run while I fixed this, has left me pretty salty.
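
Just to make the idea concrete, here is a rough sketch in Go of the kind of self-check I mean. It assumes pieces are plain files and that a SHA-256 was recorded for each piece somewhere; the checksums.txt manifest and the paths are made up for illustration, this is not how storagenode actually stores things:

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
	"path/filepath"
	"strings"
)

// verifyPiece recomputes the SHA-256 of a stored piece and compares it
// to the checksum that was recorded when the piece was written.
func verifyPiece(path, wantHex string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("missing or unreadable: %w", err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return fmt.Errorf("read failed: %w", err)
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantHex {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	storageDir := "/mnt/storj/storage" // hypothetical data directory
	// Hypothetical manifest with "sha256hex  relative/path" per line.
	manifest := filepath.Join(storageDir, "checksums.txt")

	f, err := os.Open(manifest)
	if err != nil {
		log.Fatalf("cannot open checksum manifest: %v", err)
	}
	defer f.Close()

	var bad int
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		if err := verifyPiece(filepath.Join(storageDir, fields[1]), fields[0]); err != nil {
			bad++
			log.Printf("piece %s failed verification: %v", fields[1], err)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("error reading manifest: %v", err)
	}

	if bad > 0 {
		// In the suggested design the node would refuse to serve traffic
		// until the operator restores the missing pieces.
		log.Fatalf("%d pieces failed verification; refusing to start", bad)
	}
	fmt.Println("all pieces verified")
}
```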

If that were the case, then I should not have been able to use rsync to migrate my files to a new file server to begin with. It should have been as simple as syncing the changes from the new location (from while it was working) back to the old location and starting the old node back up. If not, then every node would have to be a perfectly running set of hardware that never has any issues, which is just not practical.

Please elaborate: how can the storagenode figure out that the data folder is missing?
The database was on the missing data folder too, as was the config file…
When the data folder disappeared, the storagenode failed and was restarted by docker.
Now it sees neither its config file nor the database.

From the storagenode's point of view it's a clean setup with an empty folder (the mountpoint), and it starts from scratch…

Suspension mode is a good suggestion from the SNO's point of view, but a very bad one from the satellites' and customers' points of view:

  • your node doesn't have the requested files, but customers can still request them and will get a "file not found" error;
  • it can trigger the repair job, because all pieces on your node will be marked as temporarily unhealthy;
  • if the repair job is triggered, the payment for repair comes out of the satellite operator's pocket, not from the held amount of the suspended node;
  • the suspended node will lose the repaired pieces, because they will be stored on more reliable nodes;
  • the satellite operator will pay for inaccessible data out of its own pocket until you fix the problem or until the week is over.

The current implementation of suspension is used only when your node answers with an unknown error. That affects far fewer nodes than it would if "no such file" errors were taken into account as well.
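
Very roughly, the split looks like the sketch below. This is not the real satellite code, just an illustration with invented names and a toy weighting; the actual reputation model is more involved:

```go
package main

import "fmt"

// AuditOutcome is a simplified classification of what a node returned
// for an audited piece. Names are invented for this sketch.
type AuditOutcome int

const (
	OutcomeSuccess      AuditOutcome = iota
	OutcomeFileNotFound // known failure: the piece is provably missing
	OutcomeUnknownError // unexpected error: the cause is unclear
)

// Scores holds two separate reputation values: the audit score that can
// lead to disqualification, and the unknown-error score that can only
// lead to suspension. The 0.95 weighting is a toy value, not the real one.
type Scores struct {
	Audit   float64
	Unknown float64
}

func (s *Scores) Apply(o AuditOutcome) {
	const lambda = 0.95
	switch o {
	case OutcomeSuccess:
		s.Audit = lambda*s.Audit + (1 - lambda)
		s.Unknown = lambda*s.Unknown + (1 - lambda)
	case OutcomeFileNotFound:
		// Missing data hits the disqualification score directly.
		s.Audit = lambda * s.Audit
	case OutcomeUnknownError:
		// Unclear errors only hit the suspension score.
		s.Unknown = lambda * s.Unknown
	}
}

func main() {
	s := Scores{Audit: 1, Unknown: 1}
	for i := 0; i < 10; i++ {
		s.Apply(OutcomeFileNotFound)
	}
	fmt.Printf("after 10 missing-file audits: audit=%.2f unknown=%.2f\n", s.Audit, s.Unknown)
}
```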

As this happened to @dambergn as well as several others,

I have not read the node setup instructions for some time, but maybe they should state that only directly attached drives should be used, not network-connected ones.

Ok, yes, I can't come up with a perfect implementation for the suspension off the top of my head. I agree it needs some work, and I am sure it's something that has already been considered before. However, if you have a node that is being asked for files and keeps responding with "oh, I don't have that file", then after so many such requests, or so many in a row, it should realize that something is not right and turn itself off. And if the config is stored with the data and is missing, perhaps it could pause during start or restart and ask the SNO whether they would like to create a new node? My answer would have been "no", and I would have started investigating what was wrong. These are just off the top of my head; with more time to think about it I could probably come up with more or better solutions.
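
Even something as dumb as the sketch below would have saved my node. This is just to show the idea; the threshold, the error check, and the paths are all made up, and a real node would obviously need something smarter than os.Exit:

```go
package main

import (
	"errors"
	"log"
	"os"
	"sync/atomic"
)

// missThreshold is an invented value: after this many consecutive
// "piece not found" responses, the node assumes its storage is broken
// and stops serving rather than keep failing requests and audits.
const missThreshold = 20

var consecutiveMisses atomic.Int64

// servePiece is a stand-in for the download handler.
func servePiece(dataDir, pieceID string) ([]byte, error) {
	data, err := os.ReadFile(dataDir + "/" + pieceID)
	if errors.Is(err, os.ErrNotExist) {
		if n := consecutiveMisses.Add(1); n >= missThreshold {
			log.Printf("%d consecutive missing pieces, shutting down so the operator can fix the storage", n)
			os.Exit(1) // in this sketch the process just exits; a restart policy would need to be handled too
		}
		return nil, err
	}
	if err != nil {
		return nil, err
	}
	consecutiveMisses.Store(0) // a successful read resets the counter
	return data, nil
}

func main() {
	// Example use only; in a real node servePiece would be called per request.
	if _, err := servePiece("/mnt/storj/storage/blobs", "example-piece"); err != nil {
		log.Printf("request failed: %v", err)
	}
}
```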

Here is a suggestion for the suspension case: if a node is in suspension, clients automatically get sent to another node, as if the suspended node had been offline for a week. While the SNO may lose some contracts while they are repairing their node, they won't lose all of them, and it gives them an incentive to restore the node as quickly as possible.

By the way - another workaround against disappearing storage is to store the identity alongside the data. Without an identity the storagenode will definitely fail. Even if it is restarted by docker, it will keep failing.

I see you've moved up to two-word answers. But… either the satellite does this, which is basically doing an audit on all pieces, which I already mentioned would be more expensive than disqualifying, or the node does it itself, in which case why should the satellite ever trust that result? I would definitely compile my own node with modifications to always say everything is fixed!

I agree with this part though. I don’t think it’s right for the node software to assume the node is new and start over as a new node when there is no data in the storage location. Instead it would be better to have separate setup commands that need to be run prior to the first run.

But in the end, you did a daring procedure that you yourself called an experiment, without doing the appropriate research beforehand and without monitoring the logs and audit scores afterwards. You knew there was risk involved, but didn’t take appropriate precautions. Just a little bit of research (or common sense) would have told you that going back to the previous hardware wouldn’t be an option after the new node has been on, as it would immediately start receiving data that the old node doesn’t have. At that point disqualification would have quickly become inevitable to begin with.

There are things the software can do better. But you came in here blaming Storj before owning up to, or even acknowledging, your own mistakes in this process. I'm with you if there are good improvements to be made to the software, but I'm not seeing great suggestions on that yet. And while blowing off steam might feel nice, it's not going to lead to productive results. So now that we've been over what went wrong, let's focus on what both sides can do better.

According to the documentation, the identity needs to be in ~/.local/share/storj/identity/storagenode, which is where mine was stored. But I would have opted to keep it in my storj folder so as to have everything together; that's where I keep my backup of the identity to begin with. So yes, that would make a good solution for what I have suggested. But it's not set up that way by default, nor do the setup instructions suggest anything like that, so the node is kind of set up to eventually fail from the start.

Yep, I use this trick as well, but unfortunately both workarounds don't really work if the storage disappears while the node is already running, unless that leads to a FATAL error that restarts the node. That may be the case; I've never been in that situation, so I don't know. Either way, I like the idea of separating setup from a normal run, so that a normal node run wouldn't start if there was no config.yaml there. And perhaps even monitor the folder to see if it disappears. I do agree that mounting issues are a little too often the cause of nodes failing. It's usually user error, so disqualification isn't entirely unreasonable, but it would be better if it could be prevented.
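
To sketch what I mean by both parts (a startup check plus monitoring the folder while running): this is only an illustration, the path, interval, and behavior are invented, and it's not how storagenode works today:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// watchConfig periodically re-checks that the config file is still there.
// If the underlying mount drops out from under a running node, the file
// vanishes and the node stops instead of silently failing requests.
func watchConfig(dataDir string, interval time.Duration) {
	cfg := filepath.Join(dataDir, "config.yaml")
	for {
		time.Sleep(interval)
		if _, err := os.Stat(cfg); err != nil {
			log.Fatalf("config %s no longer reachable (%v); storage probably unmounted, stopping", cfg, err)
		}
	}
}

func main() {
	dataDir := "/mnt/storj/storage" // hypothetical mountpoint

	// Startup check: refuse a "normal run" if the config is not there,
	// instead of treating an empty mountpoint as a brand-new node.
	if _, err := os.Stat(filepath.Join(dataDir, "config.yaml")); err != nil {
		log.Fatalf("no config.yaml in %s; run setup first or fix the mount: %v", dataDir, err)
	}

	go watchConfig(dataDir, 30*time.Second) // invented interval

	// ... the rest of the node would run here; this sketch just blocks ...
	select {}
}
```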

Could you please point out where it says that? Because it’s not true at all. My identity for all three nodes is on the HDD that also has the data.

I haven't seen where it says that either.

I followed the official instructions in the documentation for migrating a node. So it is my fault that this is a known risk when doing this, but it is not mentioned there? I don't know about you, but I usually turn to the forum when there is a problem or there is no official documentation on how to do something. But there was documentation, and at no point did it warn about not being able to go back to the old node (which I think should still be possible), or that a node could be disqualified for just a few hours of running with a misconfiguration. Again, I own my mistake with NFS; even so, I have used it in similar ways for other projects without issue, so I had no reason to doubt it in this configuration at the time. It's how easy it was for this to break in a way that is unrecoverable. And even with the current way things work, even if I had known about these issues and risks, there are still many other ways this could have gone wrong, and my only option would have been to leave the node on a file server that is in desperate need of a rebuild.

And while everyone has been asking me for suggestions, no one has offered a "well, if you do this with your node first, you can recover if something goes wrong" or "I would have done this differently, even though the documentation said to do that". What I am mostly getting is "you should have done more research", and I did, in the official documentation. Forums are a good resource, but I had no reason to believe I didn't have all the information I needed to give this a try. Perhaps in hindsight I should have come here first and said "hey, this is what I am doing, should this work?", but all I would have gotten from that is "search the forums" or "just google it". That is usually why I don't rely on forums; they are kind of a last resort for me. Like now.

I am using Linux, by the way.

But the migration itself wasn't your node's problem.

Yeah, it might help if more things were warned about in the documentation…

Here is a question: are we able to set up multiple nodes now, or are we still restricted to just one? Because if I could have made a second, smaller node, I could have done a test with that.

I mean, even with everything I have learned from this, starting over with the held amount now lost makes me reluctant to try again.