Returned to find my node offline and restarted to no avail

"C:\Program Files\Storj\Storage Node\storagenode.exe" exit-satellite --identity-dir "C:\Users\Smith\AppData\Roaming\Storj\Identity\storagenode" --config-dir "C:\Program Files\Storj\Storage Node" --log.output stderr

It looks like that command is more for CMD, not PowerShell.

What do you mean, “more for CMD”?

CMD is the Windows command-line interface; just press Start on Windows and type cmd.

I’m fully aware that CMD means the Windows command prompt. I wasn’t asking for an explanation of what CMD means; I was asking what you meant by “more for CMD”.

Also, if my use of PowerShell were a problem, I’m sure Alexey would have noticed and told me not to use it, seeing as I put a big screenshot snip of the problem up there, plain for everyone to see.

How long until a new version of the Storj node software for Windows is released and active on my node? You say to wait; is this because the new Storj node software has a more reliable graceful exit (GE)? Did the 1.10.1 Storj node version have bugs that corrupt perfectly healthy data stored on a node during GE? I’m just wondering why you tell me to wait for a new version; an explanation would be appreciated.

This is an excerpt from this post regarding the Graceful Exit bugs that are being worked on.

Did you read the bit above that command where it says (in cmd.exe)? You can use PowerShell if you wish, but not with that command as written.
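
For what it’s worth, the usual stumbling block is that PowerShell treats a bare quoted path as a string rather than a command, so the executable has to be invoked with the call operator &. A sketch of the same command in PowerShell form (paths copied from the post above, so adjust them to your install):

& "C:\Program Files\Storj\Storage Node\storagenode.exe" exit-satellite --identity-dir "C:\Users\Smith\AppData\Roaming\Storj\Identity\storagenode" --config-dir "C:\Program Files\Storj\Storage Node" --log.output stderr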

You have said that your node has been up for 7 months, but the requirement in the guide says 15 months minimum.

Actually, no, that’s not what happened at all. The disc just lost connection for some reason without being physically disrupted. Goodness knows what’s happening here.

You should probably try to find the cause of your disc losing connection, because that’s not normal. A disc that has randomly lost connection to its host device is going to cause all sorts of problems.
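
One place to start looking, assuming Windows recorded the disconnect at all, is the System event log; the disk and Ntfs providers usually log something when a volume drops. A sketch in PowerShell (the provider names are my assumption of where such events land; widen the filter if nothing shows up):

# List recent disk/NTFS events from the System log; errors here often pinpoint the disconnect.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'disk', 'Ntfs' } -MaxEvents 50 |
    Format-Table TimeCreated, Id, Message -Wrap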

Actually, that’s incorrect. It’s not a minimum of 15 months; it has been temporarily reduced to 6 months, so my node is totally capable of performing a graceful exit on its 2 remaining satellites.

Also, the fear that the disc was losing connection has already been addressed and disproved in a previous reply to Alexey.

I’m bumping my question as I have not received a response. When is the new version of the Storj node for Windows going to be deployed that resolves all the known bugs that unnecessarily and artificially cause graceful exit failures?

The update has been out for 10 days; your Windows node should have updated if the updater service is running.

Problem 1:

Problem 2:


Problem 1 was caused by the SNO; own it.

Problem 2 is the result of some faults in the implementation of a feature that allows SNOs to receive income in a process which generally runs at a loss to satellite operators… meaning it costs more for the Storj network to move data off a GE node than the total held amount.

It’s likely that your node will also be DQed on Saltlake soon. That’s what happens when one accidentally ejects the wrong disk without properly shutting down the service trying to access that disk.
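
For what it’s worth, the safe order of operations on a Windows GUI install, assuming the default service name storagenode (verify against your own install), is roughly:

Stop-Service storagenode    # stop the node before touching its disk
# ...safely eject or swap the disk here...
Start-Service storagenode   # bring the node back once the disk is reachable again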

Stefan might not DQ, because that satellite hasn’t really been very active since March 2020… and most of the data was moved off by May 2020.

There’s no appeal process for DQ… it’s algorithmic. The requested data were not found while the node was running; therefore, the algorithm properly DQed the node. If the SNO isn’t careful about the storage node’s storage, that’s not the Storj network’s problem. There are numerous possible approaches to setting up storage locations; the details are left up to the operator.

As far as I understand, the new version being rolled out has features that should catch operator errors such as accidentally ejecting the wrong disk. However, if one is concerned that such a feature is not implemented properly, I’m fairly sure Storj will look at the GitHub PR when you contribute it.

How would I know if the updater service is running? The node web GUI still reports that 1.11.1 is running, so no update has happened here. Unless, of course, the Windows version is somehow lower than the reported version that the web UI was urging me to update to a few days ago.

1.11.1 is the latest version.
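
You can check whether the updater is running from PowerShell. A sketch, assuming the default service names from the Windows installer (storagenode and storagenode-updater; verify against your own install):

# Both services should report a Status of Running.
Get-Service storagenode, storagenode-updater

If the updater shows as Running but the version never changes, its log file in the install directory (if your setup writes one) is the next place to look.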

Own it? Excuse me? How rude! I certainly don’t need to be spoken to in this adolescent and disrespectful manner. That’s hardly a productive answer when I have already covered the way in which this happened. Given the measures put in place, the Storj service should have noticed that the file it writes for lost-storage mitigation was not accessible, and should therefore have temporarily taken the node offline to avoid audit checks against a node that had no storage. That is what was explained should happen, yet it didn’t. So it stands to reason that, even with this particular and totally recoverable user error, the fault was only temporary and not necessarily detrimental, yet Storj treated it as catastrophic and needlessly disqualified the node on most of the satellites. Overzealous or buggy, either way it’s unacceptable for storage node operators.
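
For context, the safeguard being described could be as simple as a watchdog that stops the service the moment the sentinel file becomes unreachable. A minimal sketch in PowerShell, assuming the sentinel is the storage-dir-verification file the node writes into its storage directory and the default service name storagenode (both assumptions; verify the file name and paths on your own node):

# Hypothetical watchdog: take the node offline if the storage directory's sentinel file vanishes.
$sentinel = 'D:\storagenode\storage\storage-dir-verification'  # hypothetical path; point it at your storage dir
while ($true) {
    if (-not (Test-Path $sentinel)) {
        Stop-Service storagenode   # go offline rather than fail audits against missing storage
        break
    }
    Start-Sleep -Seconds 60
}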

How is that in any way the responsibility of, or related to, the storage node operator? If Storj implement a system to gracefully exit, they should make it work as flawlessly as any other operation in their storage node software. It’s their responsibility to make it work as advertised, and if it doesn’t, then there is cause for complaint by any SNO! Again, a wholly unhelpful comment. Storj have decided to provide a graceful exit function, so it stands to reason that Storj should mitigate any risk involved in performing one by auditing their code and making sure it doesn’t needlessly make the Storj network think there is corrupt data held on a node. Running a node normally doesn’t return corrupt data, and a node can run for more than 7 months without once having an audit failure or reporting corrupt data. Funny that, seeing as a graceful exit is pretty much doing the same thing as when the node was first started, moving data across the network, but in reverse. There should be no difference, yet it seems that Storj hold this function in lower regard than actually getting people onto the network! Both the ingress and egress of SNOs should be treated with the same priority by the network; treating them any differently is not only discriminatory but also very bad business practice.

In my case the nodes have been running for over 7 months with no problems; not once has an audit failed or corrupt pieces been reported or detected, so this should carry forward through a graceful exit. It’s quite disgraceful for a system and an outfit such as Storj to have their node software perform pretty much flawlessly in the normal scenario, when data is flowing to and from nodes as they first become active, fill with data and grow, but as soon as a graceful exit is requested, letting the node reduce in size and drop off the network as peacefully as it came on, it opens up a can of worms, or in this case a can of bugs that shouldn’t even be there!

Why is it likely? The audit score is 100% and no data was lost; I gracefully closed files, properly unmounted, and then ejected the disc. How do you come to this bold assumption, given the facts here?

Thank goodness for that. However, I am not a coder, and I leave this in the (supposedly) competent hands of people who know what they are doing with software programming, so I’m very unlikely to contribute anything other than feedback on these user forums about my experiences, positive or negative. Noting the positives and negatives, the bugs and the flawed functions of the node software and the network as a whole, and providing constructive criticism in forums like these, is the only way that Storj are going to improve their SNOs’ experience.

If I have cause to note glaring inadequacies in the behaviour of the node software and the network as a whole, and to make a complaint after months of flawless node performance, then you can bet there are another 100 people standing behind me waiting to say exactly the same things. This isn’t an isolated incident by any stretch of the imagination, and had appropriate, simple-to-implement software safeguards been put in place in the node software, I and many others wouldn’t be crying out for adequate and simple software protection in the first place. These are fundamental, simple checks that should have formed part of the Storj node software from the start. This service has been available for years in various iterations, starting with the previous version and releasing the new version last year, so one would imagine there’s really no excuse not to have implemented these simple safeguards a very long time ago.

I fully understand that turning devices off abruptly, unplugging something physically, or even power outages (which do pretty much the same thing) can corrupt data, and in those instances it would be appropriate to run audit checks. But even then, nodes that have gone offline aren’t audited until they return, and could carry multiple corruptions undetected for any number of months while still being paid. Yet simply disqualifying nodes for temporarily not having access to storage is entirely inappropriate, considering the data has not been proven in any way to be corrupt; it’s simply unavailable for a short amount of time. This strikes many people as a classic “arse about face” configuration, where the one case should swap consequences with the other.

Because the satellite was reporting that audits failed:


We may differ on the definition of “constructive” … I must admit that I sometimes have bad comment days too.

However, both of your problems have been noted in several prior posts on this forum… have already been worked through by Storj… and the fixes are in fact being rolled out in the current version.

So, there’s not much for Storj to do, since it’s already been done.

I’m also running a Storj node on Enterprise-level equipment. I have had zero technical issues with my node for 12 months… even though I’ve had a few sudden power outages and network outages outside my control.

The vast majority of nodes seem to work very well… until their operators, even on Enterprise equipment, accidentally eject the storage drive.

So I just tried a GE on both remaining satellites, but all I got was this…

Afterwards I tried the command again, and both remaining satellites were gone from the list. What’s happening here?

It seems to have completed the GE for the stefan satellite and provided a receipt, but that satellite held pretty much no data anyway. However, the saltlake satellite is showing 0% progress.

What command did you run the second time?

The same command, but now I realise that the exit-satellite command will only list satellites that have not previously been selected for GE.

I have, however, issued the exit-status command a few times to check the progress of the GE on the saltlake satellite, but as yet there is no progress; it’s still showing 0.0%.

If you want to see progress, you need to change exit-satellite in the command to exit-status, keeping the same parameters:

"C:\Program Files\Storj\Storage Node\storagenode.exe" exit-status --identity-dir "C:\Users\USER\AppData\Roaming\Storj\Identity\storagenode" --config-dir "C:\Program Files\Storj\Storage Node" --log.output stderr