Design Draft: Storage Node "Suspended" State

deathlessdd · February 13, 2020, 8:12pm

Is there going to be a reason shown on the dashboard explaining the issue, something like "because … is an issue" or "You need to fix this soon or you will be disqualified"? Or is it going to be on the storage node operator to figure it out?

moby · February 13, 2020, 8:12pm

Ok, ideally we would be reporting this information almost the same way as DQ, so it would be monitor-able. As we discussed above, we are also considering an email notification component. I’m not sure about the difficulty of implementing that yet.

moby · February 13, 2020, 8:14pm

Since we do not save audit errors explicitly on the satellite, it would be largely up to the storagenode operator to look through their logs and figure it out for themselves. There is a chance we would be able to check the satellite logs to help, but that solution might not scale very well or be easily automated.

moby · February 13, 2020, 8:15pm

Yes, the normal audit reputation (success vs. failed) will be the same as before. The node will retain its vetted status as well as any pieces that were not replaced by repair during the suspended state.
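A minimal sketch of what "the same as before" could look like, assuming a simplified node record. The `NodeRecord` type, its fields, and the three helper functions are hypothetical names made up for illustration; they are not the satellite's actual schema or code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeRecord:
    # Hypothetical, simplified view of what a satellite tracks for a node.
    audit_success: int
    audit_failed: int
    vetted: bool
    suspended: bool
    pieces: frozenset  # pieces the satellite still attributes to this node


def suspend(node):
    # Entering suspension only flips the flag; reputation is untouched.
    return replace(node, suspended=True)


def repair_while_suspended(node, replaced):
    # Repair may move some pieces elsewhere while the node is suspended.
    return replace(node, pieces=node.pieces - frozenset(replaced))


def reinstate(node):
    # Leaving suspension again only flips the flag.
    return replace(node, suspended=False)


before = NodeRecord(audit_success=980, audit_failed=2, vetted=True,
                    suspended=False, pieces=frozenset({"p1", "p2", "p3"}))
after = reinstate(repair_while_suspended(suspend(before), {"p2"}))
```

In this sketch `after` has the same audit counts and vetted flag as `before`, and only the piece replaced by repair (`p2`) is gone.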

nerdatwork · February 13, 2020, 8:16pm

Many SNOs have opted to use disposable and/or fake emails in their setup, but every SNO is using the dashboard. I would recommend giving a notification in the web dashboard and also showing a RED warning text in the CLI dashboard.

moby · February 13, 2020, 8:18pm

If we can do notification entirely through the dashboard, I would prefer that from an implementation/code standpoint. I am reluctant about committing to sending emails.

deathlessdd · February 13, 2020, 8:18pm

OK, either way, I always keep a close eye on my nodes, but the people who just set it and forget it are probably going to have serious issues, especially the people who are running 100+ nodes, although it's fine as long as everything is good on their side. There's always going to be a chance of people getting DQed because they never check on their nodes. I think, if possible, an email alert would be a good idea so people can check on these conditions ASAP.

nerdatwork · February 13, 2020, 8:23pm

As a backup, email notification is also good, but I have seen people with 10k unread emails. A blinking red banner in the web dashboard would definitely grab attention, along with a notification under the bell icon. This banner would only go away once the issue is fixed (just a thought).

deathlessdd · February 13, 2020, 8:25pm

I agree with this as well. It would probably be hard to track down a certain node if you use different emails or use the same email for every node; unless you knew exactly which node it was, you probably wouldn't find it right away. But if it was on the dashboard, you would know exactly which node it was, assuming you actually check on it from time to time.

Vadim · February 13, 2020, 8:37pm

I would suggest adding notifications about database errors, such as a malformed database or a disk I/O error, as they are critical errors that need to be monitored, and right now they can only be found in the logs.

Pentium100 · February 13, 2020, 8:39pm

I would rather get an email notification - I keep unread emails to zero (by not signing up to newsletters etc - only important stuff goes to email) and I check my email a few times per day, while I check the web dashboard very infrequently (CLI dashboard a bit more frequently).

nerdatwork · February 13, 2020, 8:45pm

I don’t mean to remove email notification altogether, just that if this is implemented in stages, then email notification could be added as a backup. The idea is to notify the SNO in every possible way rather than lose a good node because the SNO didn’t get a proper notification.

moby · February 13, 2020, 8:47pm

A notification about going into suspension would indirectly indicate these types of errors - they all result in unknown audit errors. But it would still require the node operator to look through logs to find the specific issue.

Adding notifications that originate from the storagenode in the event of these issues is a separate topic, and I am not sure if discussing them here will be particularly fruitful.
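Until something like that exists, the log search mentioned above is left to the operator. A rough sketch of such a scan follows, assuming the log contains the usual SQLite error strings ("database disk image is malformed", "disk I/O error"). The sample log lines and the category names are invented for illustration, and the exact wording in real storagenode logs is not confirmed by this thread.

```python
import re

# Assumed substrings to look for; adjust to whatever your logs actually contain.
CRITICAL_PATTERNS = {
    "malformed_database": re.compile(r"database disk image is malformed", re.IGNORECASE),
    "disk_io_error": re.compile(r"disk I/O error", re.IGNORECASE),
}


def find_critical_errors(lines):
    """Return {category: [1-based line numbers]} for lines matching a pattern."""
    hits = {}
    for number, line in enumerate(lines, start=1):
        for category, pattern in CRITICAL_PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(category, []).append(number)
    return hits


# Hypothetical log excerpt, not real storagenode output.
sample_log = [
    "INFO piecestore download started",
    "ERROR piecestore download failed: disk I/O error",
    "ERROR orders database disk image is malformed",
    "INFO piecestore upload finished",
]

report = find_critical_errors(sample_log)
```

For a real node, the same function could be fed the lines of a log file instead of `sample_log`.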

BrightSilence · February 13, 2020, 9:14pm

I assume those pieces will not be removed from the node during that repair transaction. Will they be cleaned up by garbage collection later?

moby · February 13, 2020, 9:16pm

Yes. This is the case for any unhealthy node that is removed from a segment during repair - the deletion occurs through garbage collection later.
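The two-step behavior described here can be sketched roughly as follows. This is a simplified illustration using exact sets; the function names, node IDs, and piece IDs are hypothetical, and the real satellite and storagenode code are not shown in this thread.

```python
def repair_segment(segment_nodes, unhealthy):
    """Satellite side: drop unhealthy nodes from the segment's node list.

    Nothing is deleted on the nodes themselves at this point.
    """
    return {node: piece for node, piece in segment_nodes.items()
            if node not in unhealthy}


def garbage_collect(on_disk, referenced):
    """Node side, later: keep only pieces the satellite still references."""
    return on_disk & referenced


# Hypothetical segment held by two nodes; node-b is unhealthy (e.g. suspended).
segment = {"node-a": "piece-1", "node-b": "piece-2"}
node_b_disk = {"piece-2", "piece-9"}

segment = repair_segment(segment, {"node-b"})
disk_right_after_repair = set(node_b_disk)  # piece-2 is still on disk

# Assume piece-9 belongs to some other segment that still references node-b.
node_b_disk = garbage_collect(node_b_disk, {"piece-9"})
```

Right after repair, `node-b` still holds `piece-2` on disk; it only disappears once garbage collection runs.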

moby · February 13, 2020, 9:43pm

Note to all: document has been updated with suggestions. If you want, feel free to re-review and let me know if I missed anything we discussed (or made a mistake).

baker · February 14, 2020, 12:29am

I still don’t think this arrangement is adequately explained as one of the consequences of being in the suspended state.

Unless you feel this is outside of the scope of the design draft. Either way, SNOs will need this communicated to them.

Otherwise I think the rest of the changes are good.

moby · February 14, 2020, 2:48pm

I thought this was implied - but that is mainly because I am familiar with how the database queries for storagenodes work.

So I will update the document to explicitly mention this when I have time today.