Changelog v1.12.3

A post was split to a new topic: After this update i have some zeros

The "loading screen" is still present… How about making the loading screen just a loading indicator/spinner overlaid on top of the page? That would keep the interface usable while it's loading.

5 posts were merged into an existing topic: SNO Board not working due to untrusted satellite

One update on the downtime tracking topic:

If you manage to collect too much downtime you will get suspended first. The satellite gives you 7 days to fix the issue, and an additional 30 days after that the satellite will make a decision. So let's say we are strict and suspend for 48 hours of downtime, and you manage to be offline for 2 days in a row. You get your storage node back online, but these 48 hours of downtime will not go away for the next 30 days, so you would stay suspended for 30 days. Luckily, the final decision comes 7 days after that. By that time you should have managed to get out of suspension mode and can continue just fine.

On the other side, it is also possible to collect 47 hours of downtime without triggering suspension mode. Those 47 hours are now 28 days old. Now you are offline for 1 additional hour and that gets you into suspension mode. 1 hour later the old downtime expires and falls out of scope, and you get out of suspension mode. Remember, the final decision comes 37 days later. Let's say you have a perfect uptime score for 35 days but then you go offline again. Right at the moment the final decision is due, you manage to get suspended again. That would trigger disqualification.

Note: The 48 hours are only used to make this example a bit easier. It gets a bit complicated if I try to explain it with 288 hours of allowed downtime. I expect that all these numbers will change. I only want to explain what you have to do when you get suspended and what you should avoid in order to not get disqualified. I hope my explanation was not too confusing.
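
If it helps, here is a rough sketch in Python of the rolling-window idea from the examples above. The 48-hour limit, the 7-day grace period and the 30-day windows are only the example values from this post; the real implementation and numbers may well look different.

```python
from datetime import timedelta

# Example values only (taken from the post above); the real limits may differ
# and are expected to change.
ALLOWED_DOWNTIME = timedelta(hours=48)   # downtime that triggers suspension
WINDOW = timedelta(days=30)              # rolling window the downtime is counted over
GRACE = timedelta(days=7)                # time given to fix the issue
REVIEW = timedelta(days=30)              # additional time before the final decision

def downtime_in_window(events, now):
    """events = list of (timestamp, downtime) entries; only entries recorded
    during the last 30 days still count against the node."""
    return sum((d for t, d in events if t > now - WINDOW), timedelta())

def suspended(events, now):
    return downtime_in_window(events, now) >= ALLOWED_DOWNTIME

def final_decision(events, suspended_at):
    """37 days (7 + 30) after the suspension started the satellite looks again:
    suspended at that moment -> disqualified, otherwise the node continues."""
    review_time = suspended_at + GRACE + REVIEW
    return "disqualified" if suspended(events, review_time) else "ok"
```

The second scenario above is exactly the nasty case: you drop back under the limit almost immediately, but if you happen to be over it again right at `review_time`, that is the disqualification.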

Does my node have to be suspended (offline for >48h/30d) or just offline during that decision event to be disqualified?

Tell me if I understood this correctly (numbers taken from your example):
1. Offline for 48h/30d → suspended.
2. If still suspended 37d after the beginning (or the end?) of the 48h → DQ.
3. If offline for less than 48h in the last 30 days → restored (so, if my node had 47h of downtime 29 days ago and one hour today, I would be suspended for 1 day, right? See the rough check below.)

So, essentially my node can be offline for 48 hours plus 7 days and then, if it manages to be offline for less than 48 hours during the next 30 days, it would be restored. Correct?
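
If I plug point 3 into that kind of rolling-window arithmetic (a sketch with the example numbers only, not the real satellite logic), it does come out as roughly one day of suspension:

```python
from datetime import datetime, timedelta

ALLOWED = timedelta(hours=48)
WINDOW = timedelta(days=30)

now = datetime(2020, 7, 1)
# 47 hours of downtime recorded 29 days ago, plus 1 more hour recorded today.
events = [(now - timedelta(days=29), timedelta(hours=47)),
          (now, timedelta(hours=1))]

def downtime_in_window(events, at):
    return sum((d for t, d in events if t > at - WINDOW), timedelta())

print(downtime_in_window(events, now) >= ALLOWED)
# True  -> suspended today
print(downtime_in_window(events, now + timedelta(days=1, hours=1)) >= ALLOWED)
# False -> the 47 h have aged out of the 30-day window, so only ~1 day suspended
```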

I think the "final decision time" should be displayed somewhere in the API, as well as the time periods (48h, 30d, 7d). Different satellites (once there are non-Tardigrade satellites) could have different values for these.
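
Something along these lines would be enough for me. To be clear, these field names are purely hypothetical; nothing like this exists in the current storagenode API as far as I know:

```python
# Purely hypothetical response shape -- none of these fields exist today.
suspension_info = {
    "allowed_downtime_hours": 48,       # per satellite
    "tracking_window_days": 30,
    "grace_period_days": 7,
    "review_period_days": 30,
    "downtime_in_window_hours": 12.5,   # what the satellite has tracked so far
    "suspended_at": None,               # set when the node gets suspended
    "final_decision_at": None,          # when the satellite will review again
}
```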

I also wonder what would happen if you changed the times. Would they apply retroactively? That is, if you shortened the max downtime from 48h to 24h, would all nodes that were offline for more than 24h at the time of the change be instantly suspended?

Does this suspension occur the instant the node comes back online, or does it take time for the network to check the node's metrics?

Suspended or not suspended after 37 days determines disqualified or not disqualified.

There is a 4th option. Suspended, get out of suspension and 37 days later get suspended again = disqualification.

Yes that is correct.

That is also correct. Of course the idea is not to start with 48 hours, and I also don't expect that it will be 48 hours one day.

This one kinda sucks.

Still I think the actual limit values should be visible in the API, so I could do my own tracking to know how close to DQ I am. Especially to avoid the "4th option".

EDIT: also, let's say my node goes offline on the 1st of the month at 00:00. After 48h (I'll just use this value as an example) it is suspended, which happens on the 3rd at 00:00. Now, is the "final decision time" 37 days after the 1st (the beginning of the downtime) or after the 3rd (the beginning of the suspension)?
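
Spelling out the two readings with concrete dates (which one actually applies is exactly what I'm asking):

```python
from datetime import datetime, timedelta

downtime_start = datetime(2020, 8, 1, 0, 0)               # node goes offline on the 1st
suspension_start = downtime_start + timedelta(hours=48)   # suspended on the 3rd

print(downtime_start + timedelta(days=37))    # 2020-09-07 00:00:00 if counted from the downtime
print(suspension_start + timedelta(days=37))  # 2020-09-09 00:00:00 if counted from the suspension
```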

Would be great if we could get consistent email notifications about DQ and such as well; either I'm only getting suspended on europe north and salt lake, or the others aren't sending emails.

That sounds quite complicated.

If I may: we should be careful not to design something too hard to explain to SNOs, otherwise everyone is still going to flood the forums with "Got disqualified, why??" topics.

Or, if it has to be complicated, please find a way to display clearly (with a chart?) on nodes' dashboards what the state of the node is and what will happen when, on a timeline or similar visual interface.

Also, we need to make sure that warning e-mails get sent correctly (and in time) to SNOs when their nodes get suspended :slight_smile:

I have to disagree with that. Suspension mode for 30 days is clear. The question is how you would like to get out of suspension mode. You might have some simple rules in mind, but the issue with them is that you might be unable to get out of suspension mode for at least 30 days. Depending on which simple rule you pick, this would mean that you get disqualified. Sure, we could keep it simple and just disqualify.

For email notifications and the storage node dashboard it is way too early. First we have to finish the downtime tracking implementation, otherwise there is nothing to display. What I am doing here is transferring the knowledge about how suspension mode is going to work once it is finished and activated.

Doesn't hurt to keep in mind that it would be great if it were shown on the SNOboard / dashboard, or whatever the storagenode web stats page is called.

1½ hours of DT allowed a day, though, is pretty high. I understand this is most likely just for testing purposes, but just for my understanding: this will then be lowered to the 99.5%–99.3% uptime range, aka about 5 hours a month of downtime?
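
For reference, the rough arithmetic behind those percentages over a 30-day month (my own back-of-the-envelope numbers, nothing official):

```python
hours_per_month = 30 * 24   # 720 h in a 30-day month

for uptime in (0.995, 0.993):
    allowed = hours_per_month * (1 - uptime)
    print(f"{uptime:.1%} uptime -> about {allowed:.1f} h of downtime per month")
# 99.5% uptime -> about 3.6 h of downtime per month
# 99.3% uptime -> about 5.0 h of downtime per month
```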

And will such a DT allowance carry over from one month to another, so that one can do a service overhaul on a system? 5 hours isn't a ton of time if one wants to take something apart and troubleshoot it.
Ofc counting towards some preset maximum of saved allowed downtime or something… like the 48 hr starting limit.

Or would there be a way to request extended downtime for such things?
Or was that just something somebody talked about, and not actually a thing in planned development?

Not that I need that; pretty sure I'm just about at the 5 hours allowed these last few months, and most reboots / shutdowns have not been required, but one doesn't get many reboots in when it takes like 20 minutes to do a reboot… :smiley: and maybe a week or two before the system has fully recovered lol. ZFS 2.0, where art thou?

Ofc it doesn't help when one moves drives around, the boot drive gets unallocated in the BIOS, and then during a reboot one has been so reliant on everything working that it takes a few hours before one actually notices that it didn't just come back online lol.

Fair enough :+1:

I agree with @SGC though and think it wouldn't hurt to start thinking about how it could be displayed in an insightful and legible manner on the web dashboard :slight_smile:

I think (and really hope) that the final limit will be higher than 5 hours, because 5 hours is unrealistic over the long term for a home-based system.

It's not like they die from going over… they just get suspended, so no uploads… like, say, if it had happened a month ago one would hardly have noticed a suspension lol.

Ofc later that may look different and one might get a suspension at a bad time too… which might hurt. Maybe most home-based storagenodes will end up being filled most of the time, and then again a suspension is semi-irrelevant…

And tho 5 hours isn't much downtime, if it accumulates from month to month I wouldn't see it as totally unrealistic… but we can both agree that 5 hours isn't much time to do much work on a system, which is really where I see the issue… it's during maintenance and troubleshooting that one will run into a 5 hr limit really fast…

I have rarely had my system down for less than 2 hours, even just when doing minor work… and I got a grate for low-profile cards I should get installed… having them sitting loose in the PCIe slots isn't … optimal … and tho the grate fits, kinda… I'm not convinced it will fit perfectly, so I might have to do some work on it to make it fit…

And ofc I will need to take basically the entire mobo out of the server… doing that operation in 5 hours isn't very realistic, and I doubt it will be the last time I need to do something like that… Ofc I do kinda want to set up a cluster of servers or something in that regard; something like having the storage separate from the node host might be the easiest, most sensible way to go… which would make stuff like maintenance very easy to manage…

But yeah, not really home-user solutions… but using something like eSATA and having an RPi or such that supports it… would essentially provide the same options for a home user… it just requires a little bit of gear if one is serious about running a storagenode.

I'd say that 5 hours is already in the "no planned downtime" range: essentially the node cannot be offline at all, which would mean that I would need a cluster. Then I could shut down one of the servers, clean it, repair it etc. and start it back up.
If the allowed downtime accumulated (if your node was up 100% for a year, you would get 60 hours) it would be a bit less unreasonable.
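
(Just to spell out that 60-hour figure: it's the 5 h/month discussed above banked over 12 months, assuming unused allowance carries over; a toy model only:)

```python
MONTHLY_ALLOWANCE_H = 5   # the hypothetical 5 h/month figure from the discussion

def banked_downtime_h(months_of_perfect_uptime, used_h=0):
    """Toy accrual model: unused downtime allowance carries over month to month."""
    return months_of_perfect_uptime * MONTHLY_ALLOWANCE_H - used_h

print(banked_downtime_h(12))             # 60 h after a year of perfect uptime
print(banked_downtime_h(12, used_h=8))   # 52 h left after an 8 h maintenance window
```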

60 hours I suppose is as good a max as any… but ofc 100% uptime in a year might be tricky…
That would give 1 hr of DT a month during the year and then 48 hours of extended DT every other year…

Ofc for such extended DT one might need to coordinate between nodes… so that two nodes holding required data don't go offline at the same time and trigger repairs… but I don't see why one couldn't request DT ahead of time; since any one node is redundant, it's just a matter of keeping their maintenance asynchronous, and requesting it ahead of time would allow that to happen.

In theory, without really fully understanding the system xD, but it should work, at least from how I understand it.

And ofc, like we discussed earlier, suspension isn't really that bad… in some cases at least…

Guys, stop obsessing over this supposed 5-hour maximum downtime! :slight_smile:
We're not sure what it will be, and as @BrightSilence summed it up nicely here (Just had 7 hours of downtime. 4 hours yesterday - #35 by BrightSilence), the plan doesn't seem to limit downtime that much in the future.

5 posts were merged into an existing topic: Version differences in powershell and dashboard