Does this mean that having a lot of small nodes in a single C block leads to none of them ever being disqualified?
It only means a small node is not getting suspended right away. Even a node holding only one single piece will get “unlucky” at some point, because the satellite audits that piece at least once every 12 hours. That might not happen in the first month, but it will happen at some point. I would expect that “a lot of small nodes” will just get slowly suspended over time, be unable to recover, and get DQed.
Besides that, the repair worker will not wait for your node to get suspended. If you are offline for a few hours, the repair job will simply move the data. This can make you a lucky node that never gets suspended, but you will constantly lose data. I don’t think that is a good tradeoff.
The “loading screen” is still present… How about you make the loading screen just a loading indicator/spinner overlaid on top of the page? That would keep the interface usable while it’s loading.
One update on the downtime tracking topic:
If you manage to collect too much downtime you will get suspended first. The satellite gives you 7 days to fix the issue, and an additional 30 days later the satellite will make a decision. So let’s say we are strict and suspend at 48 hours of downtime, and you manage to be offline for 2 days in a row. You get your storage node back online, but those 48 hours of downtime will not go away for the next 30 days, so you would be suspended for 30 days. Luckily the final decision comes 7 days after that. By that time you should have managed to get out of suspension mode and can continue just fine.
On the other side, it is also possible to collect 47 hours of downtime without triggering suspension mode. Those 47 hours happened 28 days ago. Now you are offline for 1 additional hour, and that gets you into suspension mode. 1 hour later the old downtime expires and falls out of scope, and you get out of suspension mode. Remember, the final decision is 37 days later. Let’s say you have a perfect uptime score for 35 days, but then you go offline again. Right at the moment of the final decision you have managed to get suspended again. That would trigger disqualification.
Note: the 48 hours are only used to make this example a bit easier. It gets a bit complicated if I try to explain it with 288 hours of allowed downtime. I expect that all these numbers will change. I only want to explain what you have to do when you get suspended, and what you should avoid in order to not get disqualified. I hope my explanation was not too confusing.
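The rolling-window behaviour described above can be sketched roughly like this in Python (all parameters are the example values from this post, not confirmed production settings, and `downtime_in_window` is my own illustrative helper, not Storj code):

```python
from datetime import datetime, timedelta

# Hypothetical example values from the post above: 48 h of downtime allowed
# inside a rolling 30-day window. These are NOT confirmed production settings.
WINDOW = timedelta(days=30)
LIMIT = timedelta(hours=48)

def downtime_in_window(events, now):
    """Sum the downtime intervals (start, end) that overlap the last 30 days."""
    window_start = now - WINDOW
    total = timedelta()
    for start, end in events:
        overlap = min(end, now) - max(start, window_start)
        if overlap > timedelta():
            total += overlap
    return total

# Timeline from the post: 47 h of downtime ending 28 days ago,
# plus 1 more hour of downtime ending right now.
t0 = datetime(2020, 6, 1)
events = [
    (t0 - timedelta(days=28, hours=47), t0 - timedelta(days=28)),
    (t0 - timedelta(hours=1), t0),
]

print(downtime_in_window(events, t0) >= LIMIT)                       # True: 47 h + 1 h = 48 h in the window
print(downtime_in_window(events, t0 + timedelta(hours=2)) >= LIMIT)  # False: the old 47 h slid out of scope
```

This also shows why the “4th option” is dangerous: the total inside the window can drop below the limit on its own, but going offline again near the final-decision date puts you right back over it.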
Does my node have to be suspended (offline for >48h/30d) or just offline during that decision event to be disqualified?
Tell me if I understood this correctly (numbers taken from your example):
1. Offline for >48h within 30d → suspended.
2. If still suspended 37d after the beginning or the end of those 48h → DQ.
3. If offline for less than 48h in the last 30 days → restored (so, if my node had 47h of downtime 29 days ago and one hour today, I would be suspended for 1 day, right?)
So, essentially my node can be offline for 48 hours plus 7 days, and then if it manages to be offline for less than 48 hours during the next 30 days it would be restored. Correct?
I think the “final decision time” should be displayed somewhere in the API, as well as the time periods (48h, 30d, 7d). Different satellites (when there are non-Tardigrade satellites) could have different values for these.
I also wonder what would happen if you changed the times. Would they apply retroactively? That is, if you shortened the max downtime from 48h to 24h, would all nodes that were offline for >24h at the time of the change be instantly suspended?
Does this suspension occur the instant the node comes back online, or does it take time for the network to check the metrics of the node?
Suspended or not suspended after 37 days determines disqualified or not disqualified.
There is a 4th option. Suspended, get out of suspension and 37 days later get suspended again = disqualification.
Yes that is correct.
That is also correct. Ofc the idea is to not start with 48 hours and I also don’t expect that it will be 48 hours one day.
This one kinda sucks.
Still I think the actual limit values should be visible in the API, so I could do my own tracking to know how close to DQ I am. Especially to avoid the “4th option”.
EDIT: also, let’s say my node goes offline on the 1st of the month at 00:00. After 48h (I’ll just use this value as an example) it is suspended, which happens on the 3rd, at 00:00. Now, is the “final decision time” 37 days after the 1st (the beginning of the downtime) or after the 3rd (beginning of the suspension)?
Would be great if we could get consistent email notifications about DQ and such as well, either I’m only getting suspended on europe north and salt lake or the others aren’t sending emails.
That sounds quite complicated.
If I may: we should be careful not to design something too hard to explain to SNOs, otherwise everyone is still going to flood the forums with “Got disqualified, why??” topics.
Or, if it has to be complicated, please find a way to display in a clear way (with a chart?) on nodes’ boards what is the state of the node and what will happen when, on a timeline or similar visual interface.
Also, we need to make sure that warning e-mails get sent correctly (and in time) to SNOs when their nodes get suspended
I have to disagree with that. Suspension mode for 30 days is clear. The question is how you get out of suspension mode. You might have some simple rules in mind, but the issue with them is that you might be unable to get out of suspension mode for at least 30 days. Depending on which simple rule you pick, that would mean you get disqualified. Sure, we can keep it simple and just disqualify.
For email notifications and the storage node dashboard it is way too early. First we have to finish the downtime tracking implementation, otherwise there is nothing to display. What I am doing here is transferring the knowledge about how suspension mode is going to work once it is finished and activated.
doesn’t hurt to keep it in mind that it would be great if it was shown on the SNOboard / dashboard or whatever the storagenode web stats is called.
1½ hours of DT allowed a day is pretty high tho. I understand this is most likely just for testing purposes, but just for my understanding: this will then be lowered to the 99.5%–99.3% uptime range, aka 5 hours of downtime a month?
And will such a dt allowance carry over from one month to another, so that one can do some service overhaul on a system? 5 hours isn’t a ton of time if one wants to take something apart and troubleshoot it.
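For reference, those uptime percentages translate to allowed downtime roughly like this (a back-of-the-envelope check assuming a 30-day month, not official figures):

```python
# What does an uptime percentage mean in allowed downtime per 30-day month?
HOURS_PER_MONTH = 30 * 24  # 720 h in a 30-day month (assumption for the estimate)

for uptime in (0.995, 0.993):
    allowed = (1 - uptime) * HOURS_PER_MONTH
    print(f"{uptime:.1%} uptime -> {allowed:.1f} h downtime/month")
# 99.5% uptime -> 3.6 h downtime/month
# 99.3% uptime -> 5.0 h downtime/month
```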
ofc counting towards some preset max saved allowed downtime or something… like the 48hr start limit
or would there be a way to request extended downtime for such things?
or was that just something somebody talked about and not actually a thing in planned development.
Not that I need that; I’m pretty sure I’m just about at the 5 hours allowed these last few months, and most reboots / shutdowns have not been required. But one doesn’t get many reboots when it takes like 20 minutes to do a reboot… and maybe a week or two before the system has fully recovered lol. ZFS 2.0 where art thou
Ofc it doesn’t help when one moves drives around, the boot drive gets unallocated in the BIOS, and one has been so reliant on everything working that it takes a few hours before one actually notices that it didn’t just come back online lol
I agree with @SGC though and think it wouldn’t hurt to start thinking about how it could be displayed in an insightful and legible manner on the web dashboard
I think (and really hope) that the final limit will be higher than 5 hours, because 5 hours is unrealistic when considering long-term for a home-based system.
It’s not like nodes die from going over… they just get suspended, so no uploads… like, if it had happened a month ago one would hardly have noticed a suspension lol
Ofc later that may look different and one might get a suspension at a bad time too… which might hurt. Maybe most home-based storagenodes will end up getting filled often, and then again a suspension is semi-irrelevant…
And tho 5 hours isn’t much downtime, if it accumulates from month to month I wouldn’t see it as totally unrealistic… but we can both agree that 5 hours isn’t much time to do much work on a system, which is really where I see the issue… it’s the maintenance and troubleshooting where one will run into a 5 hr limit really fast…
I have rarely had my system down for less than 2 hours even when doing minor work… and I got a grate for low-profile cards I should get installed… having them sitting loose in the PCIe slots isn’t… optimal… and tho the grate fits, kinda… I’m not convinced it will fit perfectly, so I might have to do some work on it to make it fit…
And ofc I will need to take basically the entire mobo out of the server… doing that operation in 5 hours isn’t very realistic, and I doubt it will be the last time I need to do something like that… Ofc I do kinda want to set up a cluster of servers or something in that regard; something like having the storage separate from the node host might be the easiest, most sensible way to go… which would make stuff like maintenance very easy to manage…
But yeah, not really home-user solutions… though using something like eSATA and an RPi or such that supports it would essentially provide the same options for a home user… it just requires a little bit of gear if one is serious about running a storagenode.
I’d say that 5 hours is already in “no planned downtime” territory; essentially the node cannot be offline at all, which would mean that I would need a cluster. Then I could shut down one of the servers, clean it, repair it etc. and start it back up.
If the allowed downtime accumulated (if your node was up 100% for a year, you would get 60 hours), it would be a bit less unreasonable.
60 hours I suppose is as good a max as any… but ofc 100% uptime in a year might be tricky…
that would give 1 hr dt a month during a year and then 48 hours extended dt every other year…
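A tiny sketch of the carry-over idea being tossed around here (all numbers are the hypothetical values from these posts; Storj has not announced any such scheme, and `run_months` is my own illustrative helper):

```python
# Toy accrual model: each month credits 5 h of downtime allowance, unused
# hours carry over up to a 60 h cap, and actual downtime draws the bank down.
# MONTHLY_ALLOWANCE and BANK_CAP are hypothetical numbers from the discussion.
MONTHLY_ALLOWANCE = 5.0   # hours credited per month
BANK_CAP = 60.0           # maximum hours that can be saved up

def run_months(downtime_per_month):
    bank = 0.0
    for dt in downtime_per_month:
        bank = min(bank + MONTHLY_ALLOWANCE, BANK_CAP)  # credit, then cap
        bank -= dt                                      # spend this month's downtime
        if bank < 0:
            return "over the limit"
    return f"{bank:.0f} h banked"

print(run_months([0.0] * 12))            # perfect year -> "60 h banked"
print(run_months([0.0] * 11 + [48.0]))   # 48 h overhaul in month 12 -> "12 h banked"
```

Under this model a 48-hour maintenance window is fine after roughly ten quiet months, which is the kind of headroom being asked for above.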
Ofc for such extended dt one might need to coordinate between nodes… so two nodes holding the required data don’t go offline at the same time and trigger repairs… but I don’t see why one couldn’t request dt ahead of time; since any one node is redundant, it’s just a matter of their maintenance windows not overlapping, and requesting dt ahead of time would allow that to happen.
In theory, without really fully understanding the system xD but it should work, at least from how I understand it.
And ofc, like we discussed earlier, suspension isn’t really that bad… in some cases at least…
Guys, stop being obsessed with this supposed 5-hour maximum downtime!
We’re not sure what it will be and as @BrightSilence summed it up nicely here (Just had 7 hours of downtime. 4 hours yesterday - #35 by BrightSilence) the plan doesn’t seem to limit downtime that much in the future.