The (possible, temporary) end of 5+ years of Storj node operation

Doddophonique · February 20, 2023, 2:28pm

I have participated in Storj since V2, starting with a 320GB Maxtor HDD from 2008 on a Rock64 running Devuan Jessie.

The only downtime I’ve had during this time was:

The upgrade from the Rock64 to a RockPro64 (2GB RAM) with Void Linux.
The upgrade of the UPS to a beefier one.
The upgrade from the RP64 2GB to a RP64 4GB.
My ISP generally messing up the routing.

In October 2022, the UPS had a fault (probably a fuse went kaboom), and I had to instruct my elderly father to reconnect everything without the UPS and press every button that was needed (a one-and-a-half-hour phone call), as I’m working and living ~800km from there. Coincidentally, I had been there one week prior, using the last holidays I had from work in 2022.

Every time the power went off, I got bombarded with UptimeRobot notifications and was frantically messaging my father to “push the first button from the top on the left side”, hoping I wasn’t losing ~15TB of nodes. I planned to go there as soon as I could, in March, to put everything behind a UPS again.

Alas, today is the day something broke. I suspect one of the HDDs is not spinning up anymore, and the 0 2 in the /etc/fstab is preventing the device from booting altogether. No way of fixing it remotely, no way of having someone on site fix it for me.
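As a rough illustration of the fstab problem described above, here is a minimal Python sketch that flags entries which could hold up boot when their disk is missing. It assumes the usual six-column fstab layout and relies on the `nofail` and `noauto` mount options documented in fstab(5) and mount(8). Whether adding `nofail` would actually have let this particular Void Linux box boot is an assumption, not something confirmed in the thread. The sample table, device names, mount points and the function name are all hypothetical.

```python
# Hypothetical fstab contents; the real file on the node is not shown in the thread.
SAMPLE_FSTAB = """\
# <device>      <mount point>  <type>  <options>        <dump> <pass>
/dev/mmcblk1p1  /              ext4    defaults         0      1
/dev/sda1       /mnt/storj1    ext4    defaults         0      2
/dev/sdb1       /mnt/storj2    ext4    defaults,nofail  0      2
"""


def entries_that_can_block_boot(fstab_text):
    """Return mount points of non-root entries lacking nofail/noauto."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line, ignore in this sketch
        mount_point, fs_type, options = fields[1], fields[2], fields[3].split(",")
        if mount_point == "/" or fs_type == "swap":
            continue  # the root filesystem is required anyway
        if "nofail" not in options and "noauto" not in options:
            risky.append(mount_point)
    return risky


print(entries_that_can_block_boot(SAMPLE_FSTAB))
```

In this sample only `/mnt/storj1` is reported, because the second data disk already carries `nofail`.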

It’ll probably be my first time with suspended nodes (4 in total) in many years. It has been a nice ride. I guess I will spin up a couple of new ones when I go back to fix that mess, but it will surely take a long time to have everything back as it was.

I hope you are having a better day

syncamide · February 20, 2023, 7:35pm

Hi.
Where are you from?

Doddophonique · March 4, 2023, 9:04pm

I sure hope I got it back online in time! But that uptime is really scary

Doddophonique · March 4, 2023, 10:45pm

I guess this means that the node is suspended even though the suspension score is at 100%, right?

Edit: ok, I was confusing disqualification with suspension. I see disqualification comes after 30 days. I remembered it being a little tighter, but that’s better for me, I guess.

Now let’s hope that the new UPS stays online for a long time

BrightSilence · March 4, 2023, 11:04pm

The suspension is due to downtime. Unfortunately you dropped below 60%, but not by much. You’ll get 30 days to recover from that. Just keep them online. Unfortunately you won’t get any new data until the suspension is lifted, which will take roughly 18 days in your case if you are 100% online from now on. The percentage won’t change until your downtime starts aging out of the rolling 30-day window, which will start happening in 18 days, given your downtime of 12 days. Just be patient and you’ll be back to full force with this node. But please make sure you don’t get additional downtime during the upcoming 30 days. If you’re below 60% at the end of the 30 days + 7-day grace period, you’ll be permanently disqualified.
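The 18-day figure can be checked with a small Python sketch. This is a deliberately simplified model, not the satellite's actual calculation: it assumes the online score is just the fraction of online days among the last 30, tracked at whole-day granularity, so the score sits at exactly 60% rather than slightly below it. The function name and constants are made up for illustration.

```python
WINDOW_DAYS = 30  # rolling window length mentioned in the post


def online_score(history, window=WINDOW_DAYS):
    """Fraction of online days among the most recent `window` days."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# Simplified starting point: 18 days online followed by 12 days offline.
history = [True] * 18 + [False] * 12

# scores[n] is the score after n further days of being fully online.
scores = [online_score(history)]
for _ in range(30):
    history.append(True)
    scores.append(online_score(history))

for day in (0, 18, 19, 24, 30):
    print(f"after {day:2d} days online: {scores[day]:.1%}")
```

In this model the score stays flat for the first 18 days, because each new online day only pushes an old online day out of the window. It starts climbing from day 19, when the offline days begin to age out, and reaches 100% after 30 days.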

Doddophonique · March 4, 2023, 11:21pm

Do you think a downtime of ~10-20m in the next day(s) would be detrimental to coming back from suspension? I still need to put it behind the UPS and possibly replace the device outright to have it reliably 100% online in the next months/years.

dragonhogan · March 4, 2023, 11:36pm

Recovery of the suspension score happens really quickly once the node(s) are brought back online, from what I’ve seen happen with my nodes over the years. So another 10-20 minutes won’t hurt anything.

BrightSilence · March 5, 2023, 3:33am

It shouldn’t. Just don’t have more days of offline time.

Alexey · March 5, 2023, 5:31am

This is true only for the suspension score - your node just needs to start passing all audits.
But the online score can be fully recovered only after 30 days online. Each downtime requires an additional 30 days to recover.
This is due to the 30-day rolling window.

Doddophonique · March 5, 2023, 4:15pm

So, with this knowledge, I’m assuming that for a node that needs to recover from offline suspension (using Docker), the best approach is to stop auto-updating it with Watchtower until it gets unsuspended.

For example, I put everything behind the UPS and had two minutes of downtime. From what I understand from this link, if I got an audit in that two-minute window, my nodes were considered offline.
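To get a feel for how much a single missed audit could matter, here is a hedged Python sketch of a window-based score. It assumes the online score is the average of per-window ratios of online audits to total audits over the 30-day tracking period. The 12-hour window length and the 10 audits per window are illustrative assumptions, not values confirmed in this thread.

```python
WINDOW_HOURS = 12        # assumed window length
TRACKING_DAYS = 30       # tracking period discussed in the thread
AUDITS_PER_WINDOW = 10   # hypothetical audit rate

WINDOW_COUNT = TRACKING_DAYS * 24 // WINDOW_HOURS  # 60 windows


def online_score(windows):
    """Average of online/total ratios; windows without audits are skipped."""
    ratios = [online / total for online, total in windows if total > 0]
    return sum(ratios) / len(ratios) if ratios else 1.0


# Every audit answered in every window.
perfect = [(AUDITS_PER_WINDOW, AUDITS_PER_WINDOW)] * WINDOW_COUNT

# Same, except one audit arrived during a short outage in the last window.
one_missed = perfect[:-1] + [(AUDITS_PER_WINDOW - 1, AUDITS_PER_WINDOW)]

print(f"no missed audits: {online_score(perfect):.4f}")
print(f"one missed audit: {online_score(one_missed):.4f}")
```

Under these assumptions a single offline audit costs well under one percentage point, which is in line with the replies above saying that 10-20 minutes of downtime should not hurt.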

Also:
“The review period consists of one grace period and one tracking period. The grace period is given to fix whatever issue is causing the downtime. After the grace period has expired, any offline audits will fall within the scope of the tracking period, and thus will be used in the node’s final evaluation. If at the end of the review period, the node is still suspended, it is disqualified. Otherwise, the node is no longer under review.”

But I see no indication of the length of the grace period. Does anyone have clear timeframes for this?

GollyTicker · March 5, 2023, 7:48pm

I had a similar question recently. The answer might be relevant for you: Node had to be offline for a few days. Continue or restart node from fresh?

BrightSilence · March 5, 2023, 8:56pm

The process is outlined here: storj/docs/blueprints/storage-node-downtime-tracking-with-audits.md at e24262c2c9d7b2d48897d0de767f8012d95c444c · storj/storj · GitHub

The lengths of the periods aren’t mentioned there, but it’s 7 days for the grace period and 30 days for the review window.

The dashboard doesn’t display any of this information, but the grace period starts the moment you dropped below 60% and that moment is listed in your notification.
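Since the dashboard does not show these dates, they can be worked out by hand from the timestamp in the suspension notification. A minimal Python sketch, assuming the 7-day grace period is followed by the 30-day review window as described above. The timestamp below is a made-up placeholder, not the real one from this thread, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=7)    # grace period, per the post above
REVIEW_WINDOW = timedelta(days=30)  # review window, per the post above


def review_deadlines(suspended_at):
    """Return (end of grace period, final evaluation time)."""
    grace_end = suspended_at + GRACE_PERIOD
    evaluation = grace_end + REVIEW_WINDOW
    return grace_end, evaluation


# Placeholder: replace with the moment listed in your own notification.
suspended_at = datetime(2023, 3, 1, 12, 0, tzinfo=timezone.utc)

grace_end, evaluation = review_deadlines(suspended_at)
print("grace period ends:", grace_end.isoformat())
print("final evaluation: ", evaluation.isoformat())
```

With the placeholder date of March 1, the grace period ends on March 8 and the final evaluation falls on April 7.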

Alexey · March 6, 2023, 4:26am

No, you should not. The updater is integrated into the docker container, so it will update storagenode anyway when a new version becomes available. So it makes no sense to disable watchtower - it only updates the base image when that changes. We rarely change the base image (security updates mostly).