Restarted suspended node: A week of uptime, no Online-% improvement, constant disk IO, climbing trash?

After restarting a node that was down for 2 weeks, my Online percentages for all satellites still hover in the mid/high 50s, with no improvement. The node is ‘Online’ on v1.24.4 and shows successes in the logs (https://support.storj.io/hc/en-us/articles/360029233952-Some-statistics-from-logs). However, it’s also showing continuous disk activity, and the trash count has climbed from about 200GB to almost 500GB in a week.

When a node restarts… does it go through everything stored to determine what’s not needed anymore (lost contracts for disk space get flagged as trash)? The node is using SMR drives, which I know are slower: maybe it hasn’t finished some sort of cleanup… but if I wait long enough it will complete and start increasing my Online metric again?

For now I’ll just leave it. I’m just not sure if this is normal when recovering from a suspension or not. Thanks!

It will take some time for your online scores to recover. If you keep your node online, it will take 30 days to get back to 100%.

On start-up the node looks at each piece to determine the amount of space being used by the node. This happens every time the node is started/restarted. This process has nothing to do with moving items to the trash. But since your node was offline for so long, it may have received the bloom filters from the satellites which will trigger trash collection. If these two processes are happening at the same time, your node is going to be hitting the hard drives quite hard with random IOPS, and as you mentioned SMR is not ideal for this.
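To give a feel for why that start-up scan is so heavy on the disk, here is a minimal sketch of the general idea: walk a directory tree and stat every file to total up the space used. This is not the actual storagenode code; the function name `used_space_bytes` and the directory you pass in are assumptions for illustration only.

```python
import os

def used_space_bytes(root):
    """Sum the sizes of all non-directory entries under root, one stat call per entry."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # One metadata lookup per file: with millions of small pieces
                # this becomes a long stream of random reads.
                total += os.lstat(path).st_size
            except OSError:
                # File vanished or is unreadable between listing and stat; skip it.
                continue
    return total
```

Every file costs at least one metadata read, so the run time grows with the number of pieces rather than with the total size, which is why a slow drive holding many pieces can stay busy for a long time.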

Both the startup “filewalker” process and the trash collection process will eventually finish, and things should calm down after that. My advice would be to avoid restarting the node during this time.

That makes sense, and aligns with what I’m seeing. Thanks for explaining!

Also, depending on how much data your node stores and the type of SMR drive you have, be aware that the process can take an awfully long time. It takes almost 30 hours on one of my nodes, which uses a poor 2.5" SMR drive storing 2TB of data… Just so you know :)

As long as your iowait doesn’t go crazy on your system (can be checked with top on linux) and as long as RAM usage doesn’t grow continuously, then it’s just doing its thing normally.

If iowait and/or RAM usage start climbing abnormally, then something’s off.
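If you would rather log iowait over time than watch top, here is a rough sketch for Linux that reads the aggregate `cpu` line of `/proc/stat` twice and works out the share of CPU time spent in iowait in between. The function names and the 5-second interval are my own choices, not anything provided by Storj.

```python
import time

def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")

def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two snapshots, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal (guest, guest_nice).
    # Guest time is already counted in user/nice, so only the first eight are summed.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total > 0 else 0.0

def sample_iowait(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return the iowait percentage."""
    with open(path) as f:
        first = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.read())
    return iowait_percent(first, second)

if __name__ == "__main__":
    print(f"iowait over the last 5 s: {sample_iowait():.1f}%")
```

Run it from cron or a loop and keep the numbers: a high but steady iowait during the filewalker and trash collection is expected, while a value that keeps growing day after day is the thing to worry about.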

In case someone else goes through this…

…the system stayed at high load/iowait for 8 days (8TB stored), with Trash climbing as high as about 480GB. Then over the next couple of days the Trash dropped into the mid-300s. And now, 12 days after starting the node on v1.24.4… my Online scores finally went up one tick (e.g. one satellite went from 58.19% Online to 59.94%). And Trash is in the 280s and seems to be dropping…

So if you have lots stored, and slow SMR drives… be very patient if you’re coming back online after a suspension.
