Node Churn - Why did you shut down your node?

I’m sorry that you think I’m so rude. I simply took your word that these logs contain only successful audits (in which case they are useless for understanding the reason for the failed audits), and that you would need logs covering a wider period, one in which audits actually failed.
Since you did not provide a wider period, I assumed that you do not have the older logs.

2 Likes

I’m using the same email address for the forum registration and for node notifications. If Storj cared, they’d find my nodes easily.

It’s basically a two-way street. The Community Admins don’t have access to Storj data. The rest of the company doesn’t have administrative access to the forum.

Of course if the executives asked for such access, we would provide it. But generally this doesn’t happen. They leave the forum for us to manage and we advocate on your behalf to the company.

The server itself, as I understand it, is hosted by a third party who works directly with the admins.

It has always been important for Storj to keep the Community included in, but separate from, the rest of the company. This way they get a clear picture of honest opinions from users, without any pressure to manage those opinions to meet some marker or threshold.

We try to strike a fair balance between providing information from the company to all of you, and providing the company with your comments and concerns. It works well.

12 Likes

Been running nodes since the V2 days (back in 2017, I think). The only real reason I discarded any was when I had issues with a Raspberry Pi + USB disk setup, where the USB controller would periodically fail and unmount the disk. Proper x86 nodes with raidz2 (4- and 5-disk arrays) have been working reliably for at least the past 3 years (despite a few power outages and network outages).

1 Like

I lost my first node due to an exploit in the QNAP OS that allowed hackers to run ransomware and encrypt all my data. That was more than 2 years ago. Since then I started using a real server running Linux, and all my nodes are up and running so far :slight_smile:

1 Like

Was running a node back in the V2 days and created a new node when V3 entered beta, and later on at release.
After having 12-14 TB of data or so, I needed some disk space for myself and wanted to rethink the NAS. I also grew slightly tired of only getting paid every 2-3 months, with the only solution being “just get zkSync” and “““threats””” (heavy quotation marks, because it’s just what I felt it was) that the usual L1 payments would eventually stop. I’m still not sold on L2 solutions :slight_smile:

Did a graceful exit and enjoyed all the held amount I got back from the successful exit. A few months later I wanted to get into it again, so I started 3 new nodes with a better disk setup.

1 Like

I would suggest allowing an early graceful exit that returns 50% of the held balance.

Even if no held balance were refunded, some users would prefer to help by repairing their pieces to other nodes before they leave the network, instead of just deleting their nodes.

1 Like

I did a GE on my oldest 17TB node because its earnings were inferior to those of newer nodes,
and it had a decent held amount, because it was from before the Tardigrade / Storj DCS launch.

1 Like

Sadly, I lost a few nodes over the last 3 years for different reasons, testing, etc. I lost the oldest node for SMR drive reasons that are currently unknown, because the drive still works fine.

1 Like

I removed 1 node where I ran rsync and forgot to use --delete. Luckily, it was a small node that was still vetting.

1 Like

This is really good to know. Thank you for the clarity.

1 Like

If you rsync your node fully but forget to use --delete, it can destroy only the databases, which you can easily recreate in case of corruption. You could lose the stats, but the node should keep working anyway.
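For anyone migrating a node with rsync, the pattern usually suggested is to run it a few times while the node is still online, then stop the node and do one final pass with --delete. A minimal sketch, with made-up source and destination paths:

```bash
# First passes: copy pieces while the node keeps running (paths are examples).
rsync -aP /mnt/old-disk/storagenode/ /mnt/new-disk/storagenode/
rsync -aP /mnt/old-disk/storagenode/ /mnt/new-disk/storagenode/

# Stop the node, then run the final pass with --delete so files removed on the
# source (trash, deleted pieces, rotated databases) are also removed in the copy.
docker stop -t 300 storagenode
rsync -aP --delete /mnt/old-disk/storagenode/ /mnt/new-disk/storagenode/
```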

1 Like

If the last 363 audits of a node were successful, is it really realistic that audits prior to those 363 successful audits could potentially be responsible for a node getting disqualified?

And how could you possibly have known how much running time is documented in the logs if you don’t even look at them?

If the failed audits were never logged, then having 363 successful audits logged doesn’t matter much. Failed audits that aren’t logged have so far been observed in cases such as silent file system/disk failure or extremely heavy I/O starvation. We now know some common failure points that can cause heavy I/O which weren’t known in April 2020: for example SMR drives, and btrfs (for some reason especially in tandem with Synology), often visible only after a node grows to terabytes. Whether this happened on your node, I am not going to hypothesise; I’m just stating a possible scenario in which your observations may have happened.
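As a side note, a quick way to see whether any failed audits made it into the logs at all, assuming the usual docker setup with a container named storagenode and the default log wording, is something like:

```bash
# Count audit lines in the container logs (container name and log wording
# are assumptions based on a typical setup).
docker logs storagenode 2>&1 | grep GET_AUDIT | grep -c downloaded   # successful audits
docker logs storagenode 2>&1 | grep GET_AUDIT | grep -c failed       # failed audits
```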

Though, claiming that you did not receive enough help on a best-effort community forum is not proper. Nobody here is obligated to give help. It is also very difficult and time-consuming to try to debug a problem only through a forum and without direct access to the hardware; all we can do is speculate, while all experiments and observations must be done by the person asking for help. It might be in Storj’s best interest to provide help, for example in the form of community leaders, but this is a concern that needs to be balanced against many others.

3 Likes

Had nodes since the end of 2019, and never lost one so far. Although that’s mainly thanks to UptimeRobot, which always caught issues in time! And all the help from many fine folks on this forum :smiley:

Almost lost nodes many times, for reasons like:

  • Broken 3.5" HDD case with a power leak (must be rare!) that caused the disk to occasionally disconnect abruptly. This caused the node to lose data and start failing audits. I replaced the case, and this node is still around, with an audit score that got better over time as valid data accumulated. This node should have been DQed in all honesty, but the previous audit system did allow for some leeway that spared it by sheer luck…
  • Some nodes had a hard time staying on board because of a sluggish 2.5" SMR drive, which the node software still doesn’t know how to handle. When activity is high, this disk cannot keep up, data stacks up in RAM and the OOM killer strikes at some point. That caused me many suspensions until I restricted the performance of my SMR nodes (see the sketch after this list).
  • Many issues with the RPi4B hosting my nodes, as it’s definitely not a reliable 24/7 server. Main issues faced with it: USB connector unreliability, overheating (long ago, with an early firmware version and no heat sink at that time), SD card corruption before we could officially boot it from USB, an official power supply that died, and lately major issues with the USB controller…
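On the SMR throttling mentioned in the second bullet: a common workaround discussed on the forum is to cap how many concurrent uploads the node accepts, so the drive isn’t flooded faster than it can write. A sketch of what that can look like in the node’s config.yaml, assuming the storage2.max-concurrent-requests option (the value is something you would tune for your own drive):

```yaml
# config.yaml (value is an example; tune it for the specific SMR drive).
# Caps how many uploads the node accepts at once, so writes don't pile up
# in RAM and trigger the OOM killer. Restart the node after changing it.
storage2.max-concurrent-requests: 10
```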

Never had to reclaim any space yet, but I’m sad SNOs still don’t have a way to reduce the amount of allocated space via a feature announced long ago: the partial graceful exit. Or by simply reducing the value configured in the docker command…
If I had to reclaim some space, it would force me to shut down one of my nodes.
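For context on that last point: the value in question is the STORAGE parameter of the docker run command. Lowering it only stops the node from accepting new data once usage exceeds the new allocation; it does not shed existing pieces, which is why a partial graceful exit would be needed to actually free space. A minimal sketch of re-creating the container with a smaller allocation (wallet, address and paths below are placeholders):

```bash
# Example only: re-create the container with a smaller STORAGE allocation.
docker stop -t 300 storagenode
docker rm storagenode
docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp -p 28967:28967/udp -p 14002:14002 \
    -e WALLET="0xYourWalletAddress" \
    -e EMAIL="you@example.com" \
    -e ADDRESS="your.ddns.example.com:28967" \
    -e STORAGE="4TB" \
    --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
    --mount type=bind,source=/mnt/storj/data,destination=/app/config \
    --name storagenode storjlabs/storagenode:latest
```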

3 Likes

I’ve never lost a node, but had a bit of a scare yesterday. I had the first power outage in over a decade of living in this place… while my main RAID array was still expanding. I don’t have a UPS since power is so stable; I would have to replace the batteries 5 times before I ever got to use it. So… that was an instant unsafe shutdown with an array actively being reshaped. I have to say I was a little nervous about what impact that had until the power came back about 40 minutes later.

Of course, for important personal stuff hosted on that array, I have backups. But not for the nodes, and my biggest node of over 20TB is on that array. Luckily, it seems to have recovered just fine and the reshape simply resumed where it left off. Did some googling, and Synology did say it would resume, but advised using a UPS anyway. I’m not even sure a UPS would have been able to keep it online for 40 minutes, and shutting down while expanding is not possible. So yeah, all seems fine for now, but this could have had a worse outcome.

5 Likes

If the UPS has a big enough battery, it can keep powering things for a long time.

The power is pretty stable for me as well, but I have multiple layers of UPSs and am setting up a generator.

That’s surprising. I think LVM can handle that just fine, and mdraid as well. Isn’t Synology basically using these two in tandem?

They tend to be overly cautious. I haven’t actually tried shutting it down; I just read that somewhere, so I can’t confirm they actually don’t allow it. Either way, I can now confirm the expansion finished successfully and everything is purring along just fine. And yeah, Synology uses mdraid + LVM for SHR.

2 Likes

After several years, I shut down my Raspberry Pi 4 node which had just under 4 TB of utilized storage (external 7200 RPM USB 3.0 drive) via the graceful exit process. I had already received back most funds that had been held, and also had another node running within my home VMware ESXi environment.

The reason? Raspberry Pis became super hard to find during the pandemic supply chain woes, and I had other fun, geeky projects to work on! And the income lost from this node was offset by my other Storj node.

2 Likes