Disqualified after 3 years

ACarneiro · August 31, 2023, 6:56pm

And it isn’t how it is, provided the nodes are running in the sanctioned way (one node, one HDD).
I do feel OP’s pain but it seems like unfortunately the non-standard setup caused the problem and Storj can’t really cater for the innumerable ways of setting up a non-standard system to make sure disqualifications don’t happen too quickly.

BrightSilence · August 31, 2023, 7:12pm

Sure, if you completely ignore the fact that they implemented the read/write check to exactly solve the problem you outline and conveniently skip over the fact that your non-standard and advised against setup broke that check, then yeah, all of those concerns have gone nowhere. Have you given thought to why all the topics you found are two years old? Let me give you a hint, no one has run into this issue anymore in more recent times. I wonder why that is…?

AndMetal · August 31, 2023, 8:03pm

So just to be clear, are you saying it’s Storj’s position that, if you have a larger amount of data (so more frequent audits) and you have a temporary issue preventing blobs from being read (that can be easily rectified once identified with no data loss, like this: Disqualified. Could you please help to figure out why?) that operators should be disqualified and forced to start from the beginning if they want to continue with the project? If not, please help me understand what you’re trying to convey, because that’s how it’s coming across to me.

Is seeing everything return I/O or empty file errors when trying to read from inside the blobs directory not something that can be done when trying to serve data or respond to an audit? In other words if 100% of reads are failing why would the node not terminate? Why is one arbitrary file in the root of a large directory structure (18.5M files across 6,150 directories) the only safety net?

snorkel · August 31, 2023, 8:17pm

I’m curious… you only got the disqualification emails? There should be “node offline” emails too.

BrightSilence · August 31, 2023, 8:29pm

Clearly not, hence why the read/write check was implemented to check whether there is a mount issue. You can keep ignoring your part in why that system didn’t work for you, but you’re not getting better answers. For the node, it looked like your HDD was available, yet all the data was gone. Do you expect to not be disqualified if all data is suddenly gone while the HDD seems to be accessible?

It does! But you broke that check with your setup.

Because if you allow missing data or unreadable data to pass the audit system, it opens the door to manipulating it to get away with actual data loss.

Nope, because the node never went offline, since the storage location was readable, just the blobs weren’t.

AndMetal · August 31, 2023, 8:47pm

My understanding is this fixes the “loose cable” scenario, but not pietro’s which if I read that thread correctly was a system that was still responding (dashboard was still up, was getting and attempting to respond to audits) but the audit data wasn’t making it back to the satellite due to some temporary issue in the kernel (under Docker it sounds like). I feel like my scenario is a little closer to that than a USB drive that lost connection.

Where in the documentation is using Storage Spaces or junctions under Windows advised against? The closest thing I could find was “The network-attached storage location could work, but it is neither supported nor recommended!” (Storage Node - Storj Docs). To be clear this exact same setup has worked flawlessly for YEARS (since August 3rd, 2020). The only major issue I’ve had was updates failing to install automatically a few years ago which I eventually remedied.

Those are just the ones linked to what jammerdan posted (New audit scoring is live). It took me a while to read through them, and unfortunately I don’t have unlimited time to scour the forum, Reddit posts, etc. In reading through some of those posts though, I’m curious if your opinions have changed over the years:

Put the node into an offline/suspended state when audits are failing

Put the node into an offline/suspended state when audits are failing

I think worst scenario is what we have right now: Quick disqualification without prior notification, without indication and without the possibility to revoke.

Right, so this is the core issue. And I think most of what we’ve seen recently has some common characteristics. It tends to be that the node becomes unresponsive and doesn’t even log, but apparently does just enough to signal to the satellite that it is still online. I’m with @Alexey that this isn’t all that common, but at the same time, it may be the most common issue that node operators face atm. We’re still dealing with nodes that run into a failure state though and almost certainly not one caused by the node software itself. So ideally we would get better ways to monitor that so we can take better care of our nodes. But I don’t think it’s on Storj to fix the underlying issues here or be more lenient on the requirements.

Put the node into an offline/suspended state when audits are failing

Put the node into an offline/suspended state when audits are failing

Then maybe nodes should not be disqualified in 4 hours with no notification and no chance to get them back when the OS freezes?

This may have gotten lost in the discussion, but I definitely agree with this as well. I do think the “when the OS freezes” part is key here though. I think the faster the better to disqualify nodes that have actually lost data or corrupted data.
This problem does need a solution. And for some reason it’s been popping up on the forum more than it did before. Enough that I think it’s something that needs to be looked into. I just don’t agree with all solutions that have been proposed here. I think there are better ways to fix it.

AndMetal · August 31, 2023, 8:55pm

Correct, the node was up the entire time (before the drive went offline, while the drive was offline, and after I brought the drive back online which involved opening up the Storage Spaces control panel and clicking Bring Online, about a 2 minute process).

What I’m gathering is that if either I had node_storage and blobs/temp/trash all on the same “drive” or pointed the config file to a different location for the databases so that node_storage and everything inside it could be on the same “drive” then the read/write check mechanism (which I believe only reads and writes to 1 specific file, node_storage/storage-dir-verification) would have caught it (because storage-dir-verification would have disappeared and wouldn’t be able to be recreated until the drive came back online) and the node would have gone offline within at most a few minutes, which would eventually generate the offline Email and should have prevented me from being disqualified.

littleskunk · August 31, 2023, 9:08pm

For sure because years ago that read and write check was not in place / was missing a timeout to catch some other edge cases.

BrightSilence · August 31, 2023, 9:21pm

It’s not. Your situation is literally a storage location being unavailable. Which is a solved issue under normal setups.

The forums and documentation is littered with the recommendation to use one HDD per node. That clearly covers storage spaces and you can’t really expect storj to address any for of exotic setup like with junctions. But yeah, you’re using junctions to also combine HDD’s so it’s covered by that as well.
There is also an official way to move db’s to SSD which you didn’t use.
I know it sucks, but your setup was a little janky and you’ve added more failure surfaces by adding more HDD’s that were critical for the working of your node. You can’t really blame Storj when running such a setup.

They haven’t. I feel a little like a broken record here, but the issue I addressed there has since been resolved. Your exotic setup broke that solution.

This is literally what the node now does.

Toyoo · August 31, 2023, 10:57pm

Actually, the safety feature I brought in my first post, and which your junctions defeated was implemented exactly to address these concerns. We then stopped getting cries for reverting disqualification. This was a good day in Storj’s node operator community.

Yep, this is correct. That’s how it is designed to work.

Before this safety feature was designed I had a swatchlog-based setup designed to catch these kinds of problems. It’d kill the node the moment there is a single failed audit. Thankfully I never really had it trigger outside of my manual tests. I no longer need it, the write check is sufficient. If you insist on making a non-standard setup, this would probably be the first thing to do, though.

jammerdan · September 1, 2023, 3:21am

Well I still see that a node can be disqualified within hours compared to a node can be offline for 30 days.
So basically what anyone would have to do is to count consecutive failed audits from the logs and when it hits a certain threshold then rather shut the node off.
Then you’re better off and have 30 days time to fixe the underlying issue.
My thoughts are that something like this should be the standard way. But I must admit, I always think of a honest node operator who is facing a real issue and not a node node operator who is gaming the system.

Alexey · September 1, 2023, 3:39am

The writeability check creates a small file inside the storage folder, i.e.:

> ls X:\storagenode2\storage\

    Directory: X:\storagenode2\storage


Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         7/17/2019   1:03 AM                blob
d-----         1/26/2021   3:21 AM                blobs
d-----          9/1/2023   5:28 AM                garbage
d-----          7/5/2023   6:53 AM                temp
d-----         1/30/2021  10:03 PM                trash
-a----          9/1/2023   5:44 AM       56696832 bandwidth.db
-a----          9/1/2023   5:44 AM          32768 bandwidth.db-shm
-a----          9/1/2023   5:45 AM        6019352 bandwidth.db-wal
-a----          9/1/2023   3:02 AM         143360 heldamount.db
-a----         8/29/2023   3:10 PM          16384 info.db
-a----         8/29/2023   3:10 PM          24576 notifications.db
-a----         8/29/2023   4:33 PM          32768 orders.db
-a----          9/1/2023   5:38 AM          32768 orders.db-shm
-a----          9/1/2023   5:38 AM              0 orders.db-wal
-a----         8/29/2023   3:10 PM          24576 pieceinfo.db
-a----          9/1/2023   5:28 AM          32768 pieceinfo.db-shm
-a----          9/1/2023   5:28 AM              0 pieceinfo.db-wal
-a----          9/1/2023   5:36 AM        9605120 piece_expiration.db
-a----          9/1/2023   5:44 AM          32768 piece_expiration.db-shm
-a----          9/1/2023   5:36 AM             32 piece_expiration.db-wal
-a----          9/1/2023   4:54 AM          24576 piece_spaced_used.db
-a----          9/1/2023   5:24 AM          32768 piece_spaced_used.db-shm
-a----          9/1/2023   5:24 AM             32 piece_spaced_used.db-wal
-a----         8/29/2023   3:10 PM          24576 pricing.db
-a----          9/1/2023   3:00 AM          36864 reputation.db
-a----         8/29/2023   4:30 PM          32768 satellites.db
-a----          9/1/2023   5:29 AM          32768 satellites.db-shm
-a----          9/1/2023   5:29 AM              0 satellites.db-wal
-a----         8/29/2023   3:10 PM          24576 secret.db
-a----         12/6/2020   6:08 PM             32 storage-dir-verification
-a----          9/1/2023   3:02 AM         851968 storage_usage.db
-a----         8/29/2023   3:10 PM          20480 used_serial.db
-a----         5/19/2020  10:37 PM      334069760 used_serial.db.bak
-a----         9/21/2021  12:27 AM              0 write-test428211014

in my case it has a name write-test428211014 (I left this for examples, usually this file is deleted after the test).
For readability check it uses storage-dir-verification file, it contains a NodeID. This file also used to check that the identity is matched the data.

So, in result the node has been disqualified for losing data. I’m sorry, but reinstatement is not possible in this case, because there is no guarantee, that data is intact, it requires to execute audit for every single piece on your node for the same price as repair, but in case of repair it may never happen (the unhealthy threshold is not reached or the customer deleted their data). It also would be dedicated to only one node from 20k, too expensive.

AndMetal:

Is there any documentation on the different configuration options? I did some digging both on the Storj site (support, documentation, etc) and the Github project but didn’t really find anything useful other than a handful of command line options. I did notice this in my config file:
# directory to store databases. if empty, uses data path
# storage2.database-dir: ""

BrightSilence · September 1, 2023, 6:38am

If it has lost all data, but the HDD is still responding fast. Yes. I don’t see an issue with that really.

AndMetal · September 1, 2023, 11:39am

That’s not checking if their is a mount issue, that’s checking to see if you can read/write to 1 specific file in 1 specific location.

Obviously not, otherwise the node would have terminated. The difference is checking reading and writing to a specific location vs consistent reading of normal data (not just random data, but what it’s being asked to get). What I’m suggesting is if the node attempts to read 50 files and it fails to read all 50 files (I/O error, empty, missing, etc) and it was fine before then it should probably stop because something is wrong. The exact amounts, times/durations, etc, should be tuned or configurable so that a node doesn’t just drop because of heavy I/O, but if I can identify this by parsing through the log file then why on earth wouldn’t we be doing this as part of the safety net in the node? Just track attempts vs failures for the past few minutes cycling out the old data to reduce overhead, and if the % is 100% for X amount of time then ABORT MISSION, THE SHIP IS SINKING.

BrightSilence · September 1, 2023, 11:52am

Yes, in the storage location… where the data is.

It’s no use discussing this any further. It seems you are unwilling to accept that there is a perfectly suitable solution that has worked for anyone but you. And the reason it didn’t work for you is because you made some dubious choices in your setup. I’ve mentioned before that accepting any loss of actual data in the way you suggest would open the door to abuse, yet you suggest to work around actual data loss yet again.

I feel for you, I do. It completely sucks. But you’re arguing a losing battle if you expect changes to be made for a one off strange setup causing an issue that wouldn’t be a problem for anyone else. Especially since the reason you used that setup was to do something that the software has a native function for. Time to just accept your mistake and start over. And in the future, if you think you have a clever solution for something, use this forum account and post it. You’ll get plenty of feedback from community and Storj on how to do those things the right way and what risks would be involved if you do them the wrong way. Even if the read/write check wouldn’t have caused an issue, you completely added drive M as an unnecessary failure surface for your node as it would have been critical to the functioning of your node, even though it holds almost no data. And if I understand it correctly, the same goes for how you mounted drive J. While the native implementation to move db’s ensures only none critical data is moved to the other location, your setup would move critical data structures as well. So instead of 1 critical storage drive, your setup had 3. And that’s in addition to obfuscating the actual hardware setup underneath, making it impossible for the node to judge whether your data drive is online or not. The setup was bad all around, this is just the first issue you ran into.

ACarneiro · September 1, 2023, 11:52am

That check is sufficient if your setup is as advised.
I doubt that it would be practicable to check every directory or file for readability every so many hours, the resource usage would be prohibitive.

ACarneiro · September 1, 2023, 12:00pm

I will just finish by saying that I really do hope this doesn’t put you off this project.
Losing 3 years of reputation and data must be a punch to the gonads and if I were in your position I don’t know whether I’d feel like I could be arsed to start from scratch but you’re clearly a very knowledgeable guy/gal and I really enjoyed reading your forensic analysis of what went wrong.

Hope you have a nice and peaceful weekend

Toyoo · September 1, 2023, 12:36pm

This check is sufficient for the current network, so I doubt Storj Inc. will spend their engineers time on inventing something more complex. But you are free to implement something better on your own, the code’s open source.

AndMetal · September 1, 2023, 12:40pm

Please show me where in Storage Node Getting Started - Storj Docs that anything other than 8TB of space is “advised” (like only using 1 drive which has been repeated here multiple times, so presumably no RAID). I’m getting the feeling that all of this is coming from threads buried in the forums, and not actually being communicated via the documentation. And because I haven’t spent time on the forums over the past several years that puts me at 100% fault for not reacting to an issue in less than an hour when the same setup has run fine years prior to this, and the data is 100% intact? Maybe I’m reaching but it kinda sounds like victim blaming when previous posts from multiple others in this same thread agree that 4 hours is not enough time to react to a recoverable issue, let alone the ~1 hour in my case. $8/month is not enough to quit my day job and babysit a single process 24/7.

I’m not saying check every directory or file. Even the filewalker upon startup takes hours on my node with 100% disk utilization (way too many IOPS for a HDD). Just getting a count of files or directories takes several minutes in PowerShell. What I am saying is if the node tries to read or write data because it is downloading new data, serving existing data, or responding to an audit check (which already generates errors in the log, and others have created their own scripts to catch these and terminate the process when they happen), and it starts happening 100% of the time (100% errors vs attempts over X minutes) then obviously something is wrong and the node should terminate. For some reason it was decided that instead of using this information that the node already has we should create a separate check to see if we can read and write to 1 or 2 specific files in a specific location, and if that works then that means everything else is fine. Maybe this is just my frustration showing, but to me that seems hackish and short sighted. It shouldn’t matter what the underlying storage strategy is (with commercial it will likely be SAN, in my case this server started as part of my overall cryptocurrency mining setup which I’ve been running as a sole proprietorship and eventually turned into more of a NAS with several applications running locally, some running on a separate Proxmox node running Docker in an LXC). If the node can’t read or write that’s a problem, and the solution shouldn’t be to continue to fail audits until the node is disqualified. If the data is truly gone, either in whole or in part, that will be caught with the existing suspension and audit mechanisms (either continues to fail audits when brought back up until it’s disqualified, or the node is just left offline until the suspension runs through and the node is disqualified).

I appreciate the sentiment. Truly. Honestly I don’t know what I’ll do from here. I may submit a support ticket like SGC suggested just to see where it goes. While Alexey is obviously a public face for the company, that doesn’t mean they’re the only face, and maybe the sentiment shared here isn’t necessarily the same as the company as a whole (I see this a LOT in other industries, including my own, and I’m usually one of the ones who get to dig into the issue from a holistic perspective to identify and propose fixes for whatever might be broken). While I’ve been just holding all of my STORJ tokens over the years (mainly because it’s stuck in a wallet with no Ethereum to send it anywhere, and Ethereum transaction fees are still really high), the amount I’ve made at least justified most of the electric cost from running the node. If I have to start from scratch, making a few pennies a month for the next year or two, I’m not sure it’s worth it. After electric prices went up in September of last year, combined with Ethereum moving to Proof of Stake, this is really the only thing left going as part of my “mining” business (Storj was never a major part of it, but was a way to utilize some fairly idle hardware). That means I’ll probably have to close that out with the IRS, which sucks because I’ve still got a few years worth of stuff to depreciate.

I’ll do my best. Honestly this has been causing me a bit of stress and lack of sleep, which is silly because it’s such a relatively insignificant thing for me. I think not being able to post last evening because I made too many replies as a new member might have been contributing to that. In any case I hope you enjoy the weekend as well, and if you’re in the US I hope you get to enjoy the holiday weekend.

Toyoo · September 1, 2023, 1:10pm

That might actually be coming from storage nodes as well. My nodes spent on file walker about 8 minutes per terabyte of blobs—at least the last time I measured it. Not being a Windows user I do not have practical experience with storage spaces, but if there is one thing common to people describing their experiences, is that storage spaces are slow compared to regular RAIDs or, sometimes even just single drives. Whether write caching actually helps with that, no idea, probably not with the file walker.

In my area used drives go well on the secondhand market. It’s crazy how I see deals for a 3-year-old drives at 80% of their initial value. Not trying to discourage you from Storj, but I sleep better buying hard disks for Storj knowing that I can recover most of the investment pretty easily.

I agree with this sentiment. I hoped that documentation will improve, suggested what could be added in the past. I gave up though. Documentation was not developed in the open, so I could not just send PRs. I do see there is a repository for that now though, so maybe that part will get better.