Disqualified after 3 years

AndMetal · August 30, 2023, 11:53pm

Well this is awkward. I started up a node just over 3 years ago and after a Storage Spaces drive briefly dropped today I’m now disqualified on all satellites.

Earlier this afternoon I got an Email that I was disqualified on the AP1 satellite. I checked my server and it appears the Storage Spaces drive I have dedicated for Storj (mirrored array with some SSD write cache) went offline. Brought it back online no problem, data is still there, dashboard showed no issues (suspend and audit at 100%, uptime in the 99.8% ballpark). AP1 was only a fraction of my storage so I figure oh well, maybe I’ll set up a second node at some point to get AP1 back. I just checked the dashboard and now all satellites are disqualified with a 95.98% audit and 100% suspension. I didn’t restart the node after bringing the disk back online so I’m thinking Storj wasn’t seeing the disk even though it was back.

It’s my understanding that once you’re disqualified that’s it, there’s nothing anybody can do to remedy that even if all the data is there and intact (in my case about 6.6TB). What should I do now? Do I start a new node from scratch? Is there anything I can do with the existing data or is it trash now even though it’s intact? Is graceful exit even an option and does it make sense to do that?

I’m also concerned about why this happened. Could it have something to do with how I have my file system structured? Specifically I have a “mount” drive with junctions to different arrays (striped, parity, mirrored, differing levels of cache, etc). I then have Storj pointed to a directory on drive M (the mount drive) with a junction to drive J (mirrored array with a large NVME tier and the database files pinned to the NVME) for the node_storage directory, and another junction to drive H (another mirrored array without tiering) for the blobs and such. I’d hate to set up a new node on the same system that has been running smoothly for years just to end up in the same situation down the line.

SGC · August 31, 2023, 7:58am

you should make a support ticket and try to get your node restored.
there are exceptions to the general rule that once you’re DQ’d it’s over.

it’s just that those cases are generally so rare, that most people don’t fall in that category.
good luck

Alexey · August 31, 2023, 8:17am

Hello @AndMetal ,
Welcome to the forum!

If the node is disqualified, it’s permanent. Since it’s disqualified on all satellites, you may only generate a new identity, sign it with a new authorization token and start from scratch.

@SGC no need to file a support ticket, there are no known bugs, so we would be unable to reinstate this node.
Reinstating a node is always a risk to the customers’ data. In this case it’s a software/hardware issue, not a bug.

@AndMetal Please do not use junctions, especially for different drives.
I wonder why your node did not stop if it could not pass the writeability/readability check. Did you change these timeouts?

Toyoo · August 31, 2023, 8:44am

Storage node code has a safety feature where it attempts to write (and read) to the storage directory to make sure that it’s still operational. If the attempt fails, the node shuts down, so that the node won’t be disqualified.

Your junctions have defeated this safety feature. Your storage directory was operational, but not the blobs subdirectory.
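To make the point above concrete, here is a minimal sketch of a write-then-read probe. It is not the storage node's actual code; the function names, the probe file name, and the idea of probing each subdirectory separately are assumptions made for illustration. It shows how a probe of the storage root can pass while a subdirectory that lives elsewhere is unavailable.

```python
import os
import uuid


def dir_is_usable(path: str) -> bool:
    """Write a small probe file into `path`, read it back, then delete it."""
    probe = os.path.join(path, f".probe-{uuid.uuid4().hex}")  # hypothetical name
    payload = b"storage-probe"
    try:
        with open(probe, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the write past the OS buffer
        with open(probe, "rb") as f:
            ok = f.read() == payload
        os.remove(probe)
        return ok
    except OSError:
        # Missing directory, offline volume, permission problem, etc.
        return False


def storage_is_usable(storage_root: str, subdirs=("blobs", "temp", "trash")) -> bool:
    """Probe the root and each data subdirectory separately (an assumption,
    not how the node is documented to behave)."""
    paths = [storage_root] + [os.path.join(storage_root, d) for d in subdirs]
    return all(dir_is_usable(p) for p in paths)
```

With a layout like the one described in the first post, `dir_is_usable(storage_root)` can keep returning `True` while `storage_is_usable(storage_root)` returns `False`, because only the second call touches the directory whose backing drive went offline.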

BTW, I really wonder how your setup was even operational before the disqualification if the temp directory was on a different file system from the blobs directory. Storage node uses a move operation between these two directories, which usually fails between different file systems unless a full file copy is explicitly done.
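The move problem can be sketched as follows. This is an illustration, not the storage node's implementation: `same_filesystem` and `move_piece` are hypothetical helpers. A plain rename is atomic but only works within one file system; across file systems it fails with `EXDEV` on POSIX systems, and a full copy followed by a delete is needed instead.

```python
import errno
import os
import shutil


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Compare the device IDs that the OS reports for two existing paths."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def move_piece(src: str, dst: str) -> str:
    """Try an atomic rename; fall back to copy-and-delete across file systems."""
    try:
        os.replace(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # some other failure, e.g. the target volume is offline
        shutil.move(src, dst)  # copies the data, then removes the source
        return "copied"
```

Checking `same_filesystem(temp_dir, blobs_dir)` before starting a node would flag a setup where the temp and blobs directories sit on different volumes.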

Vadim · August 31, 2023, 8:48am

It could be because he used a cache, and the writability check was written to the cache.

SGC · August 31, 2023, 9:30am

If one of my nodes died through no fault of my own after years of operation,
I would be kicking and screaming to get it reinstated…

ofc the issue is that most people think their nodes were unjustly DQ’d, when in fact it is usually the case that the node was bad.

i can’t say what happened, but if he truly thinks his node was DQ and it shouldn’t have been, then he should fight to get the issue addressed…

even if he might not get his node back… else issues that exist might never be solved.

AndMetal · August 31, 2023, 9:36am

Interestingly I experimented a while back with putting the temp directory on the same NVME-backed mirrored array as the database files. The result was the data would download fine but wouldn’t move to the blobs. So I was still able to serve data that had already been received, but wasn’t adding any new data. This didn’t affect my suspension or audit scores, just my ability to increase the amount of data I had. However from what I can tell the garbage directory doesn’t seem to be affected by this. At the time of disqualification the blobs, temp, and trash directories were all pointing to the same drive and had been for a while (maybe a year or more?).

AndMetal · August 31, 2023, 9:59am

Not that I’m aware of. These are the only values in the config file that were changed from their defaults:

console.address
contact.external-address
filestore.write-buffer-size
identity.cert-path
identity.key-path
log.level
log.output
operator.email
operator.wallet
server.address
server.private-address
storage.allocated-disk-space
storage.path

Could there be an issue with how the writability/readability checks are done? I haven’t dissected the log file yet (one of my biggest complaints is Storj doesn’t do any splitting/rotating of the log file, so since I don’t actively monitor it and it’s set to info it’s sitting at about 13.4GB since I last wiped it in March), but I would expect an attempt to read the blobs directory to either fail or return an empty directory listing. Is that not sufficient for the node to stop to prevent itself from being disqualified? I’ve occasionally had some issues after a reboot where either a Storage Spaces drive doesn’t come up after a reboot or there is enough of a delay that programs are running before they come up, and in those situations (where node_storage wouldn’t have been available) the node appears to have stopped and I’ve received offline notification Emails which then prompt me to investigate, assuming I haven’t caught it already.

AndMetal · August 31, 2023, 10:07am

In this specific scenario the blobs, temp, and trash directories were all pointed to the same mirrored array, although that specific one doesn’t use tiering with the NVME drives, just some write cache assigned by Storage Spaces by default (something like 10-20GB?). Since the array was offline I would expect reads and writes to any of those 3 directories to fail, with reads either resulting in some sort of error or just showing an empty directory structure. I did experiment a while back with having the temp directory pointed to a different array that utilized tiering with NVME drives in the SSD tier, but those pieces did fail to move to the blobs directory (no effect on suspension or audits, still able to serve existing data, just wasn’t able to receive new data) so I abandoned that.

AndMetal · August 31, 2023, 10:23am

That was kind of my initial reaction, but I’m trying to keep a level head and stay logical and reasonable about this. The worst part is my node isn’t even dead, the node itself is online, all the data is still there, the underlying drives are good, and I even still have data in the temp directory. If audits were done now I’m sure they would pass with flying colors. I mean I understand why the systems are in place, but I can’t help but feel they didn’t work the way they should have in this scenario.

BrightSilence · August 31, 2023, 11:35am

@Toyoo explained why the check didn’t work here.

The read/writeability check uses a file in a folder higher up from the blobs folder. Since your blobs were offline, but this folder wasn’t, the node assumed your data was in place and kept running. If you want to move databases to an SSD, there is a built-in function for that. You don’t need to (and shouldn’t) use junctions. By using junctions, you’ve obfuscated the actual HDD hardware underneath and split up things the node assumes are on the same HDD/array. If you had used the built-in feature to move just the DBs to the SSD, this wouldn’t have happened.
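As a rough way to spot this kind of layout before it causes trouble, the sketch below lists subdirectories whose real location is not inside the storage folder. It is only an illustration: `redirected_subdirs` and the subdirectory names passed to it are assumptions, and it relies on `os.path.realpath`, which resolves symbolic links and (on Windows, Python 3.8 or later) junctions.

```python
import os


def redirected_subdirs(storage_root: str, subdirs=("blobs", "temp", "trash")):
    """Return the subdirectories that resolve to somewhere outside storage_root."""
    real_root = os.path.realpath(storage_root)
    redirected = []
    for name in subdirs:
        # Where the directory would be if nothing redirected it.
        expected = os.path.join(real_root, name)
        # Where it actually resolves to after following links/junctions.
        actual = os.path.realpath(os.path.join(storage_root, name))
        if os.path.normcase(actual) != os.path.normcase(expected):
            redirected.append(name)
    return redirected
```

An empty result means every listed subdirectory lives directly under the storage folder, which is the layout the read/writeability check assumes.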

I’m sorry to say your setup was the culprit for this disqualification, so there doesn’t seem to be any grounds to reinstate your node. This is exclusively for bugs on Storj’s end, which is not the case here. In general there is no use promoting that option to begin with as those bugs are very rare and it’s usually something caused by the node operator or their setup, if not failing hardware.

AndMetal · August 31, 2023, 12:52pm

Is there any documentation on the different configuration options? I did some digging both on the Storj site (support, documentation, etc) and the Github project but didn’t really find anything useful other than a handful of command line options. I did notice this in my config file:

# directory to store databases. if empty, uses data path
# storage2.database-dir: ""

BrightSilence · August 31, 2023, 1:12pm

Yes, that’s the one.

You would have to first stop your node. Change the setting to point to a location on your SSD. Copy the .db files to that location (I recommend copy so that the original is still there in case something goes wrong during your initial setup). And then start the node again.

After that check that it’s actually using the db files in the new location, by looking at last access times or seeing the .db-wal files while the node is running. If so, you can remove the db files from the old location.
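The copy and check steps above could be scripted roughly like this. It is a sketch under stated assumptions: the helper names are made up, the node must already be stopped, and `storage2.database-dir` still has to be set by hand in the config file to the new location.

```python
import filecmp
import glob
import os
import shutil


def copy_databases(old_dir: str, new_dir: str):
    """Copy every *.db file from old_dir to new_dir, keeping the originals."""
    os.makedirs(new_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(os.path.join(old_dir, "*.db"))):
        dst = os.path.join(new_dir, os.path.basename(src))
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        # Byte-for-byte comparison so a bad copy is caught before the node starts.
        if not filecmp.cmp(src, dst, shallow=False):
            raise IOError(f"copy of {src} does not match the original")
        copied.append(dst)
    return copied


def wal_files(db_dir: str):
    """List *.db-wal files; seeing them in the new location while the node
    runs suggests the node is using the databases there."""
    return sorted(glob.glob(os.path.join(db_dir, "*.db-wal")))
```

Run `copy_databases` while the node is stopped, start the node, then call `wal_files` on the new directory. Only remove the old `.db` files once the new location is clearly in use.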

As for documentation, it can be a little limited. But you can look through the config file, and if you have questions, search the forum or the Storj knowledge base for further info. If you can’t find what you’re looking for, just ask in a new thread. The community as well as Storj are pretty responsive and helpful.

AndMetal · August 31, 2023, 2:04pm

I’ve had a brief opportunity to parse through some of the log file, and there are some interesting things worth noting. First a timeline:

13:51: Disk went offline according to Event Viewer. Storj node log starts showing empty new lines, prior to that it was successfully uploading and downloading pieces.
15:04: Received Email that AP1 was disqualified.
15:04: Received Email that EU1 was disqualified.
15:15: Received Email that Saltlake was disqualified.
15:50: Received Email that US1 was disqualified.
15:55-16:15: I saw the AP1 Email, checked the server, and brought the disk back online. Dashboard was still showing 100% suspension & audit for all satellites (presumably due to the default value of nodestats.reputation-sync which appears to be 4 hours).
19:03: I manually restarted the node after seeing H drive usage was at 0% utilization, dashboard started showing a warning that the other satellites were disqualified and the audits are ~96%.

For reference, here is the reputation information from the log after the reboot when the satellites showed as disqualified:

2023-08-30T19:07:53-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Total Audits": 412461, "Successful Audits": 394329, "Audit Score": 1, "Online Score": 0.9981712725060272, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:53-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Total Audits": 953073, "Successful Audits": 927705, "Audit Score": 0.959809440525067, "Online Score": 0.9989558320629611, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:54-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Total Audits": 1297653, "Successful Audits": 1269795, "Audit Score": 0.9598094405250738, "Online Score": 0.9988364383003573, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:54-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Total Audits": 1765957, "Successful Audits": 1728037, "Audit Score": 0.9598094405249591, "Online Score": 0.9981104464364042, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:54-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Total Audits": 1386690, "Successful Audits": 1353832, "Audit Score": 0.9598094405250738, "Online Score": 0.9982045692256415, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T19:07:55-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Total Audits": 1979614, "Successful Audits": 1936861, "Audit Score": 1, "Online Score": 0.9958238630652381, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}

and about 20 minutes before the disk went offline:

2023-08-30T13:32:07-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Total Audits": 412461, "Successful Audits": 394329, "Audit Score": 1, "Online Score": 0.9981712725060272, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:07-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Total Audits": 953023, "Successful Audits": 927696, "Audit Score": 0.9999999999999925, "Online Score": 0.9989558320629611, "Suspension Score": 1, "Audit Score Delta": 0.0000000000000008881784197001252, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:07-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Total Audits": 1297593, "Successful Audits": 1269777, "Audit Score": 1, "Online Score": 0.9988364383003573, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:08-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Total Audits": 1765905, "Successful Audits": 1728026, "Audit Score": 0.9999999999998772, "Online Score": 0.9981104464364042, "Suspension Score": 1, "Audit Score Delta": 0.000000000000012989609388114332, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:08-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Total Audits": 1386632, "Successful Audits": 1353815, "Audit Score": 1, "Online Score": 0.9982045692256415, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2023-08-30T13:32:08-04:00	INFO	reputation:service	node scores updated	{"Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Total Audits": 1979614, "Successful Audits": 1936861, "Audit Score": 1, "Online Score": 0.9958238630652381, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}

So it looks like it failed ~40 audits (presumably in a row; total audits were 50-60 per satellite during that timeframe) over the course of an hour or so before being disqualified. Is that a reasonable expectation of time to react to and fix a correctable, non-data-compromising issue? Why shouldn’t this result in a suspension and notification first? Or at a minimum have the node go offline if it is failing multiple audits in a row to help save itself from being disqualified before it can be addressed by the operator?
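For anyone who wants to check the arithmetic, here is a small sketch using the "Total Audits" and "Successful Audits" values copied from the two sets of log lines above. The second part is an assumption, not something stated in this thread: it models the audit score as dropping by a factor of 0.999 per consecutive failed audit, with disqualification below 0.96.

```python
# (Total Audits, Successful Audits) at 13:32 and at 19:07, from the log lines above.
BEFORE = {
    "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE": (953023, 927696),
    "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6": (1297593, 1269777),
    "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S": (1765905, 1728026),
    "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs": (1386632, 1353815),
}
AFTER = {
    "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE": (953073, 927705),
    "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6": (1297653, 1269795),
    "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S": (1765957, 1728037),
    "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs": (1386690, 1353832),
}


def audit_deltas(before, after):
    """Return (new audits, new audits not counted as successful)."""
    d_total = after[0] - before[0]
    d_ok = after[1] - before[1]
    return d_total, d_total - d_ok


def failures_until_below(threshold=0.96, factor=0.999):
    """ASSUMED model: each consecutive failure multiplies the score by `factor`.
    Returns how many failures take a perfect score below `threshold`."""
    n, score = 0, 1.0
    while score >= threshold:
        n += 1
        score *= factor
    return n


for sat in BEFORE:
    total, failed = audit_deltas(BEFORE[sat], AFTER[sat])
    print(f"{sat[:8]}: {total} audits, {failed} not successful")

print("failures to drop below 0.96 (assumed model):", failures_until_below())
print("0.999 ** 41 =", 0.999 ** 41)
```

The log deltas give 41 or 42 non-successful audits per disqualified satellite. Under the assumed model, 41 consecutive failures give a score of about 0.95981, which lines up with the 0.9598 audit score in the post-restart log lines.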

jammerdan · August 31, 2023, 2:36pm

This seems to be working exactly as intended:

littleskunk · August 31, 2023, 3:58pm

Yes it is the expected behavior in this case.

Because you disabled that feature on your end. The read and write checks would have terminated your node while the score was still good and would also prevent any restart until the root cause has been fixed. This would have triggered an offline email after a short time. I believe there are even reminder emails if the first one gets ignored. After a few days offline, suspension kicks in. All that didn’t work in your case because you effectively disabled that check.

Balage76 · August 31, 2023, 5:56pm

I had the same kind of issue in the past, but it seems that I was lucky and the node was not disqualified.
My case was like this: I had 2 nodes running on a Windows machine. Both nodes’ db and log files were stored on the system SSD, while the first node’s data folder was on an internal hard drive and the second node’s data folder was on a USB HDD.
It happened two times that the USB HDD somehow disconnected, not physically as it was still spinning, but it was not reachable by the OS. Both times the node’s log file was full of read/write and audit errors and the audit score was close to 60%. If I remember correctly, my node had the problem for about 4-5 hours before I noticed it.
As soon as I powered the PC off and on again, the node started normally and within 1-2 days the audit score went back to 100%.

So in my case the node was somehow still running, as it was communicating with the satellites and I was able to load the dashboard, but the data was not available.
Maybe the same happened here as well.

BrightSilence · August 31, 2023, 6:12pm

Must have been before the read/write check was implemented then. That should no longer be possible unless you have a complicated junction system like OP.

AndMetal · August 31, 2023, 6:25pm

What I’m gathering from some other posts linked to that (Tuning audit scoring; Disqualified. Could you please help to figure out why?; Put the node into an offline/suspended state when audits are failing) is that 4 hours is unacceptable, let alone 1-2, and for cases like these operators should have days if not weeks to identify or be notified of an issue and fix it, assuming it’s just a semi-hung system or loose cable and not true data loss. It also sounds like well-established nodes are more likely to run into this because more data = more audits. But it also seems like these concerns have gone nowhere 2 years later.

jammerdan · August 31, 2023, 6:51pm

When the node has a defect, a couple of hours to solve it is not enough. I have always said that. People need to sleep, they have jobs, family, vacations. So yes, I don’t think getting disqualified within hours is how it should be.