Disqualified after 3 years

Can confirm. Out of the three nodes, only one gets offline notifications.

I’ve looked at the config files, and in the one that works, email is configured like so:

# operator email address
operator.email: "name@domain.com"

And in the ones that don’t, like so:

# operator email address
operator.email: name@domain.com

Note: the email is not in quotes.

With such a limited sample size it’s perhaps a coincidence, but maybe not?

I’m going to wrap the email in quotes, take the node offline, and check.

Edit: reviewing the source, that should not make any difference.

Edit: yep, in YAML strings don’t need to be quoted.
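
For anyone curious, the YAML point is easy to verify outside the node. A minimal Go sketch (using gopkg.in/yaml.v3, not the node’s actual config loader) shows both forms decode to the same string:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Minimal sketch: a plain YAML scalar and a double-quoted one
// decode to the same Go string.
type config struct {
	Email string `yaml:"operator.email"`
}

func main() {
	quoted := []byte(`operator.email: "name@domain.com"`)
	unquoted := []byte(`operator.email: name@domain.com`)

	var a, b config
	if err := yaml.Unmarshal(quoted, &a); err != nil {
		panic(err)
	}
	if err := yaml.Unmarshal(unquoted, &b); err != nil {
		panic(err)
	}
	fmt.Println(a.Email == b.Email) // true
}
```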

@MattJE96011
The email comes hours after the node goes and stays offline, not within 5 minutes like with Uptime Robot. My run command looks like this, no config parameters used:
https://forum.storj.io/t/my-docker-run-commands-for-multinodes-on-synology-nas/22034/7?u=snorkel

Well, it’s not often I have a node go offline for too long, but it has happened and I still never got anything. Like I said though, if it takes hours I suppose it’s not really a big deal anyway, so I’m not too worried about it. And I use Uptime Kuma. Even faster than 5 minutes and far more configurable.

You would think, but that hasn’t been my experience. During normal operation, without the file walker running, array utilization is around 5-20% with read/write speeds varying, but I believe usually sitting in the 1-2 MB/s ballpark. When the file walker is running, disk utilization is 100% and read/write speeds are in the 0.4-0.7 MB/s ballpark. From what I can recall from looking at disk usage in Process Monitor, there were usually dozens of files on the array being accessed simultaneously, but I can’t say if that’s from the file walker (presumably not if it’s single threaded), serving of the data, or something else. I’m not sure I ever monitored those numbers after the implementation of the lazy file walker, so I can’t say how that has changed with the new implementation.

It’s single threaded, but not throttled (nor should it be), hence it will go as fast as possible. On my nodes it starts at around 4000 IOPS from the metadata device with a single CPU core fully saturated, with IOPS dropping as metadata is ingested into the cache.

That’s because the disk has no time to transfer data; it spends all its time seeking. Even two randomly accessed files will decimate the throughput. This is why attempting to use a single mechanical HDD to access millions of small files and expecting any non-trivial performance is madness.

At least use some block caching solution, then have the file walker run on start to pre-warm the metadata into the cache. This offloads a huge amount of IO from the disk, and during normal operation it will, among other things, improve time to first byte by eliminating metadata lookup seeks.
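
To illustrate the pre-warming idea, here is a rough sketch (not Storj’s file walker; the blobs path is just an example): stat-ing every piece once is enough to pull the metadata into whatever cache sits in front of the disk.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	root := "/mnt/storagenode/storage/blobs" // adjust to your setup

	var count int
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			// An explicit Stat forces the inode read, warming the cache.
			if _, statErr := os.Stat(path); statErr == nil {
				count++
			}
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		os.Exit(1)
	}
	fmt.Println("warmed metadata for", count, "pieces")
}
```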

Interesting, as this is a different experience from mine. I used to have my desktop set up as RAID1 with mdraid. Maybe there’s another factor in your setup, then.

I agree that the whole storage directory structure should be writeable and readable, but that’s not what the read/write check is doing. It checks that 2 files in the root, and only the root, of the storage directory are readable/writable, nothing else. Of course it’s unrealistic to expect every single directory and file to be checked all the time, there are just too many. Maybe it could be extended to check more than it does today, but where does that end? The satellite folders? The folders under that? My point is the node is already indirectly doing this checking as part of its normal function, but it’s not doing anything other than spitting a line to the log file when there’s a problem. Writing generally goes to the temp directory (new data coming in), reading generally comes from the blobs (stored data), and of course there’s the moving of data between directories. So if the node tries to write something to the temp directory and is failing 100% of the time, there is an issue with writes (the drive has failed, the drive has become disconnected, permissions were changed, etc). If the node tries to read something under the blobs directory and is failing 100% of the time, there is an issue with reads. No reading or writing of arbitrary files in arbitrary directories needed.
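
For context, a root-level read/write probe boils down to something like the following sketch; the file name and path here are illustrative, not the node’s actual check files.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// probeStorageDir is a minimal sketch of a root-level read/write probe.
func probeStorageDir(root string) error {
	probe := filepath.Join(root, "write-probe.tmp") // illustrative name

	// Writability: create a small file in the storage root.
	if err := os.WriteFile(probe, []byte("ok"), 0o644); err != nil {
		return fmt.Errorf("storage dir not writable: %w", err)
	}
	defer os.Remove(probe)

	// Readability: read it back and compare.
	data, err := os.ReadFile(probe)
	if err != nil || string(data) != "ok" {
		return fmt.Errorf("storage dir not readable: %v", err)
	}
	return nil
}

func main() {
	if err := probeStorageDir("/mnt/storagenode/storage"); err != nil { // path is an example
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("storage root passes the probe")
}
```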

I think all this suggests is that, if an improved mechanism were implemented (X% failures over Y time), it should be configurable by the user. For example, default to 100% failure over 2 minutes, but if the user has a setup that could cause a situation where some of the data becomes unavailable, they could decide what those thresholds should be. The point of these checks should be to help node operators know when there is a problem so that they can try to address it, but if it’s something they either choose not to address or that can’t be addressed (data loss), then the existing suspension and audit mechanisms will lead to disqualification, as they should.
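
As a sketch of what such user-configurable thresholds could look like (none of these knobs exist in the node today; they are purely hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// FailurePolicy is a hypothetical set of operator-configurable knobs.
type FailurePolicy struct {
	Window      time.Duration // how much history the counts are collected over
	MaxFailRate float64       // fraction of failed operations that triggers shutdown
	MinSamples  int           // don't decide on a handful of events
}

// ShouldShutdown applies the policy to counts collected over the window.
func (p FailurePolicy) ShouldShutdown(failures, total int) bool {
	if total < p.MinSamples {
		return false
	}
	return float64(failures)/float64(total) >= p.MaxFailRate
}

func main() {
	policy := FailurePolicy{Window: 2 * time.Minute, MaxFailRate: 1.0, MinSamples: 10}
	fmt.Println(policy.ShouldShutdown(10, 10)) // true: 100% failures over the window
	fmt.Println(policy.ShouldShutdown(3, 10))  // false: 30% failure rate
}
```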

Admittedly this is getting a little bit outside my area of knowledge, but my high-level understanding is that something like older NFS might not support exclusive file locks in the same way newer versions of NFS or CIFS/SMB would, but it seems like a bit of a stretch to me to declare that all remote storage options (other than iSCSI) are a non-starter. I believe it can also depend on exactly how a share is connected, for example the options passed to the mount driver on a Linux system. While remote shares are not currently officially tested/supported, I don’t think that means they should be outright excluded across the board. Of course baselines like latency, throughput, and IOPS would need to be established to qualify the use of certain implementations (both which type of share to use as well as the connection between the node and the remote storage).

I agree, there is no point in backing up Storj data. I can’t remember exactly when, but a few years ago I was rearranging my storage (I might have been playing around with different NTFS block sizes, which requires making a new virtual disk), so I used rclone to copy all of the Storj data to the new location while the node was still running, which took a few days. I then took the node offline and did a second rclone for any incremental changes, which took another few days. I want to say it was for 2-3 TB of data at the time, maybe 8-12M files. It might be possible to do in a reasonable time with SSD storage, but to your point you’d be missing anything that’s changed since the last backup, so it’s much more likely that the backup data has gaps or isn’t current with any updates. Of course some sort of RAID with redundancy (RAID1, RAID5/6, some of the ZFS implementations, maybe unRAID’s version of parity) wouldn’t be a backup; it would be to prevent losing the node in case of a catastrophic disk failure.

What about other scenarios we’re not thinking about or aware of? I believe I read at least one example where someone didn’t get the subfolder permissions right after moving their Storj data around which caused them to be disqualified. The current check wouldn’t catch it, but if we terminate on large numbers of recent read/write failures that would have been caught. The node operator would be able to fix the permissions and bring the node back up without any data loss.

While that’s a good step in the right direction for any new users setting up a node, I think what we really need is something that outlines acceptable (“one drive, one node”) vs unacceptable setups, as well as the risks to both the node operator (failing audits leading to disqualification) and the network (users having slower access to their data, etc). I would imagine this would be referenced in the requirements, but linked to one or more separate pages that go into detail. By leaving this information hidden inside the forums and the minds of the engineers at Storj we are setting node operators up for failure both in the short term and long term.

What do you mean here?

“Not supported” means “not supported”. You do it on your own responsibility, and if it fails because of a networked file system, the node will not be reinstated. I’ve briefly operated an SMB-based storage node and I know it can work, but I had to do additional engineering and it wouldn’t be Storj Inc.'s fault if I made a mistake doing so.

Storj Inc. does not test this configuration, does not promise it will be reliable, and will not reinstate nodes disqualified while running on SMB.

The node is constantly reading, writing, and moving data (temp, blobs, trash, etc) and it knows whether it was successful in doing so or not. If one of those operations fails it generates an error in the log file, but that’s all it does (aside from not sending data out because it couldn’t read it, etc). Instead of looking at those already generated errors to identify that there’s an issue reading/writing (which some users have created their own scripts to monitor via the log file), it was decided to create a separate check to see if 2 specific files in specific locations can be read/written to determine if the node should terminate for read/write issues.

I haven’t looked at the code closely enough, but the idea would be to have a separate thread create a hook into the logger to grab attempts (requests for data, audits) and failures (read/write errors), store a few minutes of history, and at certain intervals calculate the failure rate. If the failure rate goes from 0 or a small percentage to 100%, then terminate the node process. If we can’t hook into the logging process, then we would need to send those events to a queue that could be consumed to do the same thing. If done correctly, overhead should be minimal and the existing read/write check would become redundant except at start-up of the node, while also catching more scenarios, potentially including scenarios we’re not even contemplating.
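
Here is a sketch of the log-hook idea using zap’s Hooks option (the storagenode logs via zap, but this is not its code; treating every entry as an attempt and every error-level entry as a failure is a crude stand-in for recognising real upload/download/audit events):

```go
package main

import (
	"fmt"
	"sync"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// counter keeps a window of attempt/failure counts fed by a logger hook.
type counter struct {
	mu       sync.Mutex
	attempts int
	failures int
}

// hook is called for every log entry; error-level entries count as failures.
func (c *counter) hook(e zapcore.Entry) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.attempts++
	if e.Level >= zapcore.ErrorLevel {
		c.failures++
	}
	return nil
}

// rate returns the failure fraction for the current window and resets it.
func (c *counter) rate() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.attempts == 0 {
		return 0
	}
	r := float64(c.failures) / float64(c.attempts)
	c.attempts, c.failures = 0, 0
	return r
}

func main() {
	c := &counter{}
	logger, err := zap.NewProduction(zap.Hooks(c.hook))
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	logger.Info("download started")
	logger.Error("download failed: input/output error")

	// In a real node this would run on a ticker and trigger a graceful
	// shutdown when the rate stays at 1.0 for the whole window.
	fmt.Printf("failure rate over window: %.2f\n", c.rate())
}
```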

Just to be clear, I’m not using SMB or any other remote storage with my current Storj configuration, although that doesn’t mean I wouldn’t want to explore it in the future. I’m already at a point where I’m looking to separate my storage and compute, so naturally that would be part of it. At this point all I’ve done is spin up a separate Proxmox box with Docker running in an LXC that mounts a CIFS/SMB share that’s on the Windows computer. Those processes aren’t super IOPS heavy so I haven’t run into any major issues, and it actually resulted in much better stability on the Windows computer (which was unexpected).

I don’t think anyone is expecting Storj to test all of the potential combinations of configurations, that would be wholly unreasonable. It makes sense to have some simple setups outlined to help new operators get started, but beyond that all Storj should care about is the ability to meet SLAs: latency, throughput, data integrity, etc. Storj should not care how the underlying storage is architected. At that point the focus should be on how to measure those SLAs (setting up a test node, measuring disk performance with something like CrystalDiskMark or equivalent utilities, validating data integrity during certain operations, etc). That would then lead the community of node operators to find different ways to improve performance, in turn improving the end user experience.

As it pertains to disqualifications, those mechanisms (audits and suspensions) should still be in place to protect the integrity of the network, but I think Storj as a company and we the community need to come to a consensus: should a fixable issue result in near-immediate disqualification? If the answer is no, then we need to fix the current read/write check to be more resilient, like I’ve outlined above and in some other replies, and possibly rethink the disqualification flow. If the answer is yes, then we need to remove the current read/write check and let those nodes be disqualified like they were in the past.

For client downloads, the node does not know whose fault it is, though. We’ve already seen cases where clients were requesting pieces that were already deleted.

For audits, these are sometimes run for experiments by Storj Inc. on outdated metadata. These experiments do not count as audits, but they may again attempt to download pieces that were already removed.

Deletions now only happen via GC, so IIRC once every week. Or, for pieces with an expiration date, once an hour, which is still fairly rare.

Directories are created on-demand, so if they don’t exist, they’ll just be created.

Do you have any specific operations that happen frequently enough and could be reliably used to test storage?

So you expect an exhaustive list of requirements necessary to run a node. Is that correct?

There are several independent mechanisms, initiated by different services: from the satellite - audits, repair, and garbage collection; from the node itself - read/write checks and filewalkers; from the customers - uploads and downloads.
Only the read/write checks are performed frequently enough (every minute); all the others may have hours or days between accesses, and by that point the node would likely be disqualified already. When the auditor detects that a file is missing, it affects the audit score immediately, so it’s already too late. All the others have intervals too wide to be reused as a checker, and they also don’t see a difference between “file is missing” and “file is not accessible”. So if you really do have missing files, it would be impossible to reliably keep your node online; it would shut itself down every time someone tried to access a missing piece (pieces are marked as moved to other nodes only after repair, which may never happen because the repair threshold is not reached).

So, if you have a custom setup, you need to implement a custom writeability/readability checker. We cannot cover all possible combinations of custom configurations, so our checkers are standard and expect the storage location to be placed on one HDD/disk, not split between several disks, sorry.
If you want to try to implement a universal checker - please submit a PR, we are always glad to accept a Community contribution.

It’s been proven multiple times: even if you manage to make it work at the beginning, after a while it will stop working once the node has grown enough (YMMV): Topics tagged nfs, Topics tagged sshfs, Topics tagged smb

We do not want to have requirements that are too strict, because that would defeat the idea of using what you have now and what would be online anyway, without making investments to start a node. We can only recommend, or warn about unsupported setups. Unfortunately, your setup is not supported, so I added this warning to the documentation.

That’s why instead of terminating on any error you would look at a short history. With bad client requests and audits that are expected to fail, is this something we would expect to happen 100% of the time over the course of multiple minutes (higher volume nodes) or X number of requests (lower volume nodes)? My assumption is that while those would be seen as failures, they should really just be a baseline and shouldn’t be causing a 100% failure rate, so I wouldn’t expect that to trigger termination of the node process unless the failure rate is configurable and is set too low. If that is a possibility it may require tuning the size of the history used and/or weighting “file not found”/empty files vs “IO error” differently as it pertains to calculating the error rate (although we need to be careful not to allow a scenario where something is wrong but the system is returning no data successfully instead of returning an actual error).
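
To picture the weighting idea, a sketch where the weights and error categories are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// classify assigns a weight to a failed piece access when computing a failure
// rate: a missing piece may just be a stale or bad client request, while a
// permission or I/O error points at the storage itself.
func classify(err error) float64 {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, fs.ErrNotExist):
		return 0.1 // possibly expected: deleted piece, stale request
	case errors.Is(err, fs.ErrPermission):
		return 1.0 // permissions broke, e.g. after moving data around
	default:
		return 1.0 // treat genuine I/O errors as a strong signal
	}
}

func main() {
	_, err := os.Open("/definitely/missing/piece.sj1")
	fmt.Println(classify(err)) // 0.1
}
```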

Sort of. Not specifically hardware or configuration requirements, but requirements on things such as latency, throughput, and data integrity. For example, instead of saying you should have this type of drive and it should be connected this way (SATA, SAS, USB, directly connected, etc), it should be based on things like IOPS under a certain load, read/write speeds under a certain load, how long it should take to find and read a piece of data, how long it should take for incoming data to be moved to blobs, etc. Synthetic benchmarks could then be used to determine if a given setup meets those performance requirements (likely using 3rd-party tools, unless Storj or the community decides it would be worth it to come up with a utility to test these specific metrics). This could even go a step further to look at network connectivity to the different satellites to ensure the connection is reliable. I’d expect that to show that certain more common configurations aren’t suitable (maybe a 2.5" low-speed HDD connected via USB 2), but leave the door open for other less typical configurations.
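
As an example of the kind of synthetic measurement this could be based on (a crude sketch, not a Storj tool; the directory path and sample count are arbitrary), timing random reads of existing small files roughly approximates “how long it takes to find and read a piece”:

```go
package main

import (
	"fmt"
	"io/fs"
	"math/rand"
	"os"
	"path/filepath"
	"time"
)

func main() {
	root := "/mnt/storagenode/storage/blobs" // adjust to your setup

	// Collect candidate files (note: walking the tree already warms some
	// metadata caches, so run this cold for a worst-case number).
	var files []string
	_ = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err == nil && !d.IsDir() {
			files = append(files, path)
		}
		return nil
	})
	if len(files) == 0 {
		fmt.Println("no files found under", root)
		return
	}

	const samples = 200
	start := time.Now()
	for i := 0; i < samples; i++ {
		_, _ = os.ReadFile(files[rand.Intn(len(files))])
	}
	elapsed := time.Since(start)
	fmt.Printf("%d random reads in %v (%.1f reads/s, avg %v per read)\n",
		samples, elapsed, float64(samples)/elapsed.Seconds(), elapsed/samples)
}
```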

Eh, good luck determining the ratio of errors that would make sense, then. This system is still evolving. There are changes in the node software, in the satellite software, in the client software, and new customers keep coming, with different usage patterns.

Besides, given that any mistake in selecting a wrong threshold level could result in a customer losing data, Storj Inc. needs a conservative approach.

So, yeah, these are stated on the forums—not sure about the docs though, @Alexey can probably confirm.

  • latency: respond within 5 minutes for valid requests,
  • throughput: have at least 25 Mbps in downlink and 5 Mbps in uplink,
  • data integrity: do not lose more than 2% of data.

But given that customer access patterns and node software are also evolving, and some of these also depend on the number of nodes in the network, the exact parameters for storage change. For example, there was a recent change in the efficiency of the file walker process making it require fewer IOPS. On the other hand, we had some request spikes that required more IOPS for efficient processing. Storj Inc. likely does not have the capacity to keep measuring the exact threshold levels; changes to them would make some nodes ineligible, and various less trivial setups would require determining separate IOPS thresholds for databases and for blobs, making it even more difficult to measure.

Yet somehow we have 20k nodes that work correctly even though these requirements are not that precise. This means that they are probably far beyond the minimums, so stating them might not make much sense, and it would be a waste of time to even measure them.

  • the target uptime - not more than 5 hours offline per month
  • the controlled uptime - not more than 12 days offline: Changelog v1.12.3 - #8 by littleskunk
  • after 4 hours offline, all pieces are considered unhealthy and may be repaired to other nodes; the longer your node is offline, the more data will be moved to other nodes, and it will later be deleted from your node by the garbage collector.

See also

The node offline emails start to come after 4 hours of downtime. I have one node down and the emails started a few minutes apart on all 3 sats.

Storage Spaces is an abysmal MS joke. I would not use it for a gaming rig, let alone anything in production.

Yes, because you should not have an issue to begin with.

It should not matter. If you follow the non-RAID principles of the documentation, only a single node, consisting of just one of your HDDs, would fail, while the rest of your setup would not. You can outsmart that rule and use RAID, but your only good option would be a solid ZFS system, and there are many, many things that you can do wrong with ZFS. Hence the non-RAID recommendations from Storj.

Storage Spaces is good enough to support software RAID and cache. However, it’s not as robust as ZFS, but the MS family doesn’t have an analogue.

It is not good enough if you don’t want to lose data.
It is a broken, badly documented mess that can and will break with any update. You don’t even know whether MS will drop support completely or paywall it behind some strange “workstation” edition like they did with ReFS.

There is. It is called ReFS. Although it was planned to replace NTFS and has been supported since Windows 8.1, MS still thinks that ReFS is not ready for Windows 11. One of those rare instances where I agree with MS :slight_smile:

ReFS is not production ready as far as I have read - people losing data out of the blue.

ReFS node operator of almost a year here; my drive finally kicked the bucket as well (it reports as RAW format). We’ll see if I can salvage enough to avoid disqualification; if not, it’s been fun.

Restoring to the trusty NTFS partition, I’m done with ReFS as well.