Disqualified after 3 years

The issue I have with your responses is that you never acknowledge your part in this. The documentation also doesn’t advise against using a RAM drive to store your blobs. Or using cloud storage. Or using a large array of floppy disks and CD burners tied together with clever software. You can’t expect them to explicitly warn against every possible exotic setup people can come up with. But it does advise using one HDD per node, which you also didn’t do. Instead of acknowledging your setup as the issue, you again point to having to respond within an hour. Which… you don’t, IF you have a reasonable setup. So that entire argument is moot. Nobody expects you to respond within an hour; they expect you to run a stable setup without jank. So you can stop referring to old topics written before the solutions we have today were put in place. As you have seen, I was a frontrunner in making that argument. I am not making that argument anymore and neither is anyone else since the fix. So you can point to solved old discussions all you want, but it won’t help you much.

Might that be the reason I already outlined twice for you and you conveniently ignored both times?

You’re right, it doesn’t. The only thing that matters is that the underlying storage strategy is stable and reliable. Yours wasn’t.

Luckily that was fixed with the read/write check. So no longer an issue.

And how do you suppose they would differentiate between your scenario and data being truly gone? Because to the node, it was truly gone. Or do you expect them to wait for over a month, put data at risk by not triggering repairs that are needed, when the node clearly shows the HDD is accessible, but the data is gone?

I mean, that sucks and I’m sorry. It’s really not worth losing sleep over. I take it my responses probably aren’t helping, so I’m sorry about that. I’m just trying to be reasonable here and I just don’t see any fault of Storj in this scenario.
So yeah, please enjoy your weekend and take some time to relax.

1 Like

It’s in blobs, garbage, temp, & trash inside the storage location. While this may seem like arguing semantics, I feel it’s an important distinction. Unless you leave the databases in their default location, I don’t believe there’s any actual data stored in the root of the storage location. If you want to test writes, check temp. If you want to test reads, check blobs. If you want to test data moving between the two, check both. To be clear, I’m not suggesting there should be a test file in each of those directories (what if you’re using mergerfs behind the scenes and a disk drops, resulting in a temporary “loss” of some but not all of the data?). The node is already doing this naturally; it just isn’t doing it by way of a test file, but rather by actually working with the data files. While it may have been an active decision to require everything to be on the same disk (super fast when moving data around), from digging around the code a little bit it looks like it’s just a result of how it’s implemented (os.Rename for Linux: storj/storagenode/blobstore/filestore/dir_unix.go at c9591e9754d1fb229b9e2b2a0d75013901fed647 · storj/storj · GitHub; MoveFileEx for Windows: storj/storagenode/blobstore/filestore/dir_windows.go at c9591e9754d1fb229b9e2b2a0d75013901fed647 · storj/storj · GitHub). In theory, assuming you don’t have to worry about race conditions when the file exists in multiple locations during the move and the underlying file system supports it, this could also be accomplished with hardlinks (still same disk/partition) or copy/delete (different underlying storage). If copy/delete were implemented, you could use anything that can be mounted (local disks, remote shares, cloud storage, distributed storage like Ceph, RAM for temp, etc.).
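Just to illustrate the copy/delete idea (a minimal sketch under my own naming, not how the node actually does or should do it): try the atomic rename first and fall back to copy+delete only when the rename fails, for example because source and destination sit on different filesystems.

// Sketch only: atomic rename with a copy+delete fallback for cross-device moves.
package blobmove

import (
    "io"
    "os"
)

// moveFile renames src to dst if possible; otherwise it copies the data and
// removes the original. The copy path is not atomic, so for a short window
// the file exists in both places.
func moveFile(src, dst string) error {
    // Fast path: same filesystem, atomic on POSIX.
    if err := os.Rename(src, dst); err == nil {
        return nil
    }

    // Slow path: copy across filesystems, then delete the source.
    in, err := os.Open(src)
    if err != nil {
        return err
    }
    defer in.Close()

    out, err := os.Create(dst)
    if err != nil {
        return err
    }
    if _, err := io.Copy(out, in); err != nil {
        out.Close()
        os.Remove(dst) // clean up the partial copy
        return err
    }
    if err := out.Sync(); err != nil { // make sure the data actually hit the disk
        out.Close()
        return err
    }
    if err := out.Close(); err != nil {
        return err
    }
    return os.Remove(src)
}

The obvious costs are the doubled IO and that brief window where the piece exists in both places, which is exactly the race condition caveat above.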

To be fair, it was until it wasn’t. And even then the data is still there and intact, it’s not missing or corrupted. It went missing from the network for ~2 hours. If 3 years of no major issues with hundreds of gigabytes ingressed & egressed each month isn’t stable and reliable, I’m curious as to what is.

Unfortunately it’s an issue for me. The check looks for a specific scenario (can’t read/write to node_storage), which according to what you’re saying accounts for the majority of issues (assuming everyone who runs into disqualification comes to the forum about it instead of either letting the node die or restarting a new one), but that doesn’t mean all scenarios are accounted for (otherwise I wouldn’t be in this situation right now). The symptom, as the node sees it, is the same in all of these scenarios: the blob data can’t be read because it is temporarily unavailable (the data itself is fully intact, but can’t be accessed by the node). Obviously the node doesn’t know it’s only temporary; that’s fine, there’s no way for it to know and I don’t think anybody expects it to. However, if a check was implemented to prevent nodes from being disqualified in these scenarios, and that check isn’t working in all of them (even though it works in most), shouldn’t there be some concern, at a conceptual level, if new scenarios or edge cases are found that run into the same issue (data temporarily unavailable, “missing” data intact and complete, issue resolvable quickly with manual intervention, but disqualification happening before the intervention could take place)? I’m not trying to be mean when I say this, but the reading I’ve done over the past day or two and a few of the sentiments shared here seem to support that concern, while what you’re saying here seems to contradict things you said in the past, before this read/write check was implemented.

Let’s run through a few scenarios.

First scenario, data is corrupted (failing drive, ransomware, etc). Assuming nobody backs up Storj data (the last time I rsync’d it while moving some stuff around it took about 3.5 days, and that was at roughly 2/3 of my current file count and size), this is unrecoverable. Really just business as usual: it will fail audits and eventually be disqualified (quickly for larger nodes, slower for smaller nodes). Makes sense, no issues there.

Second scenario, drive has completely died and all the data is gone. Depending on the type of failure this would likely fail the read/write check and cause the node to terminate. If the node stays offline my understanding is it will be suspended for being offline after a few days, and after a few weeks be disqualified. Sounds fine, although ideally if it’s known that the drive is dead and unrecoverable that could be reported back to the network so repairs can be done and life can go on.

Third scenario, drive is fine but communication is failing (temporary issue that can be resolved with a reboot, fixing/replacing a cable, etc). If it fails the read/write check then it would be the same as scenario 2: suspension after a few days of being offline and eventually disqualification after several weeks if nothing is done. If the issue is fixed and the data is back, I’d expect to see a temporary drop in the suspension score and possibly a small drop in the audit score. If this keeps happening (bad cable wasn’t replaced, operating system is failing, etc), those scores will keep dropping as uptime decreases and some audits potentially fail. If the read/write check is successful then it’s the same as scenario 1: disqualification after 40 failed audits.

What that’s effectively saying is that if a drive completely fails and all files are gone, we’re okay with taking over a month before that node is given up on (worst case scenario: the node process is terminated and never brought back up). Similarly, if a node has a temporary issue reading the data and the read/write check catches it, the operator has the same amount of time to remedy the issue and bring the node back online. In both cases the node operator would be notified that the node is offline, giving them something to react to (assuming they check their emails). And if the drive is dying, resulting in corrupt or partially missing data (so not a catastrophic failure), that would start driving down the audit score, roughly in proportion to the amount of damage to the data. If the drive is really bad that should happen fairly quickly (hours or days?); if the drive is only sort of bad it could take longer (I believe the changes you proposed and that were implemented took this type of info into account based on the % of bad data, so you probably have a better feel for how long that would take). I believe in these scenarios the node at least has the ability to attempt a graceful exit, which could be successful if the damage isn’t too bad. However, in the scenario of a recoverable issue where the read/write check hasn’t caught it, you only have as long as it takes to receive 40 audits and then you’re done for good. So even the truly catastrophic scenarios leave quite a bit of time to react and decide what to do.

I think my goal is to answer this: if a completely dead drive (no data) or a temporarily offline drive gives you weeks to react, identify, and remedy an issue, is it fair that scenarios the read/write check didn’t catch should allow similar time to react? If the answer is yes, then my reaction would be to reach out to support to have the disqualification removed and work on a proposal to enhance the read/write check so that it accounts for large numbers of blobs/temp read/write failures over a short period of time, either to replace the existing check or to work alongside it, preventing similar disqualifications from happening in the future. If the answer is no, then I would probably push for suspension vs disqualification based on the type of audit failure (failure to send data vs bad hash) and decide whether I want to continue with the project.
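To make that proposal a bit more concrete, the rough shape I have in mind is a sliding-window failure tracker that could sit alongside the existing readability/writability check. This is purely a sketch with made-up names and thresholds, not anything from the actual codebase:

// Sketch only: shut the node down when blob operations fail at a high rate
// over a short window, even if the node_storage check itself still passes.
package healthsketch

import (
    "sync"
    "time"
)

type failureWindow struct {
    mu       sync.Mutex
    window   time.Duration
    failures []time.Time
    total    []time.Time
}

func newFailureWindow(window time.Duration) *failureWindow {
    return &failureWindow{window: window}
}

// Record notes one blob read/write and whether it failed.
func (f *failureWindow) Record(failed bool) {
    f.mu.Lock()
    defer f.mu.Unlock()
    now := time.Now()
    f.total = append(f.total, now)
    if failed {
        f.failures = append(f.failures, now)
    }
    f.prune(now)
}

// ShouldShutDown reports whether the recent failure rate is high enough that
// going offline (and taking the online-score penalty) is safer than
// continuing to fail audits.
func (f *failureWindow) ShouldShutDown(minOps int, maxFailureRate float64) bool {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.prune(time.Now())
    if len(f.total) < minOps {
        return false // not enough recent operations to judge
    }
    rate := float64(len(f.failures)) / float64(len(f.total))
    return rate >= maxFailureRate
}

// prune drops entries older than the window.
func (f *failureWindow) prune(now time.Time) {
    cutoff := now.Add(-f.window)
    f.failures = trimBefore(f.failures, cutoff)
    f.total = trimBefore(f.total, cutoff)
}

func trimBefore(ts []time.Time, cutoff time.Time) []time.Time {
    i := 0
    for i < len(ts) && ts[i].Before(cutoff) {
        i++
    }
    return ts[i:]
}

The idea being: call Record on every blob open/write, and if, say, 90% of the last five minutes’ operations failed while the existing check still passes, shut the node down instead of letting it keep failing audits.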

I appreciate that. I’m trying to stay level headed through this, but I feel like some of my responses may be coming across as snarky, abrasive, etc, and if they are I’m also sorry about that (to both you and the others responding here). You’re obviously a great asset to the community, and appear to have been for many years. Even if we’re not on the same page, I appreciate you taking the time to respond here.

5 Likes

Honestly I’m not sure exactly how metadata is stored with Storage Spaces when SSD/NVMe is involved. The array I have Storj data on (shared with some other things in the past, most of which have since moved to other arrays) is set up as a basic mirrored array that is thinly provisioned, meaning it only uses the space it needs from my entire disk pool (105TB), up to the size you set when you create it. I have another array that uses tiering, where I can set aside specific amounts of high-speed storage from the pool, and the most frequently used data is stored on the SSD (in part or in full; unless you pin specific files it generally works at the block level, not the file level). The downside is that, while you can increase the size of any disk/array, tiered space is all preallocated. So if I provision 1TB but only put 1GB of data on it, the remaining 1023GB won’t be available to the pool for other uses. While Storage Spaces will use some space from any SSDs/NVMes for write caching, I have it set to the default, which if I remember correctly is somewhere around 10-20GB. So it still has to move the data from there to the HDD in the background, but it allows files to be moved faster (up to that limit) and slowly fed to the HDD. All of this is to say I could see tiered arrays storing metadata on the SSD/NVMe tier, but not necessarily in high-speed storage for a thinly provisioned array or one that isn’t using tiering, and I really can’t say for sure.

Performance wise I haven’t had any issues. Going from memory, most of my HDDs (all connected via USB3) get around 210 MB/s during single, sequential transfers on their own, although that drops down to ~0.6 MB/s with heavy random reads. I have some “simple” arrays (think RAID0) that peak around 400-600 MB/s when striped across 3 disks. I also have what I call a “fast scratch” (like a big temp drive) that has about 200GB of NVME backed by 2TB of HDD and it sees something in the 1,600 MB/s range as long as it doesn’t have to fall back to the HDD tier, after that I think it’s somewhere in the 400-800 MB/s range. I’ve tested the speed of a RAM drive and it was also in the 1,600 MB/s ballpark. Of course this can all vary depending on the load since the underlying drives are currently shared between the different arrays (IOPS are spread out).

Storj is really just a relatively small part of my data (like I mentioned, I’m working with 105TB raw with 5 more 14TB drives sitting on a shelf waiting to be used, Storj was using ~13TB after mirroring). Unfortunately most of it isn’t tied to “mining” these days and I still need the storage, so unless something changes in the crypto space I won’t have any more business income to claim.

That looks promising. My free time is spread a little thin these days, but if that’s what’s driving the documentation site that may be a good place to start since I’m not super familiar with Go and there’s a LOT to understand with the entire codebase before I would even start to consider submitting a potential code change.

And it would also duplicate IO for any normal setup. The point is, it’s unrealistic to expect the node to assume that any folder could be on a different storage medium. Especially since there is literally no good reason to do such a thing, other than for the db files. Mergerfs or similar solutions would just make your node less reliable; you should be running a node per disk instead. And should such a scenario occur, it wouldn’t even be 100% failures anymore, so accounting for that with your suggestion would mean also shutting down the node when partial failures occur? All of these setups are not recommended for good reason. Even if the node could survive temporary failure of a disk, the chance of permanent failure is also much higher due to multiple critical disks being involved. These are the kinds of setups Storj doesn’t want, so it stands to reason they will not make adjustments and spend engineering time on checks to explicitly support them.

Yeah, but so is striped RAID. Those can easily run for 3 years as well. But there’s a reason it’s often referred to as scary RAID. It doesn’t say all that much.

Probably not, but you read the old topics. There was a flood of those reports at the time, and that flood has entirely dried up. I of course can’t say for sure that you’re the only one running into this issue since the fix, but you’re at least the first to mention on the forum that the fix didn’t work for them. And as mentioned before, there are other issues with the setup you chose.
Perhaps I would agree with you more if your setup were a far superior way to run a node. But it isn’t. It’s one that adds unnecessary failure points that could bite you in the butt in many more ways than this one. And to be honest, I can’t think of any setup for which the current solution wouldn’t work that wouldn’t have the same issue of additional points of failure.

Don’t worry, I don’t think you’re mean. It’s just that, yes, I still have that concern, but it’s been resolved for any reasonable setup. So I don’t think it makes sense to refer to old topics from when there was no solution at all. Developer resources cost time and money, which is better spent elsewhere instead of trying to “fix” a check so that it would also work with less reliable setups. The better fix would perhaps be clearer warnings not to mix HDDs for the storage location in any way, since those setups are far from optimal to begin with.
So yeah, it’s not my opinion that has changed, it’s the context. Airing our opinions at the time triggered Storj to fix it because it was hitting and harming reasonable setups. I’m not seeing the issue you had affecting reasonable setups right now. That’s the difference.

This situation is already not ideal. Then again, most drive failures are not that instant, and DQ will probably precede it anyway. But keeping the node around would mean Storj paying for another month for a node that is already broken, and data not being repaired as quickly.

Guess I saw this argument coming with my previous response. I’m not really okay with that, but there isn’t really a much better solution for that scenario.

I get your point, but it’s the wrong question to ask here. The real question is, what percentage of nodes are not protected by the current solution and are those nodes the kinds of setups Storj would want to protect to begin with? I think I’ve been clear about my stance on that with your example. And I believe the current solution protects all nodes worth protecting. (That’s also not meant to be mean towards you btw, I just think the way you approached it has major flaws)

I have the same problem in the way I talk. As long as someone is writing arguments and passionately defending their stance, I’m perfectly fine with anything they say. So you’re A okay in my book. :wink::+1:

I appreciate that and I commend you for standing up for your point. Though I wish you had found this forum earlier and discussed your setup. I know it’s far from reasonable to expect that from anyone running a storage node. But it sucks to know that this great community and the active people from Storj Labs could have warned you before it was too late. I hope you stick around. Despite the unfortunate circumstances, I think this is a great place and it’s awesome to have more smart people like you hanging around here to discuss these topics. It’s why I keep coming back here in my free time. Honestly it’s probably more than half of the fun of running nodes for me.

4 Likes

From my observations the usual performance bottleneck for node operations, and certainly for the file walker, is random I/O, not sequential. That is then the number closest to being meaningful in this context. Though interpreting it requires knowing how the measurement was made: was it directly on the block device? A single large file? Across many small files, with filesystem lookup included?

1,000% yes, bottlenecks related to random read IOPS are the biggest pain point.

That was from a CrystalDiskMark test I did on a single drive (before adding it to the pool) before adding data to it, and with nothing else being utilized. I actually saved the results so I could refer back to them:

------------------------------------------------------------------------------
CrystalDiskMark 7.0.0 x64 (C) 2007-2019 hiyohiyo
                                  Crystal Dew World: https://crystalmark.info/
------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
Sequential 1MiB (Q=  8, T= 1):   212.235 MB/s [    202.4 IOPS] < 39366.03 us>
Sequential 1MiB (Q=  1, T= 1):   210.965 MB/s [    201.2 IOPS] <  4966.50 us>
    Random 4KiB (Q= 32, T=16):     0.893 MB/s [    218.0 IOPS] <625353.10 us>
    Random 4KiB (Q=  1, T= 1):     0.629 MB/s [    153.6 IOPS] <  6495.05 us>

[Write]
Sequential 1MiB (Q=  8, T= 1):   213.711 MB/s [    203.8 IOPS] < 39036.42 us>
Sequential 1MiB (Q=  1, T= 1):   213.900 MB/s [    204.0 IOPS] <  4896.47 us>
    Random 4KiB (Q= 32, T=16):     7.084 MB/s [   1729.5 IOPS] <233402.27 us>
    Random 4KiB (Q=  1, T= 1):     7.622 MB/s [   1860.8 IOPS] <   536.54 us>

Profile: Default
   Test: 1 GiB (x5) [Interval: 5 sec] <DefaultAffinity=DISABLED>
   Date: 2020/10/31 16:03:14
     OS: Windows 10 Professional [10.0 Build 18362] (x64)

I also just realized I said 0.6 MB/s when that was actually the single random read, not the heavy one (Q32,T16) which was surprisingly a little higher at just under 0.9 MB/s. If I remember correctly when filewalker is running it usually sits in the 500-700 KB/s range when monitoring the drive with Task Manager.

If you’re interested I have at least some of the other tests I ran, although I’m not sure I have any for the basic mirrored array; most of it was testing a parity array to tune performance (different numbers of disks in a “group”, aka columns in Storage Spaces) as well as the high-performance arrays.

150 is a good number here. The theoretical maximum for a 7200 RPM drive is 250 (at 7200 RPM the platter makes 120 revolutions per second, so rotational latency alone averages around 4 ms per request). It makes sense that this number goes up on concurrent reads due to NCQ: the drive can reorder operations if it figures out that, let’s say, the head is closer to another requested sector.

However, what would be interesting would be the IOPS on the storage space itself. IIRC Microsoft has a tiny tool (DiskSpd) to measure these things.
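If it is indeed DiskSpd, an invocation along these lines (flags from memory, so worth double-checking against the docs) should approximate the Q32T16 random-read case against a file on the storage space itself; the X:\storj-space\testfile.dat path is just a placeholder:

diskspd.exe -b4K -d60 -o32 -t16 -r -Sh -L -w0 -c1G X:\storj-space\testfile.dat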

Initially I really agreed with:

This has always felt unfair to me too.

But this entire thread is full of clever thoughts here and there, so I’m not sure how I feel anymore towards that matter now :sweat_smile:

I do agree with @BrightSilence! :slightly_smiling_face:

Yes, we now have a GitHub repository for the documentation (GitHub - storj/docs: Source for Storj DCS docs) and it’s finally a source of truth for our documentation, and PRs are always welcome!

2 Likes

Wouldn’t help, since you may use junctions again, so it would check only one of the four disks. The node expects the whole storage directory to be writeable and readable.

This will lead to a quick disqualification if it loses 5% or more of the data. MergeFS, RAID0, JBOD, … mean inevitable disqualification with one disk failure.

This will not work because of differences in the implementation of file locks, so no network filesystems are supported. The only working network storage is iSCSI (but it may lead to disqualification due to piece corruption or prolonged timeouts of 3x 5 minutes for each piece).

A backup is useless; your node will be disqualified as soon as you bring it online after a restore because of the data lost since the backup. So your backup would have to be synchronous, and I do not know whether it is possible to make that fast enough. RAID1 is not a backup tool.

Data will be considered unhealthy after 4 hours offline, so it will be repaired if the number of healthy pieces drops below the repair threshold. And when you bring your node back online, that data will be removed from your node by the garbage collector.

All your scenarios are covered by the current implementation, as long as you don’t use junctions for the blobs, trash, or temp subfolders of the data location.

I would add another warning to the documentation to explicitly say that splitting the storage location will result in quick disqualification if any of its subfolders become offline.

3 Likes

I’ve never seen a node offline email. Didn’t even know it was a thing. Is it something that has to be enabled?

Kind of. You just have to configure the optional email address in your node. If you want to test it maybe just install an old version for a moment. That should also trigger an email without any negative consequences for the node other than no uploads for a few minutes. Triggering the offline email on purpose wouldn’t be a good idea since that comes with additional penalties.

Yes, it was enabled some versions ago… you receive an email from each sat, at different moments, after x hours of being offline. Of course, you must put the real email address in the config or docker run command.
@littleskunk - It would also be nice to receive an email when GE is finished.

I haven’t installed DiskSpd yet, but here’s a CrystalDiskMark report against the mirrored array (the simple one I use for Storj with some default write cache, but no SSD tiering). Nothing was running against that array at the time, but there was some other activity within the pool (nothing crazy, but not the completely idle conditions of the single-disk test).

------------------------------------------------------------------------------
CrystalDiskMark 7.0.0 x64 (C) 2007-2019 hiyohiyo
                                  Crystal Dew World: https://crystalmark.info/
------------------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

[Read]
Sequential 1MiB (Q=  8, T= 1):   411.870 MB/s [    392.8 IOPS] < 20310.69 us>
Sequential 1MiB (Q=  1, T= 1):   153.138 MB/s [    146.0 IOPS] <  6839.75 us>
    Random 4KiB (Q= 32, T=16):     3.924 MB/s [    958.0 IOPS] <327917.89 us>
    Random 4KiB (Q=  1, T= 1):     0.751 MB/s [    183.3 IOPS] <  5438.36 us>

[Write]
Sequential 1MiB (Q=  8, T= 1):   163.405 MB/s [    155.8 IOPS] < 50900.37 us>
Sequential 1MiB (Q=  1, T= 1):    99.159 MB/s [     94.6 IOPS] < 10528.39 us>
    Random 4KiB (Q= 32, T=16):    12.438 MB/s [   3036.6 IOPS] <165474.86 us>
    Random 4KiB (Q=  1, T= 1):     1.537 MB/s [    375.2 IOPS] <  2661.68 us>

Profile: Default
   Test: 1 GiB (x5) [Interval: 5 sec] <DefaultAffinity=DISABLED>
   Date: 2023/09/03 14:29:17
     OS: Windows 10 Professional [10.0 Build 19045] (x64)

Heavy random reads are WAY better, which is part of the reason I’m using it in the first place (in addition to not losing the node if there’s a drive failure). I believe it’s a 2-column mirror (I originally created it when there were 4 disks, I think), so as long as the data is balanced correctly I could lose 4 of the 9 HDDs and 1 of the NVMe drives in the pool without losing data.

Is it just a config line that needs to be added? All my nodes were created some time ago and only have the necessary lines in their config files. I do use my actual email and received an email a while back about a sat suspension, but that was it. If the emails take hours I’m not sure it will help me much, since I have my own monitoring that’s more or less immediate, but it doesn’t hurt to have more redundancy. Mind sharing the config option if that’s what it is? Not sure where to find it without creating a new node to get the default config file… which seems like a pain.

The file walker is single-threaded, so it’s still below 200 IOPS. This could probably be changed in the node’s code, but there’s probably no point in doing that: at this level of concurrent random reads, the file walker won’t affect concurrent uploads/downloads much. So you should not worry that the file walker takes a long time.
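If someone did want to experiment with that, the shape of the change is basically a small worker pool around the directory walk. A sketch with my own names, not the node’s actual walker code:

// Sketch only: fan file stats out to a few workers so more random reads are
// in flight at once and NCQ has something to reorder.
package walksketch

import (
    "io/fs"
    "os"
    "path/filepath"
    "sync"
)

// parallelWalk stats every file under root using `workers` goroutines and
// passes the results to visit.
func parallelWalk(root string, workers int, visit func(path string, info fs.FileInfo)) error {
    paths := make(chan string, 128)
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for p := range paths {
                if info, err := os.Stat(p); err == nil {
                    visit(p, info)
                }
            }
        }()
    }

    err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() {
            return err
        }
        paths <- path
        return nil
    })
    close(paths)
    wg.Wait()
    return err
}

But as said, leaving it single-threaded also keeps the file walker from competing with customer uploads/downloads, which is probably the better trade-off.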

storagenode setup --help | grep email
      --operator.email string                                    operator email address

You may provide it after the image name in your docker run command or add/uncomment it in the config.yaml file and restart the node.
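For example (the address is just a placeholder), in config.yaml:

operator.email: "you@example.com"

or appended after the image name in the docker run command (the “…” stands for your usual options):

docker run -d ... storjlabs/storagenode:latest --operator.email="you@example.com"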

Yep, that’s in all my configs. Guess it just doesn’t work for some reason.

Maybe it went to the spam folder?

Nope. I’ve run nodes for years and never seen an offline email anywhere.