Node goes down/restarts every 10-15 minutes. thread allocation error in logs

What exactly is lunacy?

The node receives write requests and sends them to disk. The disk saves data slower than the node receives new data. Where should the node put that extra data? In RAM, hoping the disk will catch up. But it never does. RAM usage keeps increasing until the node is killed by the OOM watcher or the whole system dies.

(If you have a clogged, slow-draining sink in the kitchen and you open the faucet, the water will first fill the entire sink and then overflow onto the floor. So you come and shut off the flow at the master valve just before that is allowed to happen.)
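To make the failure mode concrete, here is a minimal Go sketch (my illustration, not Storj's actual code) of a receiver that queues pieces in an unbounded in-memory backlog while a slower writer drains it; the piece sizes and rates are made up:

```go
// Minimal sketch of the failure mode described above: data arrives faster
// than the disk drains it, the in-memory backlog only grows, and eventually
// the OOM killer steps in.
package main

import (
	"sync"
	"time"
)

func main() {
	var (
		mu      sync.Mutex
		backlog [][]byte // unbounded in-memory queue of pieces waiting for disk
	)

	// Slow "disk": flushes roughly 100 pieces per second.
	go func() {
		for {
			mu.Lock()
			if len(backlog) > 0 {
				backlog = backlog[1:] // pretend this piece was written out
			}
			mu.Unlock()
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Fast "network": roughly 1000 pieces per second arrive,
	// so queued bytes (i.e. RAM usage) only ever grow.
	for {
		mu.Lock()
		backlog = append(backlog, make([]byte, 64<<10)) // 64 KB "piece"
		mu.Unlock()
		time.Sleep(time.Millisecond)
	}
}
```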

This is behavior by design. What alternative do you suggest? Reject data? Maybe. It's not clear how to implement limits when the transfers are peer-to-peer.

And that’s fine. You seem to be running this hardware specifically for storj, which was always recommended against.

If you had a proper array (for literally any storage need you would have an array, never single disks), the storage node workload would not even have made a dent.

I mean, it's kind of possible with a high water mark / low water mark concept: if there is too much, the node can reject incoming data and tell the peer to back off (in Node.js this is known as backpressure, see Node.js — Backpressuring in Streams). Guess we will see…
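A rough sketch of what that could look like, assuming an invented Buffer type and watermark values (this is not anything the storagenode actually implements):

```go
// Sketch of high/low watermark backpressure: refuse new uploads once the
// high watermark of buffered bytes is crossed, and accept again only after
// the backlog drains below the low watermark.
package backpressure

import (
	"errors"
	"sync"
)

const (
	highWatermark = 512 << 20 // ~512 MB buffered: start rejecting
	lowWatermark  = 128 << 20 // ~128 MB buffered: start accepting again
)

var ErrBusy = errors.New("node busy, please back off")

type Buffer struct {
	mu       sync.Mutex
	buffered int64
	paused   bool
}

// Accept is called for every incoming piece; it either queues the piece
// or tells the peer to back off.
func (b *Buffer) Accept(piece []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.paused || b.buffered+int64(len(piece)) > highWatermark {
		b.paused = true
		return ErrBusy
	}
	b.buffered += int64(len(piece))
	// ... hand the piece to the disk writer here ...
	return nil
}

// Done is called by the disk writer after a piece of n bytes has been flushed.
func (b *Buffer) Done(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buffered -= n
	if b.paused && b.buffered < lowWatermark {
		b.paused = false
	}
}
```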


I just remembered this video: https://www.youtube.com/watch?v=kpvbOzHUakA&ab_channel=ScaleConf. For example, the OP's node broadcasts that it can receive data again, instantly receives a ton of data from peers, and is dead in the water again. The power of two choices could really help here.
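For illustration, the idea boils down to something like this (hypothetical Node type and Pending field, not an actual Storj API): instead of sending to whoever answers first, pick two candidates at random and use the less loaded one.

```go
// Sketch of the "power of two choices" load-balancing idea from the talk
// linked above; purely illustrative.
package balance

import "math/rand"

type Node struct {
	ID      string
	Pending int // e.g. queued or in-flight uploads reported by the node
}

// PickTwoChoices returns the less loaded of two randomly chosen nodes.
// Assumes len(nodes) >= 1.
func PickTwoChoices(nodes []Node) Node {
	a := nodes[rand.Intn(len(nodes))]
	b := nodes[rand.Intn(len(nodes))]
	if b.Pending < a.Pending {
		return b
	}
	return a
}
```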

Arrogant much? This is exactly what they recommend for storagenode operators: no RAID, one disk per node. We do not need RAID specifically because the network replicates the data.

@consumerbot797287 It sounds like this is some kind of embedded system you’re using. The issue isn’t with your CPU or RAM specifically but with the connection to your hard drive. How is your hard drive connected?

I'm using a Raspberry Pi 4 with USB 3 SATA controllers and I am in the same boat, with my nodes crashing during this stress test. My experience using USB-connected hard drives has been pretty poor, and this recent bout was the icing on the cake. If I upgrade my hardware, it will be to something with native SATA, or with native PCIe for a SATA or SAS controller.


I'm sure there are best practices that I've not looked into, and I ought to have set CPU and RAM limits on the Docker container. But spawning 35,000+ active threads and grabbing as much RAM as it can seems like greedy (to put it kindly) software.

Uh, yeah? Sorry, but again, it seems like lunacy. I'm operating a Storj node and offering a specific amount of storage capacity, not every bit of available RAM and CPU. The software ought to have some constraints by default, in my opinion: tell the satellite "I'm maxed out, sorry."

Yep. Had this node on an array using “excess” capacity. Moved it to a dedicated drive when I needed that capacity. Wasn’t going to spin up a new array just to keep the node going.

Explained above.

Is it? They’re worried about the network’s capacity, but you’re happy to send node operators packing? Bigger share for you, right? :wink:

Yeah, it's an SoC-type board. And you guessed it: it's connected over USB.

I've toyed with the idea of upgrading the server, but for everything other than Storj it's already way more than capable. The modest payments from Storj can't justify the expense; I would never achieve ROI.

I'm legitimately surprised that this is an issue. I can get 100+ MB/s sustained sequential transfers on my disks over USB 3. During the tests I was seeing network traffic of 10 Mbit/s up / 20 Mbit/s down across 4 nodes. I wouldn't have thought that would bring my system to the brink the way it did.


When I am writing big files to my USB-connected drives, I also have no issues and get high throughput. However, when reading and writing a lot of small files (like storagenode does), I see that IOPS are very limited over a USB connection compared to SATA.
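If you want to reproduce the difference on your own hardware, here is a quick-and-dirty Go sketch (my own, with arbitrary sizes and counts, not storagenode's real I/O pattern): it times one large synced write against many small synced writes.

```go
// Rough comparison of sequential throughput vs. small-file write rate
// on a given directory.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: iobench <directory on the disk under test>")
	}
	dir := os.Args[1]

	// One 256 MiB sequential write, synced once.
	start := time.Now()
	writeFile(filepath.Join(dir, "big.bin"), make([]byte, 256<<20))
	fmt.Printf("sequential: %.0f MB/s\n", 256/time.Since(start).Seconds())

	// 2000 files of 16 KiB each, each synced individually (IOPS-bound).
	small := make([]byte, 16<<10)
	start = time.Now()
	for i := 0; i < 2000; i++ {
		writeFile(filepath.Join(dir, fmt.Sprintf("small-%04d.bin", i)), small)
	}
	fmt.Printf("small files: %.0f files/s\n", 2000/time.Since(start).Seconds())
}

// writeFile creates a file, writes data, and fsyncs it, like a piece store
// that persists every piece durably.
func writeFile(path string, data []byte) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := f.Write(data); err != nil {
		log.Fatal(err)
	}
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
}
```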


I think it's perfectly reasonable to use a USB connection. I'd opt for pushing back to keep the IO lower; it should work even over USB 2.0… Or, if it runs as low-priority work, that should be fine too.

As well it shouldn't. I think @arrogantrabbit likely had it correct earlier when he described the sequence: disk IO gets backed up, and the node software just keeps caching further data, optimistic that it will eventually get written. But for it to do that to the point where it has grabbed 4+ GB of RAM (and then waits for the OOM killer) is bad behavior, in my judgment.

Maybe the point of this "stress test" was intended to uncover issues like this. But it seems the software issue could easily have been made manifest by testing a single node with an underperforming disk (maybe connected with USB 1 :grinning:) just to make sure the software handles the situation correctly.


I know this is not helping you much, but unless you have enough RAM for file metadata caching (or SSD, or any other decent caching solution), then indeed storage nodes will be slow. The current design trades off resources for simplicity. As a node operator myself I had to deal with this tradeoff too. For a typical non-virtualized Linux setup with ext4, you would need to aim for at least 1 GB of RAM dedicated to caching file metadata for each 1 TB of pieces stored, probably more.

There are ongoing conversations on improving this situation, but for now it might be indeed more healthy for you to shut down your node if you do not want to add a caching solution.


Besides the filesystem cache in RAM, I've never heard of a file metadata caching solution. Do you mean swap on an SSD? In that case vfs_cache_pressure appears to be the relevant configuration item.

Unless you're talking about moving the storagenode databases? I've never really seen the advantage of that. Even in that case, the IO issue is with piece data going to the disk, not storage metadata, AFAIK.


Any block device caching solution will do (bcache, lvmcache, etc.). They cache everything, not just metadata, but the important thing is that they will cache metadata. And if their eviction algorithms are any good, metadata will stay in cache.

Though, block device caching is not as granular as RAM-based cache, so you need more of it.


Not really. The recommendation was 1 node per disk, meaning: don't run more nodes than you have disks. It was not exhaustive, just a ballpark.

I have never seen a recommendation against using arrays for storagenode. It would simply not be feasible.

For one, you cannot sustain a node on just a single disk. It's not magic: the average file size is, let's say, 16 kB (many people have posted histograms here on the forum). Assuming, very generously, that you need two I/O operations to write a file (write the data and update the directory), an average HDD that can only sustain 200 IOPS can handle 200 / 2 × 16 kB ≈ 1.6 MB/s of traffic from Storj. Under 2 megabytes per second. It's laughable. It maybe was OK in the early days, but come on. 2 MB/s in 2024?
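Spelled out, that back-of-the-envelope estimate is:

$$
\frac{200\ \text{IOPS}}{2\ \text{IOs per file}} \times 16\ \text{kB per file} = 100\ \text{files/s} \times 16\ \text{kB} = 1.6\ \text{MB/s}
$$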

For another, the actual advice was "use existing capacity". People don't just have random disks in the closet that happen to also be available for Storj; people have arrays. Even adding a simple cache device in front of your HDD will drastically improve what it can handle.

In reality there are way more IOPS generated, due to database writes and other system activity. So a single drive is a no-go for a node. Period.

Not all array configurations are redundant, by the way. Yes, you don't need redundancy for the node. But you don't build the system just for the node; you let the node use an existing system built for a different purpose. And virtually all other purposes require some sort of array configuration, even if it's just a cache or a single metadata device (no redundancy). The vast majority of modern HDDs are designed to be used in a RAID config; they are neither reliable enough nor performant enough standalone.

Separately, I don't like to call it "replication", because it isn't; it's erasure coding. But I don't know how to say it better.

I disagree. Unused resources are wasted resources. If there is available RAM, why not use it?

It would be nice to have, obviously, but it's not done at the moment, and I imagine optimizing performance of nodes running on a potato is at the bottom of the priority list. As I said, my storage appliance in the closet barely felt the additional load during the last few days; I just saw more ingress in my router's network stats. In other words, this is not a problem that needs an immediate solution.

Think about it: even if it were implemented, your node would only be able to sustain about 2 MB/s of traffic (see above). Might as well shut it down, because it's useless if the average traffic is, say, 200 MB/s. It's "offline" for most of the traffic anyway.

That's where you shut down the node. I have multiple nodes on the array for this exact purpose: I rent out the space until I need it, then nuke a node to free up space (I used to not bother with graceful exit, but now with the fixed 30-day duration I might, just to be nice).

See above. If the demand is 200 MB/s and your node can only sustain 2 MB/s, it has effectively already packed up and left.

While in general I agree with the statements you make, please allow me to add some commentary for others who do not follow the forum.

The average size of a file stored by a node (i.e. a piece) can be estimated by dividing the average segment size (for the record: currently 7.31 MB) by 29. So the average file on disk now would be about 252 kB. This number depends on customer behavior. Also, as it is an average across all files, old and new, it may not necessarily reflect the average uploaded file size in a given period of time.
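That is:

$$
\frac{7.31\ \text{MB}}{29} \approx 252\ \text{kB}
$$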

I was happy for my design draft to be mentioned recently in Storj's post. For the context of this conversation, it has the potential to reduce the average number of disk writes per file to significantly less than one, coalescing the writes required for many files into a small number of operations. As such, it seems technically feasible to handle large traffic even on low-I/O storage. It has to be underlined that it is a complex proposal that will require a lot of engineering time, and as such is not likely in the short term.
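As a much-simplified illustration of the coalescing idea (my own sketch, not the actual design draft, which also has to track piece offsets and durability guarantees), many incoming pieces could be buffered and flushed to a single append-only file with one write and one fsync:

```go
// Sketch of write coalescing: many Add calls result in a single disk write
// and a single fsync, instead of one file per piece.
package coalesce

import (
	"bytes"
	"os"
	"sync"
)

type Batcher struct {
	mu    sync.Mutex
	buf   bytes.Buffer
	log   *os.File
	limit int // flush once this many bytes are pending
}

func NewBatcher(path string, limit int) (*Batcher, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &Batcher{log: f, limit: limit}, nil
}

// Add buffers one piece; it only touches the disk when the batch is full.
func (b *Batcher) Add(piece []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buf.Write(piece)
	if b.buf.Len() < b.limit {
		return nil
	}
	return b.flushLocked()
}

func (b *Batcher) flushLocked() error {
	if _, err := b.log.Write(b.buf.Bytes()); err != nil {
		return err
	}
	b.buf.Reset()
	return b.log.Sync() // one fsync for the whole batch
}
```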

In this case it's a symptom of a bottleneck, not a healthy use of otherwise idle resources. While it might be a good way to handle very short peaks, it's not a solution to elevated but sustained traffic. In the latter case it eats resources while not solving the actual problem, and hence is useless.

Please note that this is the purpose of the max concurrent uploads switch. It is a quick and dirty solution for now, sure, but it's a sign that solutions of this type would be acceptable to Storj. I hope that the recent work on exposing I/O metrics will lead to a similar mechanism, but one based on actual measurements.
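For reference, a cap like that can be as simple as a counting semaphore that rejects requests outright when full (illustrative Go sketch; the names are mine, not Storj's config keys):

```go
// Sketch of a concurrency cap: reject an upload when too many are in flight.
package limiter

import "errors"

var ErrTooBusy = errors.New("too many concurrent uploads")

type Limiter struct {
	slots chan struct{}
}

func New(max int) *Limiter {
	return &Limiter{slots: make(chan struct{}, max)}
}

// Begin reserves a slot or rejects the request immediately.
func (l *Limiter) Begin() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return ErrTooBusy
	}
}

// End releases the slot when the upload finishes or fails.
func (l *Limiter) End() { <-l.slots }
```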

The node might instead focus on serving downloads. This is, after all, the core purpose of storage. Have the node grow at the rate it can accept, even if it is “just” 2 MB/s.


Allow me to add my 2c:

"One node : one drive" can be perfectly okay. My first node was a test node with a USB 2.0 hard drive, and it still works okay. You can say that ~1.5 MB/s is laughably low, but that equates to roughly 3.9 TB/mo. Set up a fresh node and tell me how long it takes to fill up that much. Way longer than a month. Sure, I run the rest of mine on ZFS arrays, but that's more because I'm using bargain-bin used hard drives that I don't trust not to fail spontaneously.
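For reference, the arithmetic behind that monthly figure, assuming a 30-day month:

$$
1.5\ \text{MB/s} \times 86{,}400\ \text{s/day} \times 30\ \text{days} \approx 3.9\ \text{TB}
$$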

As for read performance, this was a lot more important back when you were paid a lot for egress. With current payouts, it doesn’t matter nearly as much if you lose some egress races.

However, as for CPU usage, the issue where it was spawning thousands of threads is, quite clearly, a bug. Software simply shouldn't be doing that. It seems to be fixed now, as I haven't run into it again. Even if the threads are in iowait rather than actively using the CPU, this can still cause some system slowness.


Thanks @Toyoo, I always learn something from your input (as usual).

Regarding this, it took me some time to realize how absurd it is. Let's say someone has a 16 TB HDD; do they need to provision 24 GB of memory, or even 32 GB? How absurdly expensive has becoming an SNO gotten? I can't…


Because any modern, modest system is already using all "available" RAM for caching and buffers. The OOM killer shouldn't have been needed. And using "available" CPU cycles will grow the power bill and throw off more heat. Agree to disagree, maybe? But I sure hope the devs don't subscribe to your opinion. Anyway, if you don't have any suggestions for me beyond shutting down the node and making fun of my hardware, I'd kindly request that you refrain from offering further "advice" in my thread.

Thanks very much for your contributions and the information; everything is running quite smoothly now that I've implemented it. But it seems the testing has stopped, as I no longer see any indication that the concurrent connection limit I set is being reached.

Yep, that was my plan all along. The node used to be completely full, but it seems a bunch of data was recently purged, quickly followed by the testing.

Well, for now it’s that or an SSD-based caching solution. The latter is often quite a bit cheaper, and potentially useful for other purposes.

LVM SSD caching (lvmcache) would be the cheapest way to increase performance.
With a writeback cache policy, writes land on the SSD first and are flushed to the HDD slowly in the background.