It’s a public forum, so I’m not going to entertain that request.
Node consumes RAM to save the day on potato hardware. It’s better than dropping requests. On non-potato hardware, the node consumes under 2 GB of RAM, and that is pretty stable. It’s not a bug, it’s behavior by design. It’s not an issue on a properly configured disk subsystem.
OOM killers are needed to maintain system stability. They operate on a different level.
Did you read what I wrote above?
Please explain to me (but more importantly, to yourself) how you are planning to serve the needs of multiple customers with a 200 IOPS budget. Not based on just wishful thinking of developers pulling off some magic, but in actual technical words and numbers.
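To save you some time, here is my own back-of-the-envelope version, just as a starting point. All the numbers are assumptions for illustration (roughly 200 random IOPS, a few IOs per piece for metadata plus the data read, and the ~250 kB average piece size quoted further down this thread), not measurements:

```python
# Back-of-the-envelope ceiling for a single HDD serving Storj pieces.
# Assumptions (mine, for illustration): ~200 random IOPS, ~3 IOs per piece
# (directory lookup + inode + data read), average piece size ~250 kB.
HDD_IOPS = 200
IOS_PER_PIECE = 3          # metadata lookups + the data read itself
AVG_PIECE_BYTES = 250_000  # average piece size quoted later in this thread

pieces_per_second = HDD_IOPS / IOS_PER_PIECE
throughput_mb_s = pieces_per_second * AVG_PIECE_BYTES / 1e6

print(f"~{pieces_per_second:.0f} pieces/s, ~{throughput_mb_s:.1f} MB/s ceiling")
# ~67 pieces/s, ~16.7 MB/s -- and that is before the filewalker, garbage
# collection and trash cleanup take their share of the same IO budget.
```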
To reiterate, you either [get storj to] throttle [requests to] your node, making it for all intents and purposes useless (it’s neither interesting nor a high priority to create a whole load-balancing system for potato nodes; not at this stage).
Or you fix your disk subsystem in one of a number of ways discussed in this thread to make it tolerant of high-IO workloads. I’m not going to repeat them here.
The bottom line: if you can only offer 200 IOPS, you might as well shut down the node.
To put it more directly: the whole thread is a non-issue if you set up your hardware correctly. Instead, you expect Storj to work around your improperly configured system.
Braaah, no, you don’t need more memory; my VMs have 4 GB, 1 disk = 1 node. It’s just the RAM’s speed (and the controllers).
When I have 1 VM running, it counts small files in 30 min; when I run 13 of them, it takes like 60-70 h for the same work. Has to be the physical limits of the machine with 4 x 16 GB 3200 MHz RAM, and some optimization in newer controllers can also help! (So I’ve got to try newer versions soon.)
Edit: no, I mean 60-70 h, for filewalking about 6.5 TB of Storj’s current files.
Yeah, 1 machine. I use VPNs because it’s the easiest way to set up all that networking.
(nothing to do on my router)
I’m suggesting how to try to solve the issue which your node has (my three nodes - two on Docker Desktop for Windows, the worst thing I can imagine, and one Windows GUI - do not have it; however, YMMV).
If the disk subsystem is slow, it could end up with more active threads than designed, so you may be able to solve this with this suggestion; it’s up to you.
I would say that I have never used RAID for Storj and my nodes operate normally, even on Windows. Well, two are on Docker Desktop for Windows (what could be worse?) and one is the GUI one.
No big problems so far (except a rare “database is locked”, and seemingly only for the piece expiration database so far); I still have the databases on the data drives.
This is the number of threads (processes) limited by the Linux kernel according to its default configuration; the CPU is not involved.
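If you want to see what those limits are on your own box, here is a quick sketch (assuming Linux and Python 3; which limit actually bites depends on your setup):

```python
# Quick check of the kernel/user limits that cap the number of threads
# (threads count as tasks/processes on Linux).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("per-user process/thread limit (soft, hard):", soft, hard)

with open("/proc/sys/kernel/threads-max") as f:
    print("system-wide threads-max:", f.read().strip())
```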
In this case it’s caused by a slow disk subsystem, unfortunately. This is why we perform a stress test - to show problems in the setups.
Not necessarily. My system has 32 GB, which is barely used (~10 TB stored). Everything depends on how fast your disk subsystem can operate. My disks are connected via SATA and I do not use any RAID or caching device (on Windows it’s close to useless unless you use some advanced cache software).
I use Windows specifically because most Operators are using it. I had a Linux one on a Pi3, but the SD card died and I’m too far away from the installation to fix this simple issue. However, it worked perfectly fine even with 1 GB of RAM.
If you store data on an HDD you will always have a ~200 IOPS limit. You can buffer somewhere in front of it, but eventually you will blow through your buffer. I think nobody stores the data on SSDs (yet).
And that’s precisely the point. You have a hard ceiling with an HDD, and therefore need to be wise in how you spend this limited HDD IO budget; this requires some configuration beyond plopping the disk into a Raspberry Pi and hoping for the best, otherwise it’s just a waste of a good HDD.
for example:
sending all small bits of data, including metadata, to a small SSD, leaving the HDD to handle the large-ish blobs. This will offload the majority of IOPS and may be sufficient on its own. The goal is to make seek time small compared to data transfer time, so the disk stays busy transferring data instead of moving heads around most of the time. I would start there (see the rough numbers after this list).
batching and coalescing writes into some sort of transaction groups (ZFS does this). This further reduces IO by combining multiple writes into a single request.
doing all the other filesystem tweaking discussed here multiple times
avoiding databases sending IO to the data disk, if for some reason they are still there.
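To put rough numbers on the first point: the win comes from spending fewer random IOs (seeks) per piece. The figures below are assumptions for illustration (8 ms per random IO, 150 MB/s sequential throughput, ~250 kB average piece), not measurements:

```python
# Rough model: each piece read costs some number of random IOs (seeks)
# plus the sequential transfer of the piece itself.
# All numbers below are assumptions for illustration, not measurements.
SEEK_MS = 8.0            # average seek + rotational latency per random IO
TRANSFER_MB_S = 150.0    # sequential throughput of the HDD
PIECE_KB = 250.0         # average piece size (see the estimate further down)

transfer_ms = PIECE_KB / 1024 / TRANSFER_MB_S * 1000  # ~1.6 ms per piece

def pieces_per_second(ios_per_piece):
    return 1000 / (ios_per_piece * SEEK_MS + transfer_ms)

# 3 random IOs per piece (metadata still on the HDD) vs 1 (metadata on SSD)
print(f"metadata on HDD: ~{pieces_per_second(3):.0f} pieces/s")
print(f"metadata on SSD: ~{pieces_per_second(1):.0f} pieces/s")
```

Under these assumptions the same HDD goes from roughly 39 to roughly 104 pieces per second just by keeping metadata lookups off the spinning disk.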
First, estimate the number of pieces you expect to have.
You can look at the average segment size here. As of writing this post this is 7.25 MB.
Divide it by 29 (the number of pieces required to reconstruct a segment) to get average piece size. We get 250 kB.
Divide your allocated disk space by the average piece size. For 1 TB this is 4M pieces expected.
Now, estimate the amount of RAM you need per file. These numbers depend a lot on your software stack, as any additional layer (like storage spaces, RAIDs, whatever) adds its own requirements. Assuming you have a file system set up directly on a raw partition of a single HDD, this would be:
For default ext4 this is around 300 bytes (inode + direntry data structures).
For default NTFS this is probably around 1kB.
You multiply the number of pieces expected by the amount of RAM you need per file to get the estimate.
Remember, this is an estimate of the amount of RAM that needs to be free after your OS, the node itself (assume less than 0.5 GB), and all other software running on the same system take their chunks.
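Putting those steps together as a quick calculation for a 1 TB allocation (using the segment size and per-file costs quoted above; adjust them for your own setup):

```python
# Worked example of the estimate above, for a 1 TB allocation.
AVG_SEGMENT_BYTES = 7.25e6      # average segment size at the time of writing
PIECES_PER_SEGMENT = 29         # pieces needed to reconstruct a segment
ALLOCATED_BYTES = 1e12          # 1 TB allocated to the node

avg_piece_bytes = AVG_SEGMENT_BYTES / PIECES_PER_SEGMENT   # ~250 kB
expected_pieces = ALLOCATED_BYTES / avg_piece_bytes        # ~4 million

for fs, bytes_per_file in (("ext4", 300), ("NTFS", 1000)):
    ram_gb = expected_pieces * bytes_per_file / 1e9
    print(f"{fs}: ~{ram_gb:.1f} GB of free RAM for metadata caching")
# ext4: ~1.2 GB, NTFS: ~4.0 GB -- on top of the OS, the node itself
# (< 0.5 GB) and whatever else runs on the same box.
```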
The idea being that the amount of unused RAM available for caching shall be enough to fit the metadata.
How much exactly depends on your filesystem. Metadata sizes differ, the granularity of caching differs, and caching implementations differ. Some filesystems don’t take advantage of all available RAM at all (e.g. NTFS).
All I can say is that for a 10 TB node on ZFS, 8 GB is probably too little, 128 GB is definitely overkill, and 32 GB seemed OK; I saw almost all metadata fetches come from ARC.
I understand that a slower node is likely going to have a higher load average, as it will take longer for requests to be serviced, so there will be more requests in flight at any given moment.
However, why not just set a reasonable default for max-concurrent-requests? I can’t think of any good reason for there to be thousands of requests in flight on a given node, because the node would just lose those races anyway and the SNO wouldn’t earn anything from them. If anything, it’s just going to spiral downwards, where the increased load causes it to lose even more races.
Edit: apparently it used to have a default, but it was removed at some point? Why? There can’t possibly be a valid use case for allowing more than 1000 concurrent requests.
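For context, the behavior being asked for here is just a bounded gate: accept up to a cap, reject the rest immediately rather than queueing them. If I remember correctly, the knob is storage2.max-concurrent-requests in the node’s config.yaml, with 0 meaning unlimited. Below is only a conceptual sketch of such a gate, not the storagenode’s actual implementation, and the cap of 40 is a made-up example:

```python
# Minimal sketch of a "max concurrent requests" gate: accept up to a cap,
# reject anything above it immediately instead of queueing it.
# Illustrative only; not how the storagenode implements it.
import threading

class ConcurrencyGate:
    def __init__(self, limit: int):
        self._sem = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False when the node is already "full".
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()

gate = ConcurrencyGate(limit=40)  # hypothetical cap

def handle_request(serve):
    if not gate.try_acquire():
        return "rejected: too many concurrent requests"
    try:
        return serve()
    finally:
        gate.release()
```

Rejecting above the cap is what keeps a slow disk from accumulating thousands of doomed transfers and spiralling further down.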