Sorry, but my experience over several days doesn’t match. I experimented with `docker run -m "1024m" --memory-reservation="768m"`. What I saw is that the node approaches the 1 GB limit, then sits at 99-99.9% of the assigned RAM for a while, with no indication of a problem in the logs and no crashes. I even tried reducing the limit further to 768 MB; a few hours later the node was still running, sitting at 99% RAM use. So it looks like Docker enforces the RAM constraint by denying the node software’s requests for more memory, not by simply killing the container. That’s just my own observation though, I’m no Docker expert.
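For context, here is roughly what the full command looked like. Everything other than the two memory flags is the usual boilerplate from the setup docs, with placeholder wallet, address, and paths, so substitute your own values:

```sh
# Usual storagenode run command, plus a hard memory cap of 1 GiB and a
# soft reservation of 768 MiB. Everything except the two memory flags is
# placeholder boilerplate -- adjust to your own node.
docker run -d --restart unless-stopped --stop-timeout 300 \
  -m "1024m" --memory-reservation="768m" \
  -p 28967:28967/tcp -p 28967:28967/udp -p 127.0.0.1:14002:14002 \
  -e WALLET="0x..." -e EMAIL="you@example.com" \
  -e ADDRESS="your.host.example:28967" -e STORAGE="4TB" \
  --mount type=bind,source=/mnt/storj/identity,destination=/app/identity \
  --mount type=bind,source=/mnt/storj/storagenode,destination=/app/config \
  --name storagenode storjlabs/storagenode:latest
```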
On this 4TB node (the one with the 2.5" USB SMR HDD), RAM only gets filled up when the node reports free space available. If I set `-e STORAGE="x.xTB"` with x slightly smaller than the amount already used, RAM consumption is much, much lower. So it looks like this puny 2.5" SMR drive just can’t handle serving download and upload requests at the same time, even with 2 other nodes online reporting free space: RAM fills up and the logs show many (mostly) lost races. But with the Docker memory constraint it’s at least stable (and doesn’t push my server into swap). This is why I’m so adamant that it belongs in the main documentation, too.
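To pick a value just below what’s already stored, I check the used space on disk; the piece data should be under storage/blobs in the node’s config directory (the path below is my mount, adjust it to yours):

```sh
# Rough check of how much is already stored (adjust the path to your mount).
du -sh /mnt/storj/storagenode/storage/blobs
# If that shows e.g. 3.4T on this 4TB node, restarting the container with
#   -e STORAGE="3.3TB"
# makes the node report no free space, so it stops accepting new uploads.
```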
I see these options for myself:
- keep running like this, losing most races
- reduce the STORAGE parameter so no free space is reported
- take the node down
- migrate to a faster disk
- bring more nodes online to share the load
The last two options would be better for Storj, but I’m not inclined to commit further resources as long as the node software remains “ungraceful” about handling overload. The juice was barely worth the squeeze even before the stress testing started triggering problems on my server.
This looks like a great idea! It certainly did the trick for me (see above). No protocol change needed; it would just require a way to identify the overloaded state (too high a percentage of recent races lost?).
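For the “% of recent races lost” part, something like the following works as a rough local check. It assumes the node logs “uploaded” for won races and “upload canceled” for lost ones, which matches what I see in my logs, but the exact messages may differ between versions, so adjust the patterns:

```sh
# Count won vs. lost upload races over the last hour of the container log.
# Assumes "uploaded" marks completed pieces and "upload canceled" marks lost
# races; change the grep patterns to whatever your log actually shows.
SINCE=1h
WON=$(docker logs --since "$SINCE" storagenode 2>&1 | grep -c 'uploaded')
LOST=$(docker logs --since "$SINCE" storagenode 2>&1 | grep -c 'upload canceled')
TOTAL=$((WON + LOST))
if [ "$TOTAL" -gt 0 ]; then
  echo "lost $LOST of $TOTAL upload races ($((100 * LOST / TOTAL))%)"
else
  echo "no upload activity in the last $SINCE"
fi
```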