Node goes down/restarts every 10-15 minutes; thread allocation error in logs

I’m having the same issue today: CPU and RAM are spiking, and the node stops working every 15-20 minutes before restarting itself. The same node has been running without issue for a few years now, and nothing has changed on my end.

1 Like

Do you happen to have logs (only the error/panic entries are of interest)? Try this command: docker logs {storagenode_id}. Also cc @Alexey.
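
If the full log is too large to share, a quick filter usually captures just the error and panic output; a sketch, assuming the container is named storagenode (substitute your own container name or ID):

```
# Keep only ERROR/FATAL lines and any Go runtime panic output; show the last 200 matches.
docker logs storagenode 2>&1 | grep -E "ERROR|FATAL|fatal error|panic" | tail -n 200
```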

Try setting max-concurrent-requests to a finite number, maybe 75. For me it stopped the storagenode from spawning too many processes.
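
In case it helps, a sketch of where that setting lives, assuming the usual docker setup and the storage2.max-concurrent-requests key:

```
# Option 1: pass it as a flag after the image name in your docker run command
docker run -d ... storjlabs/storagenode:latest --storage2.max-concurrent-requests=75

# Option 2: set it in config.yaml in the node's config directory, then restart
#   storage2.max-concurrent-requests: 75
docker restart -t 300 storagenode
```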

2 Likes

What would be a good log level? Mine is set to info, and from a fresh start to the point it crashes (fairly soon), it generates about 60 MB of logs. I can zip them up for someone to look at; most of it just looks like tons of errors, but here is roughly what it looks like (a sketch for lowering the log level follows the excerpt). I have 9 nodes running on the same machine under Docker, with plenty of RAM (still 32 GB free) and CPU, although the machine does a lot of other things too. I didn’t have this problem until this week, or around Friday of last week.

2024-04-29T22:42:15Z    INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "27JCSGYBVBRWYETDTFBJKNOUQ6VCKH7MVCENJ42GIZBEGLEF7NMQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "192.210.226.114:54367", "Available Space": 9843802523796}
2024-04-29T22:42:15Z    INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "HESXFRBHGXB7VCPXCFSPI4IJCBNPBWW54DBDKISRQSCSB3UIYYCQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "192.210.226.114:44822", "Available Space": 9843802397844}
2024-04-29T22:42:15Z    INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "OA76SHD4WFX37Y3RUTAR46672MU4QE3VRDN34EAQPRO67XI5QJGA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "192.210.226.114:54598", "Available Space": 9843802334868}
2024-04-29T22:42:15Z    INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "E36A6DHGDHBS73URHO3XOZVNSJ4G4563AQXBXBWOUEWG3PZ6XPLA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "192.210.226.114:31226", "Available Space": 9843802334868}
2024-04-29T22:42:15Z    INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "MYUTCHSEAFWR4E7X4IEH4DN25SAHFOTR2LC3ENR7TNFYPUY5KCXQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "192.210.226.114:54610", "Available Space": 9843802271892}
runtime: program exceeds 10000-thread limit
fatal error: thread exhaustion

runtime stack:
runtime.throw({0x11f34b9?, 0x7f7387e66908?})
        /usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0x7f7387e668c8 sp=0x7f7387e66898 pc=0x43c21c
runtime.checkmcount()
        /usr/local/go/src/runtime/proc.go:802 +0x8e fp=0x7f7387e668f0 sp=0x7f7387e668c8 pc=0x44004e
runtime.mReserveID()
        /usr/local/go/src/runtime/proc.go:818 +0x2f fp=0x7f7387e66918 sp=0x7f7387e668f0 pc=0x44008f
runtime.startm(0xc000077400?, 0x1, 0x0)
        /usr/local/go/src/runtime/proc.go:2616 +0x111 fp=0x7f7387e66968 sp=0x7f7387e66918 pc=0x443071
runtime.wakep()
        /usr/local/go/src/runtime/proc.go:2760 +0xec fp=0x7f7387e66998 sp=0x7f7387e66968 pc=0x44364c
runtime.resetspinning()
        /usr/local/go/src/runtime/proc.go:3466 +0x3e fp=0x7f7387e669b8 sp=0x7f7387e66998 pc=0x4451fe
runtime.schedule()
        /usr/local/go/src/runtime/proc.go:3600 +0x10f fp=0x7f7387e669f0 sp=0x7f7387e669b8 pc=0x44562f
runtime.park_m(0xc1224d1860?)
        /usr/local/go/src/runtime/proc.go:3745 +0x11f fp=0x7f7387e66a38 sp=0x7f7387e669f0 pc=0x445adf
traceback: unexpected SPWRITE function runtime.mcall
runtime.mcall()
        /usr/local/go/src/runtime/asm_amd64.s:458 +0x4e fp=0x7f7387e66a50 sp=0x7f7387e66a38 pc=0x4704ee

goroutine 1 [semacquire, 16 minutes]:
runtime.gopark(0x4?, 0xc0008c6150?, 0x60?, 0x63?, 0xb06980?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc00012bfd0 sp=0xc00012bfb0 pc=0x43f10e
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.semacquire1(0xc00054a290, 0x0?, 0x1, 0x0, 0x28?)
        /usr/local/go/src/runtime/sema.go:160 +0x218 fp=0xc00012c038 sp=0xc00012bfd0 pc=0x450658
sync.runtime_Semacquire(0x1?)
        /usr/local/go/src/runtime/sema.go:62 +0x25 fp=0xc00012c070 sp=0xc00012c038 pc=0x46e2e5
sync.(*WaitGroup).Wait(0xc00051e828?)
        /usr/local/go/src/sync/waitgroup.go:116 +0x48 fp=0xc00012c098 sp=0xc00012c070 pc=0x47e7c8
golang.org/x/sync/errgroup.(*Group).Wait(0xc00054a280)
        /go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:56 +0x25 fp=0xc00012c0b8 sp=0xc00012c098 pc=0x91b1a5
storj.io/storj/storagenode.(*Peer).Run(0xc0001b8840, {0x15601b8, 0xc0007d41e0})
        /go/src/storj.io/storj/storagenode/peer.go:957 +0x42b fp=0xc00012c248 sp=0xc00012c0b8 pc=0xd8ac4b
main.cmdRun(0x7f73e4fb82c0?, 0xc000255b00)
        /go/src/storj.io/storj/cmd/storagenode/cmd_run.go:123 +0xd65 fp=0xc00012ce38 sp=0xc00012c248 pc=0xea45c5
main.newRunCmd.func1(0x1058ec0?, {0xc000264270?, 0xc000005200?, 0x447440?})
        /go/src/storj.io/storj/cmd/storagenode/cmd_run.go:33 +0x17 fp=0xc00012ce58 sp=0xc00012ce38 pc=0xea3817
storj.io/common/process.cleanup.func1.4({0x1560430?, 0xc00030bae0})
        /go/pkg/mod/storj.io/common@v0.0.0-20240329051534-e16d36937e83/process/exec_conf.go:393 +0x149 fp=0xc00012cee0 sp=0xc00012ce58 pc=0xab7e49
storj.io/common/process.cleanup.func1(0xc000005200, {0xc000328340, 0x0, 0xd})
        /go/pkg/mod/storj.io/common@v0.0.0-20240329051534-e16d36937e83/process/exec_conf.go:411 +0x1c88 fp=0xc00012dbc8 sp=0xc00012cee0 pc=0xab7448
github.com/spf13/cobra.(*Command).execute(0xc000005200, {0xc000328270, 0xd, 0xd})
        /go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983 +0xabc fp=0xc00012dd68 sp=0xc00012dbc8 pc=0x5d37bc
github.com/spf13/cobra.(*Command).ExecuteC(0xc000004300)
        /go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff fp=0xc00012de40 sp=0xc00012dd68 pc=0x5d407f
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions(0xc000004300, {0x1, 0x1, 0x1, 0x0, 0x144c4e8, 0xc000269850})
        /go/pkg/mod/storj.io/common@v0.0.0-20240329051534-e16d36937e83/process/exec_conf.go:112 +0x1c9 fp=0xc00012de90 sp=0xc00012de40 pc=0xab46e9
main.main()
        /go/src/storj.io/storj/cmd/storagenode/main.go:34 +0x2bf fp=0xc00012df40 sp=0xc00012de90 pc=0xea657f
runtime.main()
        /usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc00012dfe0 sp=0xc00012df40 pc=0x43ec9b
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc00012dfe8 sp=0xc00012dfe0 pc=0x4724e1

goroutine 2 [force gc (idle), 3 minutes]:
runtime.gopark(0x24473ceb386b1f?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc0000aafa8 sp=0xc0000aaf88 pc=0x43f10e
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:404
runtime.forcegchelper()

**…tons of this type of stuff, then after a while…**

goroutine 255085 [select]:
runtime.gopark(0xc0b24cf790?, 0x2?, 0x32?, 0xa9?, 0xc0b24cf75c?)
        /usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc0b24cf600 sp=0xc0b24cf5e0 pc=0x43f10e
runtime.selectgo(0xc0b24cf790, 0xc0b24cf758, 0xc0f2f4a390?, 0x0, 0x8d5f18?, 0x1)
        /usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc0b24cf720 sp=0xc0b24cf600 pc=0x44f625
storj.io/drpc/drpcmanager.(*Manager).manageStreams(0xc109e6ef00)
        /go/pkg/mod/storj.io/drpc@v0.0.34/drpcmanager/manager.go:319 +0x11d fp=0xc0b24cf7c8 sp=0xc0b24cf720 pc=0xaedabd
storj.io/drpc/drpcmanager.NewWithOptions.func2()
        /go/pkg/mod/storj.io/drpc@v0.0.34/drpcmanager/manager.go:122 +0x25 fp=0xc0b24cf7e0 sp=0xc0b24cf7c8 pc=0xaeca05
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0b24cf7e8 sp=0xc0b24cf7e0 pc=0x4724e1
created by storj.io/drpc/drpcmanager.NewWithOptions in goroutine 255083
        /go/pkg/mod/storj.io/drpc@v0.0.34/drpcmanager/manager.go:122 +0x456
2024-04-29T22:56:09Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-04-29T22:56:09Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.102.3"}
2024-04-29T22:56:09Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-04-29T22:56:09Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.102.3"}
2024-04-29T22:56:09Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}
2024-04-29 22:58:44,579 INFO exited: storagenode (exit status 2; expected)
2024-04-29 22:58:45,590 INFO spawned: 'storagenode' with pid 10204
2024-04-29 22:58:45,590 WARN received SIGQUIT indicating exit request
2024-04-29 22:58:45,591 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2024-04-29T22:58:45Z    INFO    Got a signal from the OS: "terminated"  {"Process": "storagenode-updater"}
2024-04-29 22:58:45,593 INFO stopped: storagenode-updater (exit status 0)
2024-04-29 22:58:45,594 INFO stopped: storagenode (terminated by SIGTERM)
2024-04-29 22:58:45,594 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)
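
On the log-level question: if the INFO upload/download chatter makes the crash output hard to spot, a minimal sketch for dialing it down, assuming the standard log.level option:

```
# In config.yaml:
#   log.level: warn
# or as a flag after the image name in docker run:
docker run -d ... storjlabs/storagenode:latest --log.level=warn
```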

I’m leaving swap disabled for the moment, but it really wasn’t being heavily used; it was just there as a backstop, in the hope of preventing the OOM killer from killing processes during rare moments of extra memory pressure.

To be clear, this machine is only using <3GB out of 8GB RAM while all services are running, including the VM I mentioned in the original post, and the Storj node I’m having problems with.

So why does the Storj container suddenly need more than 4 GB of additional RAM, and 35,000+ simultaneous goroutines? The way this issue popped up out of nowhere, with no configuration changes on the server, leads me to think there could be an issue with the latest Storj node software, or perhaps the satellite is being too aggressive.
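
For anyone who wants to watch the thread count climb toward the 10000-thread limit, a rough sketch with standard tooling, assuming the container is named storagenode:

```
# List the processes inside the container along with their host PIDs.
docker top storagenode

# Read the OS thread count for the storagenode binary's PID shown above;
# the Go runtime aborts with "thread exhaustion" once this passes 10000.
grep Threads /proc/<PID>/status
```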

I’ve tried what @Ambifacient suggested, and with max-concurrent-requests at 75 I’ve got an uptime of 25 minutes so far, which is the longest it has run in days. :+1: Of course, the logs are full of upload rejections, but that’s better than spending 50% of the time offline and tying up my CPU and RAM with whatever debug output takes roughly 10 minutes to complete! My server is much happier now.

I might try increasing that number bit by bit until I see the problem again, then back off by 10%.

I had tried 100 and 200, and at those levels my disk IOPS could not handle the incoming requests, so my average ingress decreased. The system load was roughly linear in the number of max requests, so I expect you’ll see something similar. Not exactly a node crash, but high load, and some piece races will be lost. If you are using the Grafana dashboard, you can track your average bandwidth and tune to that; for me, around 75 maximizes it.

Is there anywhere to read the current number of connections in use? I am going to set this on the nodes that are having the issues and see as well.

If you set the max concurrent requests, it’ll tell you the active count whenever it’s exceeded, for what that’s worth.
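
For a rough sense of how often the limit is being hit, a sketch assuming your rejection lines contain the phrase "upload rejected", as quoted elsewhere in this thread:

```
# Count rejected uploads in the whole log.
docker logs storagenode 2>&1 | grep -c "upload rejected"

# Show the most recent rejections from the last 10 minutes.
docker logs --since 10m storagenode 2>&1 | grep "upload rejected" | tail -n 5
```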

And if anyone at Storj is paying attention ( @Alexey ?), maybe this is a consequence of too many nodes dropping off after the reduced payouts, combined with the recent surge in traffic? And maybe a software problem/solution…

From my side, it’s not really worth troubleshooting problems with the server. The whole idea was to make use of unused storage resources, not to push my server to the point where it runs out of RAM and CPU. Payouts aren’t at a level that justifies server instability or spending my time on “learning experiences”.

2 Likes

Yeah, I saw that. I set it to 300 and I’m still maxing that out, but it hasn’t crashed either. I’m still getting upload rejections at 300, but if it stays running, that’s better than the crashing.

1 Like

My update: setting max-concurrent-requests to 75 seems to be a decent workaround. Before, the node never went more than 15 minutes without running out of threads (or running out of RAM and getting killed by the OOM killer once I turned off swap); now I’ve just passed an hour of uptime. I guess everyone will come to a different optimum setting, but once I find mine, I’ll share it here. I’ve since jumped to 175 and still get lots of “upload rejected”. There’s still plenty of RAM, CPU, and bandwidth, so I’ll probably double it again and see what happens.

@Ambifacient, watching Grafana is a nice idea! I’ve been meaning to set it up to monitor solar production; I just don’t have time to play with it at the moment. But thanks very much for sharing your workaround for this frustrating issue! I won’t mark it as the solution just yet, as it doesn’t appear to address the underlying problem.

You may try to increase it: docker run | Docker Docs

I notified the team.

I set max-concurrent-requests to 75 as well. The node has been stable for 3 hours now, but it’s still using 6 GB of RAM and 30-50% CPU.

2 Likes

My issues began on 2024-04-26 at 19:24:17 CST. Since you marked this as a “solution”, I assume that means your synthetic testing was the trigger?

I signed up to share unused storage capacity. If the HDDs are already spinning, might as well put them to use, right? But if it comes at the expense of overburdening my server’s CPU and RAM, I’ll have to reconsider whether it makes sense to continue as a node operator, especially in light of the drastic reduction in payouts.

I’m very pleased that Storj has met with some sales success! But I hope, for your sake and mine, that you’ll make every effort to respect the CPU and RAM resources of your node operators, realizing that the reduced payouts don’t really justify anything more than a hands-off approach for those of us who aren’t “big fish”. Maybe you could’ve implemented your testing a bit less aggressively, monitored performance metrics, and tuned your software, instead of whatever it is you’ve done.

Here’s hoping that the expansion goes smoothly. :crossed_fingers:

2 Likes

Anecdotally, it does not “overburden” CPU or RAM. I did not notice any difference: CPU is almost idle, and RAM usage stayed the same.

I did notice slightly higher power consumption due to higher disk activity, but that’s not unexpected. More IOPS means more power burnt moving the heads. (Before anyone asks: I was looking when there was no activity other than the storagenode.)

If you see the storagenode consuming more RAM, that means your storage configuration can benefit from optimization.
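
A quick way to keep an eye on that per container, using standard docker tooling (container name assumed to be storagenode):

```
# Live memory and CPU usage for the node container.
docker stats storagenode
```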

2 Likes

Are you suggesting that your software needs more than 35k active threads? I joined Storj to share excess storage, not to have my server subjected to a DoS attack. Please forgive my bluntness, but I hope that you (and the team) will be more careful with your testing and the software that I allow you to run on my server, unless it’s your wish to kill off the smaller nodes and leave only the operators who have a large vested interest in dedicating large amounts of RAM, CPU, and troubleshooting time.

1 Like

I checked back on my nodes; some of the ones with swap also had intermittent downtime and auto-restarts (I disabled swap and the problem seems to have gone away, so I didn’t bother to check the logs):

3a0122da2cd8 storjlabs/storagenode:latest "/entrypoint" 4 weeks ago Up 12 hours
bbc8961f0c3e storjlabs/storagenode:latest "/entrypoint" 7 days ago Up 14 hours

So this claim does have some merit…

I appreciate your sharing your story. My anecdote is different: it takes my server from under 50% RAM and 20% CPU usage to nearly 100% CPU with the OOM killer coming into play, maybe related to having 35,000+ active threads. Perhaps my puny little server isn’t capable of handling all the test data being thrown at it? The node in question has a dedicated drive, but for whatever reason (a debug data dump, I presume), the RAM and CPU are out of control during the roughly 10 minutes of downtime. And considering it would only stay up for 10 minutes at a time, my server was basically in dire straits 50% of the time.

With respect, for the amount of payment, it’s very hard to imagine spending a lot of time and effort optimizing.

1 Like

That’s the problem. The disk can’t handle the IO rate, so the node buffers data in RAM and then gets killed.
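
A sketch for confirming whether the disk is the bottleneck, assuming the sysstat package is installed (tool and column names vary slightly by distro):

```
# Extended per-device statistics every 5 seconds.
# A %util pinned near 100 with long await times points at the disk, not the node.
iostat -x 5
```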

Correct. Your node failed the test. Good that this was a test and not actual customer traffic. Fix the disk subsystem. Start with the low-hanging fruit: atime updates, the databases, the filesystem. Add RAM for caching.
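
For the atime part specifically, a minimal sketch of disabling access-time updates on the storage mount; the device and mount point here are placeholders, adjust them to your setup:

```
# Remount immediately without access-time updates to cut metadata writes.
sudo mount -o remount,noatime /mnt/storj

# Make it persistent across reboots in /etc/fstab, e.g.:
#   /dev/sdb1  /mnt/storj  ext4  defaults,noatime  0  2
```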

You may buy yourself some time, but a single drive for Storj is a no-go going forward once traffic picks up.

Sorry, this sounds like lunacy. So the node will grab as much RAM as it can by default?

Guess I’m out, then.

1 Like