Node goes down/restarts every 10-15 minutes; thread allocation error in logs

I run 3 nodes in Docker containers on this machine. 2 of the 3 are working fine. A couple of days ago I suddenly began having an issue with the oldest (and largest) node: it only stays up for about 10 to 15 minutes before crashing with a thread allocation error. Docker restarts it after about 10 minutes, but then it crashes again after another 10-15 minutes, so I end up in a loop of roughly 10 minutes of uptime followed by 10 minutes of downtime.
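
For reference, the restart loop is easy to watch from the host with standard Docker commands (the container name below is just a placeholder for whatever your node is called):

docker ps                                                                 # STATUS column shows "Up X minutes" or "Restarting"
docker inspect -f '{{.RestartCount}} {{.State.StartedAt}}' storagenode    # restart count and time of the last start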

Relevant log entries:

2024-04-29T12:37:48Z	INFO	piecestore	upload started	{"Process": "storagenode", "Piece ID": "XXX", "Satellite ID": "XXX", "Action": "PUT", "Remote Address": "XXX:35465", "Available Space": 582921609103}
2024-04-29T12:37:48Z	INFO	piecestore	upload started	{"Process": "storagenode", "Piece ID": "XXX", "Satellite ID": "XXX", "Action": "PUT", "Remote Address": "XXX:33830", "Available Space": 582921546127}
2024-04-29T12:37:48Z	INFO	piecestore	upload started	{"Process": "storagenode", "Piece ID": "XXX", "Satellite ID": "XXX", "Action": "PUT", "Remote Address": "XXX:44408", "Available Space": 582921546127}
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0xf7903f m=8655 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: g 0: unknown pc 0xf7903f
stack: frame={sp:0x7897114b2508, fp:0x0} stack=[0x789711492e50,0x7897114b2a50)
0x00007897114b2408:  0x0000000000f80422  0x00007897114b24d0
0x00007897114b2418:  0x0000000000000037  0x0000000000000000
0x00007897114b2428:  0x0000000000000000  0x0000000001d079a0
0x00007897114b2438:  0x0000000001d82bd8  0x0000000000000001
0x00007897114b2448:  0x0000000000000037  0x0000000001cf87e0
0x00007897114b2458:  0x00007897114b2520  0x00007897114b24a8
0x00007897114b2468:  0x0000000000f7b5e2  0x00000000004724e0 <runtime.goexit+0x0000000000000000>
0x00007897114b2478:  0x00000000015a121a  0x00007897114b2490
0x00007897114b2488:  0x0000000000000000  0x0000003000000010
0x00007897114b2498:  0x00007897114b26f0  0x00007897114b2620
0x00007897114b24a8:  0x0000000000000000  0x0000000000000000
0x00007897114b24b8:  0x0000000000000000  0x0000000000000000
0x00007897114b24c8:  0x0000000000000000  0x5f64616572687470
0x00007897114b24d8:  0x6620657461657263  0x52203a64656c6961
0x00007897114b24e8:  0x20656372756f7365  0x7261726f706d6574
0x00007897114b24f8:  0x76616e7520796c69  0x00656c62616c6961
0x00007897114b2508: <0x0000000000f79084  0x00007897114b2588
0x00007897114b2518:  0x0000000000000000  0x0000000000000000
0x00007897114b2528:  0x0000000000f76451  0x0000000000023000
0x00007897114b2538:  0x000078971146d000  0x0000000001d83830
0x00007897114b2548:  0x0000000000f80422  0x0000000001d82bd8
0x00007897114b2558:  0x0000000000f80422  0x0000000001d82bd8
0x00007897114b2568:  0x0000000000000000  0x00007897114b25cf
0x00007897114b2578:  0x0000000000000001  0x0000000000000001
0x00007897114b2588:  0x0000000001cf87e0  0x0000000001cf886c
0x00007897114b2598:  0x000000000000000a  0x0000000000000000
0x00007897114b25a8:  0x0000000001cf87e0  0x00000000015a121a
0x00007897114b25b8:  0x0000000000f745f8  0x0000000000000000
0x00007897114b25c8:  0x0a00000001cf87e0  0x0000000001cf87e0
0x00007897114b25d8:  0x0000000000f79a53  0x0000000001cf87e0
0x00007897114b25e8:  0x00000000015a121a  0x0000789760ab7740
0x00007897114b25f8:  0x0000000000f736ec  0x0000000000462938 <runtime.(*unwinder).resolveInternal+0x0000000000000158>
runtime: g 0: unknown pc 0xf7903f
stack: frame={sp:0x7897114b2508, fp:0x0} stack=[0x789711492e50,0x7897114b2a50)
0x00007897114b2408:  0x0000000000f80422  0x00007897114b24d0
0x00007897114b2418:  0x0000000000000037  0x0000000000000000
0x00007897114b2428:  0x0000000000000000  0x0000000001d079a0
0x00007897114b2438:  0x0000000001d82bd8  0x0000000000000001
0x00007897114b2448:  0x0000000000000037  0x0000000001cf87e0
0x00007897114b2458:  0x00007897114b2520  0x00007897114b24a8
0x00007897114b2468:  0x0000000000f7b5e2  0x00000000004724e0 <runtime.goexit+0x0000000000000000>
0x00007897114b2478:  0x00000000015a121a  0x00007897114b2490
0x00007897114b2488:  0x0000000000000000  0x0000003000000010
0x00007897114b2498:  0x00007897114b26f0  0x00007897114b2620
0x00007897114b24a8:  0x0000000000000000 runtime/cgo: pthread_create failed: Resource temporarily unavailable
 0x0000000000000000
0x00007897114b24b8:  0x0000000000000000  0x0000000000000000
0x00007897114b24c8:  0x0000000000000000  0x5f64616572687470
0x00007897114b24d8:  0x6620657461657263  0x52203a64656c6961
0x00007897114b24e8:  0x20656372756f7365  0x7261726f706d6574
0x00007897114b24f8:  0x76616e7520796c69  0x00656c62616c6961
0x00007897114b2508: <0x0000000000f79084  0x00007897114b2588
0x00007897114b2518:  0x0000000000000000  0x0000000000000000
0x00007897114b2528:  0x0000000000f76451  0x0000000000023000
0x00007897114b2538:  0x000078971146d000  0x0000000001d83830
0x00007897114b2548:  0x0000000000f80422  0x0000000001d82bd8
0x00007897114b2558:  0x0000000000f80422  0x0000000001d82bd8
0x00007897114b2568:  0x0000000000000000  0x00007897114b25cf
0x00007897114b2578:  0x0000000000000001  0x0000000000000001
0x00007897114b2588:  0x0000000001cf87e0  0x0000000001cf886c
0x00007897114b2598:  0x000000000000000a  0x0000000000000000
0x00007897114b25a8:  0x0000000001cf87e0  0x00000000015a121a
0x00007897114b25b8:  0x0000000000f745f8  0x0000000000000000
0x00007897114b25c8:  0x0a00000001cf87e0  0x0000000001cf87e0
0x00007897114b25d8:  0x0000000000f79a53  0x0000000001cf87e0
0x00007897114b25e8:  0x00000000015a121a  0x0000789760ab7740
0x00007897114b25f8:  0x0000000000f736ec  0x0000000000462938 <runtime.(*unwinder).resolveInternal+0x0000000000000158>

goroutine 1 [semacquire, 7 minutes]:
runtime.gopark(0x4?, 0xc000044480?, 0xa0?, 0x85?, 0xb06980?)
	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc0005e3fd0 sp=0xc0005e3fb0 pc=0x43f10e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:404
runtime.semacquire1(0xc000462390, 0x0?, 0x1, 0x0, 0x10?)
	/usr/local/go/src/runtime/sema.go:160 +0x218 fp=0xc0005e4038 sp=0xc0005e3fd0 pc=0x450658
sync.runtime_Semacquire(0x1?)
	/usr/local/go/src/runtime/sema.go:62 +0x25 fp=0xc0005e4070 sp=0xc0005e4038 pc=0x46e2e5
sync.(*WaitGroup).Wait(0xc000390b10?)
	/usr/local/go/src/sync/waitgroup.go:116 +0x48 fp=0xc0005e4098 sp=0xc0005e4070 pc=0x47e7c8
golang.org/x/sync/errgroup.(*Group).Wait(0xc000462380)
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:56 +0x25 fp=0xc0005e40b8 sp=0xc0005e4098 pc=0x91b1a5
storj.io/storj/storagenode.(*Peer).Run(0xc0005dc000, {0x15601b8, 0xc0002bc730})
	/go/src/storj.io/storj/storagenode/peer.go:957 +0x42b fp=0xc0005e4248 sp=0xc0005e40b8 pc=0xd8ac4b
main.cmdRun(0x789760de19b0?, 0xc00020fb00)
	/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:123 +0xd65 fp=0xc0005e4e38 sp=0xc0005e4248 pc=0xea45c5
main.newRunCmd.func1(0x1058ec0?, {0xc00021e270?, 0xc000005200?, 0x447440?})
	/go/src/storj.io/storj/cmd/storagenode/cmd_run.go:33 +0x17 fp=0xc0005e4e58 sp=0xc0005e4e38 pc=0xea3817
storj.io/common/process.cleanup.func1.4({0x1560430?, 0xc0002bfae0})
	/go/pkg/mod/storj.io/common@v0.0.0-20240329051534-e16d36937e83/process/exec_conf.go:393 +0x149 fp=0xc0005e4ee0 sp=0xc0005e4e58 pc=0xab7e49
storj.io/common/process.cleanup.func1(0xc000005200, {0xc000184e10, 0x0, 0x9})
	/go/pkg/mod/storj.io/common@v0.0.0-20240329051534-e16d36937e83/process/exec_conf.go:411 +0x1c88 fp=0xc0005e5bc8 sp=0xc0005e4ee0 pc=0xab7448
github.com/spf13/cobra.(*Command).execute(0xc000005200, {0xc000184d80, 0x9, 0x9})
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983 +0xabc fp=0xc0005e5d68 sp=0xc0005e5bc8 pc=0x5d37bc
github.com/spf13/cobra.(*Command).ExecuteC(0xc000004300)
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115 +0x3ff fp=0xc0005e5e40 sp=0xc0005e5d68 pc=0x5d407f
github.com/spf13/cobra.(*Command).Execute(...)
	/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039
storj.io/common/process.ExecWithCustomOptions(0xc000004300, {0x1, 0x1, 0x1, 0x0, 0x144c4e8, 0xc00022b350})
	/go/pkg/mod/storj.io/common@v0.0.0-20240329051534-e16d36937e83/process/exec_conf.go:112 +0x1c9 fp=0xc0005e5e90 sp=0xc0005e5e40 pc=0xab46e9
main.main()
	/go/src/storj.io/storj/cmd/storagenode/main.go:34 +0x2bf fp=0xc0005e5f40 sp=0xc0005e5e90 pc=0xea657f
runtime.main()
	/usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc0005e5fe0 sp=0xc0005e5f40 pc=0x43ec9b
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0005e5fe8 sp=0xc0005e5fe0 pc=0x4724e1

goroutine 2 [force gc (idle), 2 minutes]:
runtime.gopark(0x20cec756722?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000064fa8 sp=0xc000064f88 pc=0x43f10e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:404
runtime.forcegchelper()
	/usr/local/go/src/runtime/proc.go:322 +0xb3 fp=0xc000064fe0 sp=0xc000064fa8 pc=0x43ef73
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc000064fe8 sp=0xc000064fe0 pc=0x4724e1
created by runtime.init.6 in goroutine 1
	/usr/local/go/src/runtime/proc.go:310 +0x1a

goroutine 3 [GC sweep wait, 2 minutes]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc000065778 sp=0xc000065758 pc=0x43f10e
runtime.goparkunlock(...)
	/usr/local/go/src/runtime/proc.go:404
runtime.bgsweep(0x0?)
	/usr/local/go/src/runtime/mgcsweep.go:321 +0xdf fp=0xc0000657c8 sp=0xc000065778 pc=0x42925f
runtime.gcenable.func1()
	/usr/local/go/src/runtime/mgc.go:200 +0x25 fp=0xc0000657e0 sp=0xc0000657c8 pc=0x41e3a5
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0000657e8 sp=0xc0000657e0 pc=0x4724e1
created by runtime.gcenable in goroutine 1
	/usr/local/go/src/runtime/mgc.go:200 +0x66

(MANY SIMILAR ENTRIES SNIPPED)

goroutine 56934 [select]:
runtime.gopark(0xc06cae4ee0?, 0x3?, 0xfa?, 0xe6?, 0xc06cae4e82?)
	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc06cae4d20 sp=0xc06cae4d00 pc=0x43f10e
runtime.selectgo(0xc06cae4ee0, 0xc06cae4e7c, 0xffffffffffffffff?, 0x0, 0x1?, 0x1)
	/usr/local/go/src/runtime/select.go:327 +0x725 fp=0xc06cae4e40 sp=0xc06cae4d20 pc=0x44f625
storj.io/drpc/drpcmanager.(*Manager).manageStream(0xc0e50b52c0, {0x1560180, 0xc0e50ecde0}, 0xc0e2462900)
	/go/pkg/mod/storj.io/drpc@v0.0.34/drpcmanager/manager.go:332 +0xf1 fp=0xc06cae4f20 sp=0xc06cae4e40 pc=0xaedc71
storj.io/drpc/drpcmanager.(*Manager).manageStreams(0xc0e50b52c0)
	/go/pkg/mod/storj.io/drpc@v0.0.34/drpcmanager/manager.go:321 +0x85 fp=0xc06cae4fc8 sp=0xc06cae4f20 pc=0xaeda25
storj.io/drpc/drpcmanager.NewWithOptions.func2()
	/go/pkg/mod/storj.io/drpc@v0.0.34/drpcmanager/manager.go:122 +0x25 fp=0xc06cae4fe0 sp=0xc06cae4fc8 pc=0xaeca05
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc06cae4fe8 sp=0xc06cae4fe0 pc=0x4724e1
created by storj.io/drpc/drpcmanager.NewWithOptions in goroutine 56932
	/go/pkg/mod/storj.io/drpc@v0.0.34/drpcmanager/manager.go:122 +0x456

rax    0x0
rbx    0x6
rcx    0xf7903f
rdx    0x0
rdi    0x2
rsi    0x7897114b2520
rbp    0x7897114b2520
rsp    0x7897114b2508
r8     0xa
r9     0x1cf886c
r10    0x8
r11    0x246
r12    0x789760ab7740
r13    0x0
r14    0xc0e5445040
r15    0x10
rip    0xf7903f
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

I’ll confess, the server is rather heavily burdened. I did try shutting down the VM that’s also running on the server, but it didn’t help. I’ve tried:

  • Rebooting the server
  • Increasing the kernel thread limit (Ubuntu Server, kernel 6.5.0-28-generic)
  • Deleting the Docker image and recreating it

Nothing seemed to make a difference.

I found this answer from the Go team: cmd/go: pthread_create failed: Resource temporarily unavailable · Issue #51678 · golang/go · GitHub. One thing stood out: could you check whether you have enough memory on the system?
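
A quick way to check (this is just standard tooling, nothing Storj-specific; MemAvailable is the figure that matters more than "free"):

free -m                                # look at the "available" column (MiB)
grep -i memavailable /proc/meminfo     # the same figure straight from the kernel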


Thanks for that. RAM is certainly constrained on the server. There's plenty of swap available, though, and I don't encounter any other memory issues with the other processes running… Still, I might try shutting down the VM for a longer period, just to see if it makes a difference. I'm pretty sure I'd already tried that without success, though.

You said VM, and then Docker? What kind of VM?

Do you use any RAM compression technique, like zram? I haven't had any success with those; they only make things worse (if you use it, that could also be a reason for this).

I saw that you run swap. Yep, I think that's the reason for this; some programs are very sensitive to latency.

The VM is for running Home Assistant in supervisor mode (a custom version of Linux, I believe). That's the primary purpose of the machine I'm running Storj on, besides being a file server and DNS.

I'm not sure if zram is enabled. I'll have a look.
Zram is not enabled. Swap is on an SSD, and I've seen this issue even when the swap is unused.
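
(For anyone curious, a couple of ways to check; zram swap would show up as a /dev/zramN device:)

swapon --show        # lists active swap devices; zram would appear as /dev/zram0
lsmod | grep zram    # no output means the zram module isn't even loaded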

It seems that if latency were the issue, the other nodes would also have a problem, no?

I will only spend a little more time testing before giving up. Payouts were barely worthwhile before they were reduced; now I can't justify hours of troubleshooting the server (and its RAM consumption) for $2/month.

That's true, but it's also a learning experience :call_me_hand:. Speaking seriously, it comes down to this:

EAGAIN is often raised when performing non-blocking I/O. It means “there is no data available right now, try again later” .

So yes, when I/O is slow on the system, Go only retries up to 20 times and then fails, so I think it was because of swap.


I welcome a learning experience that could potentially be applied in the future. But otherwise, I fear it may displace some other more practical memories! XD

Well, I just confirmed that the problem occurs even with many GBs of RAM available (freed up by shutting down the VM).

Did you happen to notice that I snipped out an incredibly large amount of goroutine debug output? It's 35k+ separate threads. Is that normal?


I didn't see the number 35 anywhere; do you mean 56934 goroutines? What CPU are you using, and how many cores? I want to check the context-switching time. Could you run vmstat 1 for a while? cs is the context-switch column.

P.S.: You already said it at the beginning, but I just want to double check: have you configured ulimit and applied it?

P.S. 2: Just to make sure it's not related to I/O, could you try sudo swapoff -a? :slight_smile: It's a bit extreme, but if we can eliminate I/O, it gets easier.


Sorry, the 35k+ threads figure was new info. I ran the debug output through grep (and wc) to count the goroutine entries. 56934 is the total number of goroutines spawned, I guess, but not all of them were still active.
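
Roughly the sort of thing I ran (the container name is a placeholder):

docker logs storagenode 2>&1 | grep '^goroutine ' | wc -l    # count the goroutine headers in the crash dump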

The processor is an AMD GX-420CA with 4 cores. I haven't used the ulimit command, but rather increased the kernel's max thread count directly via /proc:

echo 120000 > /proc/sys/kernel/threads-max
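
(For reference, the sysctl equivalent, which can also be persisted across reboots; same value, just a different way of setting it:)

sudo sysctl -w kernel.threads-max=120000                                         # apply now
echo 'kernel.threads-max = 120000' | sudo tee /etc/sysctl.d/99-threads.conf      # persist
sudo sysctl --system                                                             # reload sysctl configs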

Here’s the vmstat output you asked for (I’ve no idea what’s normal):

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  9 323872 730400 142964 1000992    8    0 102654  1730 77239 25587 23 44  2 31  0
 2 11 323872 670156 144232 1043400   28    0 42652  3000 78420 29358 18 34  4 43  0
 0  9 323872 702748 145316 1003724   20    0 75752  1376 75900 23309 21 45  2 33  0
 0  7 323872 810780 146364 911660    0    0  2100  5948 74648 21798 17 28  3 52  0
 0  8 323872 779100 147672 913764    4    0  2960  1524 73717 20733 21 38  2 39  0
 6  7 323872 723404 148680 980476    8    0 67816  1372 79029 28958 22 35  3 41  0
 4  8 323872 609508 150212 1075532    0    0 95824  3068 76263 24514 22 42  2 34  0
11 13 323872 619576 151080 1063340    0    0 90032  6244 79903 26135 18 44  3 36  0
 0 11 323872 587648 151760 1073260    0    0 10372  5808 78866 28832 21 42  4 34  0
 2 12 323872 639328 152584 1026088    8    0 56044  2244 78246 27108 21 48  2 29  0
 1 11 323872 591172 153540 1062244    0    0 36376  1540 77616 28146 20 34  4 43  0
 2  9 323872 568356 154752 1071512   16    0  9840  2816 76346 25695 20 38  4 38  0
 2 10 323872 558652 155652 1062780    4    0 49736  1008 77988 27139 19 31  2 49  0
 0 10 323872 544736 156792 1066212    4    0 47544  2744 76564 25117 21 46  3 30  0
 0 11 323872 625868 157648 1000636    4    0 36648  4636 78585 27950 18 38  2 43  0
 1  7 323872 562892 158992 1049620    0    0 49724  2744 77030 24183 21 44  2 32  0
 0 18 323872 533652 159876 1061984    4    0 13624  2800 75068 21669 18 40  3 39  0
 2 16 323872 516648 160800 1064348    0    0  1096  2224 72788 18421 21 32  0 47  0
 5  9 323872 440900 161820 1124232   20    0 60984   988 76590 23369 18 28  2 52  0
 3  9 323872 528944 163020 1030684    0    0  8388  5912 71946 25281 20 41  5 34  0
 2  8 323872 597036 164324 969344    0    0 39376  1916 77612 27263 17 40  2 41  0
 0 12 323872 521976 165392 1044916    0    0 75136  3384 75244 26144 21 39  4 36  0
 2 13 323872 488644 166456 1058832    0    0 13960  1644 72640 19797 18 33  3 46  0
 4  8 323872 473336 167556 1060864    0    0  1376  2060 73480 19961 17 21  1 62  0
 4  8 323872 405540 168872 1115148    0    0 55676  5320 78112 29326 23 47  2 28  0
 2  8 323872 367944 170164 1129656    4    0 14353  2620 78277 29815 18 32  6 44  0
 3  9 323872 569848 170800 943880    4    0 46304  2720 76462 24149 24 54  2 21  0
 2 12 323872 530688 171788 965076    4    0 21804  2500 76901 25698 18 31  5 46  0
 0 11 323872 438388 172812 1056844    0    0 92928  1184 77189 23790 21 44  1 33  0
 2 15 323872 365304 173596 1105568    0    0 49276  8728 78504 25643 20 50  1 28  0
 2  6 323872 319048 174184 1134892    0    0 29512   728 72530 20831 22 48  2 28  0
 1  8 323872 391184 173952 1038712    0    0  7944  1197 77349 25770 20 34  9 37  0

No, ulimit relates to how many files a process can open at a time; it's completely different from the /proc/sys/kernel/threads-max setting. You can configure that and try again. EAGAIN is also raised when a process cannot open more files…

P.S.: Just to double check, you can also log into your Storj container and check ulimit there too.
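
Something like this should do it, assuming the container is named storagenode and the image ships a shell:

docker exec storagenode sh -c 'ulimit -a'    # the limits as seen by processes inside the container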


I tried turning off swap, and have seen the container die and respawn a few times already with several GB of free RAM.

I’m going to relaunch with the higher ulimit and will report back.
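
(In case anyone else needs it: I'm raising the limit with docker run's --ulimit flag when recreating the container; the number below is just what I picked, not a recommendation:)

# soft:hard open-file limit for the container, plus the usual node options and image
docker run -d --ulimit nofile=262144:262144 ... (rest of the usual storagenode options and image)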

Okay, I set up a higher ulimit and verified it from within the container:

root@f131d74c7c31:/app# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 28950
max locked memory       (kbytes, -l) 8192
max memory size         (kbytes, -m) unlimited
open files                      (-n) 262144
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

But I left swap turned off, and now it's getting killed by the OOM killer. How much RAM needs to be available for a 4TB node? With the node running, I still have about 3GB of usable RAM:

               total        used        free      shared  buff/cache   available
Mem:            7376        5221         810          11        1343        1847
Swap:              0           0           0
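
(If it helps anyone reading later, OOM kills end up in the host's kernel log and can be confirmed with something like:)

sudo dmesg -T | grep -iE 'out of memory|killed process'    # shows which process the kernel killed and why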

I'm not sure, but let's turn swap back on, run it, and hope; let's see if it still happens. I think you can't cheat your way out of needing RAM by using an SSD. Also, it wears out the SSD really fast; you might want to check your SSD's health right now…
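
Something like this would show the wear and reallocated-sector counters, assuming smartmontools is installed and the SSD is /dev/sda (adjust the device name):

sudo apt install smartmontools    # if it isn't installed yet
sudo smartctl -a /dev/sda         # full SMART report; look at the wear/percentage-used and reallocated-sector attributes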


I am having a similar issue; things chug along, then I get:

runtime: program exceeds 10000-thread limit
fatal error: thread exhaustion

runtime stack:
runtime.throw({0x1940048?, 0x7fcc7e9cc908?})
        /usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0x7fcc7e9cc8c8 sp=0x7fcc7e9cc898 pc=0x43c23c
runtime.checkmcount()
        /usr/local/go/src/runtime/proc.go:802 +0x8e fp=0x7fcc7e9cc8f0 sp=0x7fcc7e9cc8c8 pc=0x44006e
runtime.mReserveID()
        /usr/local/go/src/runtime/proc.go:818 +0x2f fp=0x7fcc7e9cc918 sp=0x7fcc7e9cc8f0 pc=0x4400af
runtime.startm(0xc000093400?, 0x1, 0x0)
        /usr/local/go/src/runtime/proc.go:2616 +0x111 fp=0x7fcc7e9cc968 sp=0x7fcc7e9cc918 pc=0x443091

I get errors like this on a couple of my nodes, one after another. I am going to check my memory; it may be getting low as well.

Let's fix the symptom first: you could try increasing the max thread limit and see whether it breaks anything else.

I have looked around; how do you do that for the Go stuff? It may be that simple, but I couldn't find exactly what I was looking for. All this started last week when the load went way up. I'm glad to have the data, but keeping the node going is an issue now.

It's not Go; this is an operating system setting. But if the issue only popped up recently (on a recent Storj version), you might want to create a separate thread, since someone else might be running into this too.
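
The knobs I mean are the usual kernel and per-user limits; a quick sketch of where to look on the host (the value shown is only an example, and whether the Go runtime's own 10000-thread cap also needs attention is a separate question):

cat /proc/sys/kernel/threads-max             # system-wide thread cap
cat /proc/sys/kernel/pid_max                 # highest PID/TID the kernel will hand out
ulimit -u                                    # per-user process/thread limit in the current shell
sudo sysctl -w kernel.threads-max=120000     # example of raising the system-wide cap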

Yeah, looking around, it seems a lot of people are having issues close to this one. I may just have to be patient and see.