Sudden "ERROR gracefulexit:chore error retrieving satellites." restarts node

Hi,
my node has been working great, but suddenly it restarted its storagenode.exe (Windows GUI, 1.96.6) due to a FATAL error about graceful exit. I never started any graceful exit, so why is this happening?

2024-03-17T07:10:21+01:00	INFO	piecestore	download started	{"Piece ID": "W5P3XTBYYWLL5B3PBHO4L3DEG3LC55I5TW57LOSHWFWSJCULKQNA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 7936, "Remote Address": "79.127.205.227:56544"}
2024-03-17T07:10:22+01:00	INFO	piecestore	downloaded	{"Piece ID": "W5P3XTBYYWLL5B3PBHO4L3DEG3LC55I5TW57LOSHWFWSJCULKQNA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 7936, "Remote Address": "79.127.205.227:56544"}
2024-03-17T07:10:28+01:00	INFO	piecestore	download started	{"Piece ID": "2GBME22WVZTU5PUZCPJPEG4EKKH73VK5YDRMDULDSQRVZMV6KX2A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 288000, "Remote Address": "79.127.201.209:56564"}
2024-03-17T07:10:29+01:00	INFO	piecestore	downloaded	{"Piece ID": "2GBME22WVZTU5PUZCPJPEG4EKKH73VK5YDRMDULDSQRVZMV6KX2A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 288000, "Remote Address": "79.127.201.209:56564"}
2024-03-17T07:10:41+01:00	INFO	piecestore	download started	{"Piece ID": "NPRTS7LF42QV5NYVI4SI77ZP6MKCGAMSXFIFNT7EPPT3YEPAR5IA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 1369856, "Remote Address": "79.127.201.210:60262"}
2024-03-17T07:10:45+01:00	DEBUG	orders	cleaning
2024-03-17T07:10:58+01:00	ERROR	gracefulexit:chore	error retrieving satellites.	{"error": "satellitesdb: database is locked", "errorVerbose": "satellitesdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:192\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2024-03-17T07:10:58+01:00	ERROR	services	unexpected shutdown of a runner	{"name": "forgetsatellite:chore", "error": "database is locked"}
2024-03-17T07:11:02+01:00	ERROR	orders	cleaning DB archive	{"error": "ordersdb: context canceled", "errorVerbose": "ordersdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersDB).CleanArchive:325\n\tstorj.io/storj/storagenode/orders.(*Service).CleanArchive:157\n\tstorj.io/storj/storagenode/orders.(*Service).Run.func2:141\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2024-03-17T07:11:02+01:00	INFO	piecestore	downloaded	{"Piece ID": "NPRTS7LF42QV5NYVI4SI77ZP6MKCGAMSXFIFNT7EPPT3YEPAR5IA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 1369856, "Remote Address": "79.127.201.210:60262"}
2024-03-17T07:11:02+01:00	FATAL	Unrecoverable error	{"error": "database is locked"}
2024-03-17T07:13:03+01:00	INFO	Configuration loaded	{"Location": "C:\\Program Files\\Storj\\Storage Node\\config.yaml"}
2024-03-17T07:13:03+01:00	INFO	Anonymized tracing enabled
2024-03-17T07:13:03+01:00	DEBUG	tracing collector	started
(...)

It had about ~260 h of uninterrupted running time, and then this happened.
The node has no history of any FATAL other than this one.
The log was in DEBUG mode; that's all I've got.
Normally, after the orders cleaning, it logs
"DEBUG orders cleanup finished {"Items delete": 0}"
but this time it was "ERROR gracefulexit:chore error retrieving satellites."
Why?

Firstly, the gracefulexit chore runs whether or not you’ve asked for a graceful exit; it’s just periodically checking whether it’s supposed to be gracefully exiting or not.

And it’s not even responsible for the shutdown. That error just gets logged and the system tries again later. Same with the error from the orders subsystem.

But the underlying problem caused the same error to occur in the forgetsatellite chore, which periodically checks whether the node has been asked to delete everything related to a given satellite. Whether your node was asked to forget a satellite doesn’t matter here; the chore couldn’t even get the list of satellites it knew about, and that is what caused the shutdown.
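
Not the actual storagenode code, but a minimal Go sketch of the pattern both chores follow (listPendingExits here is a hypothetical stand-in for the ListPendingExits database call in the stack trace above): wake up on a timer, ask the database whether anything is pending, and do nothing if not.

package main

import (
	"context"
	"log"
	"time"
)

// listPendingExits is a hypothetical stand-in for the database call in the
// stack trace above. When the SQLite file is locked it returns an error
// instead of a list.
func listPendingExits(ctx context.Context) ([]string, error) {
	return nil, nil // no graceful exit was ever requested on this node
}

// runChore wakes up on an interval, asks the database whether any exit is
// pending, and does nothing when the list is empty.
func runChore(ctx context.Context, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			exits, err := listPendingExits(ctx)
			if err != nil {
				// The gracefulexit chore only logs this and retries on the
				// next cycle; it does not shut the node down by itself.
				log.Printf("error retrieving satellites: %v", err)
				continue
			}
			if len(exits) == 0 {
				continue // nothing to do
			}
			// ...process the pending exits here...
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_ = runChore(ctx, time.Second)
}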

What that underlying error is remains unclear at this point. Somehow one part of the software was trying to write to that table while another part was in the process of reading from it, or the other way around. Further investigation will be necessary to see where this is happening and to protect against it appropriately.
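
For background: SQLite reports "database is locked" (SQLITE_BUSY) when a connection cannot get the lock it needs before its busy timeout expires, typically because another connection is holding an open write transaction, or because the disk is too slow to finish the write in time. Here is a minimal, self-contained sketch of how that error surfaces in Go, assuming the github.com/mattn/go-sqlite3 driver and a hypothetical demo.db file; it is only an illustration, not the node's actual database code.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// 100 ms busy timeout so the demo fails fast instead of waiting.
	dsn := "demo.db?_busy_timeout=100"

	writer, err := sql.Open("sqlite3", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer writer.Close()

	if _, err := writer.Exec(`CREATE TABLE IF NOT EXISTS satellites (id TEXT PRIMARY KEY)`); err != nil {
		log.Fatal(err)
	}

	// First connection: start a write transaction and leave it open, which
	// holds the database's single write lock.
	tx, err := writer.Begin()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(`INSERT OR REPLACE INTO satellites (id) VALUES ('A')`); err != nil {
		log.Fatal(err)
	}

	// Second connection: its write cannot acquire the lock within the busy
	// timeout, so SQLite returns SQLITE_BUSY, which surfaces as the same
	// "database is locked" text seen in the node's log.
	other, err := sql.Open("sqlite3", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer other.Close()

	_, err = other.Exec(`INSERT OR REPLACE INTO satellites (id) VALUES ('B')`)
	fmt.Println("second writer:", err) // expect: ...database is locked

	// Once the first transaction ends, later writes succeed. A longer busy
	// timeout (or faster, uninterrupted storage) gives the lock time to
	// clear instead of erroring, which is why slow disks or antivirus scans
	// make this failure much more likely.
	_ = tx.Rollback()
}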

Well, nothing critical; it just wasted some 28 h of filewalker work.
I just can’t wait for the update that will let the filewalker resume from where it stopped!

Same for me! We’re working on it.

This is the problem to solve.

That’s the real problem.

So either something like an antivirus locked the database, or the disk is too slow.
Is it a VM? How is the disk connected to that Windows machine?