Sudden "ERROR gracefulexit:chore error retrieving satellites." restarts node

Ruskiem · March 18, 2024, 11:04am

Hi,
my node is working great, suddenly it restarts its storagenode.exe (Windows GUI, 1.96.6) due to some FATAL with some gracefullexit, but i dont call any gracefullexit, why is that?:

2024-03-17T07:10:21+01:00	INFO	piecestore	download started	{"Piece ID": "W5P3XTBYYWLL5B3PBHO4L3DEG3LC55I5TW57LOSHWFWSJCULKQNA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 7936, "Remote Address": "79.127.205.227:56544"}
2024-03-17T07:10:22+01:00	INFO	piecestore	downloaded	{"Piece ID": "W5P3XTBYYWLL5B3PBHO4L3DEG3LC55I5TW57LOSHWFWSJCULKQNA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 7936, "Remote Address": "79.127.205.227:56544"}
2024-03-17T07:10:28+01:00	INFO	piecestore	download started	{"Piece ID": "2GBME22WVZTU5PUZCPJPEG4EKKH73VK5YDRMDULDSQRVZMV6KX2A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 288000, "Remote Address": "79.127.201.209:56564"}
2024-03-17T07:10:29+01:00	INFO	piecestore	downloaded	{"Piece ID": "2GBME22WVZTU5PUZCPJPEG4EKKH73VK5YDRMDULDSQRVZMV6KX2A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 288000, "Remote Address": "79.127.201.209:56564"}
2024-03-17T07:10:41+01:00	INFO	piecestore	download started	{"Piece ID": "NPRTS7LF42QV5NYVI4SI77ZP6MKCGAMSXFIFNT7EPPT3YEPAR5IA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 1369856, "Remote Address": "79.127.201.210:60262"}
2024-03-17T07:10:45+01:00	DEBUG	orders	cleaning
2024-03-17T07:10:58+01:00	ERROR	gracefulexit:chore	error retrieving satellites.	{"error": "satellitesdb: database is locked", "errorVerbose": "satellitesdb: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:192\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2024-03-17T07:10:58+01:00	ERROR	services	unexpected shutdown of a runner	{"name": "forgetsatellite:chore", "error": "database is locked"}
2024-03-17T07:11:02+01:00	ERROR	orders	cleaning DB archive	{"error": "ordersdb: context canceled", "errorVerbose": "ordersdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersDB).CleanArchive:325\n\tstorj.io/storj/storagenode/orders.(*Service).CleanArchive:157\n\tstorj.io/storj/storagenode/orders.(*Service).Run.func2:141\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2024-03-17T07:11:02+01:00	INFO	piecestore	downloaded	{"Piece ID": "NPRTS7LF42QV5NYVI4SI77ZP6MKCGAMSXFIFNT7EPPT3YEPAR5IA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 1369856, "Remote Address": "79.127.201.210:60262"}
2024-03-17T07:11:02+01:00	FATAL	Unrecoverable error	{"error": "database is locked"}
2024-03-17T07:13:03+01:00	INFO	Configuration loaded	{"Location": "C:\\Program Files\\Storj\\Storage Node\\config.yaml"}
2024-03-17T07:13:03+01:00	INFO	Anonymized tracing enabled
2024-03-17T07:13:03+01:00	DEBUG	tracing collector	started
(...)

it was like ~260h of uninterrupted running time, and then that happens.
The node has no history of FATAL other than that one hmmm.
The log was in DEBUG mode. hmm thats all i got
normally after orders cleaning, its got
"DEBUG orders cleanup finished {“Items delete”: 0}
this time it was “ERROR gracefulexit: chore error”
whyyy

thepaul · March 18, 2024, 1:17pm

Firstly, the gracefulexit chore runs whether or not you’ve asked for a graceful exit; it’s just periodically checking whether it’s supposed to be gracefully exiting or not.

And it’s not even responsible for the shutdown. That error just gets logged and the system tries again later. Same with the error from the orders subsystem.

But the underlying problem caused the same error to occur in the forgetsatellite chore, which periodically checks whether it has been asked to delete everything related to a given satellite. Whether yours was asked to forget a satellite or not doesn’t matter here; the chore was unable just to get the list of satellites that it knew about. That’s what caused the shutdown.

What that underlying error is, is unclear at this point. Somehow one piece of the software was trying to write to that table while another part was in the process of reading from it. Or the other way around. Further investigation will be necessary to see where this is happening and protect against it appropriately.

Ruskiem · March 18, 2024, 1:43pm

well nothing cricital, it just wasted some 28h of filewalker job.
Just Can’t wait when this update comes in

github.com/storj/storj

[storagenode] Lazyfilewalker optimization

opened 12:39AM - 21 Jan 24 UTC

profclems

This is an ongoing discussion on the forum with great suggestions coming from SN…Os. For larger nodes, the lazyfilewalker might never complete for some of the satellites. It might keep running for weeks and then once the node is restarted (especially for docker nodes that are auto-updated and restarted) the Lazyfilewalker would have to start again. ## Possible Solutions ### Used Space Calculation - The lazyfilewalker should never start from the beginning after an interruption. It should periodically save its state and resume from the last one after restart (Ref: https://forum.storj.io/t/debugging-space-usage-discrepancies/24874/90?u=clement) - https://forum.storj.io/t/debugging-space-usage-discrepancies/24874/108?u=clement - https://forum.storj.io/t/debugging-space-usage-discrepancies/24874/106?u=clement TLDR; - the filewalker should be able to resume after being interrupted - we do this by taking advantage of the two-level directory structure - read the two-letter directory names from the satellite blobs folder and sort them alphabetically - Save the state after walking through each 2-letter directory - to reduce the mismatch for uploads/deletes during the file walker, for each upload/delete, check whether its two-letter directory has been scanned or not. If it was, update the used space cache PS: We don’t need the filewalker (used space calculation) to run on each restart. Maybe we can get it to only run once if it has never run or the used space db is inaccessible/malformed or empty. If we can guarantee that every other read/write operation updates the usage cache, we wouldn't need to rely on the filewalker ### Garbage Collection - GC filewalker should periodically save the state after walking through each two level directory

for it, to be able to resume, when it stoped!

thepaul · March 18, 2024, 4:08pm

Same for me! We’re working on it.

daki82 · March 18, 2024, 8:30pm

this is the problem to solve

Alexey · March 19, 2024, 4:27am

That’s the real problem:

So, either something like antivirus locked the database or the disk is too slow.
Is it a VM? How disk is connected to that Windows?