Strange error concerning graceful exit, while not gracefully exiting

I get this error:

2024-06-15T00:30:28Z    ERROR   services        unexpected shutdown of a runner                  {"Process": "storagenode", "name": "forgetsatellite:chore", "error": "database is locked"}
2024-06-15T00:30:33Z    ERROR   gracefulexit:chore      error retrieving satellites.             {"Process": "storagenode", "error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\*satellitesDB).ListGracefulExits.func1:200\n\*satellitesDB).ListGracefulExits:212\n\*Service).ListPendingExits:59\n\*Chore).AddMissing:55\n\*Cycle).Run:160\n\*Chore).Run:48\n\*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\*Group).Run.func2:86\n\*Group).Go.func1:78"}
2024-06-15T00:30:37Z    ERROR   failure during run      {"Process": "storagenode", "error": "database is locked"}
Error: database is locked
2024-06-15 00:30:37,607 INFO exited: storagenode (exit status 1; not expected)

Why is this task even running?
Checking all databases didn't show any errors.
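For reference, "checking the databases" here presumably means something like SQLite's built-in integrity check. A minimal standalone Python sketch (the filename is purely illustrative, not necessarily one of the actual storagenode database files):

```python
import os
import sqlite3
import tempfile

# Illustrative filename only; a real check would point at the node's db files.
path = os.path.join(tempfile.mkdtemp(), "bandwidth.db")

con = sqlite3.connect(path)
con.execute("CREATE TABLE usage (bytes INTEGER)")  # create something to check
con.commit()

# PRAGMA integrity_check returns a single row "ok" for a healthy database,
# or a list of problems for a corrupted one.
result = con.execute("PRAGMA integrity_check").fetchone()[0]
print(result)
con.close()
```

Note that an integrity check passing only rules out corruption; it says nothing about whether another process currently holds a lock on the file.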

It should update these databases with info from the satellites. If it cannot, it may fail.
Do you have a FATAL error around this time also?

No, I don't have any FATAL error in the same time frame (up to half an hour before this restart). Actually, I see this error on some other nodes as well, but apparently it usually doesn't make the storagenode restart. So the restart and this error appearing together might even be a coincidence.

It feels like some process is locking the database. That process is being killed for an unknown reason. But then the database remains locked forever…?

That should not be the case: if the process is killed, it is no longer running and cannot hold a lock. But a lock can be held by the remaining main process, i.e. the storagenode itself, if you disabled the lazy mode.
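The locking behaviour described here can be reproduced with a small standalone Python sketch (the table and file names are made up; this is not the storagenode code): one connection holding a write transaction makes a second connection fail with exactly this "database is locked" error until the first one commits.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.db")

writer = sqlite3.connect(path, timeout=0)  # timeout=0: fail immediately when busy
writer.execute("CREATE TABLE exits (id INTEGER)")
writer.commit()
writer.execute("BEGIN IMMEDIATE")               # take the write lock and hold it
writer.execute("INSERT INTO exits VALUES (1)")

other = sqlite3.connect(path, timeout=0)
err_msg = ""
try:
    other.execute("INSERT INTO exits VALUES (2)")  # lock is held elsewhere
except sqlite3.OperationalError as e:
    err_msg = str(e)
    other.rollback()

print(err_msg)  # -> "database is locked"

writer.commit()                                 # the "main process" releases the lock
other.execute("INSERT INTO exits VALUES (2)")   # now succeeds
other.commit()
```

Real clients usually set a busy timeout (the `timeout` argument above) so that short-lived locks are retried instead of failing; the error in the logs suggests the lock was held longer than that grace period.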

That's indeed the case, but then it's strange that the system can't cope with this expected situation?

Perhaps there is a relation between the disk speed and the size of the database, see

Yeah, might be. Although I also got this error on an external USB-attached SSD. As long as it doesn't kill my node, I'm inclined to ignore it and wait a bit for STORJ to mature further.

Hm, SSD. Then that must not be the case.
How is it possible that the DB can be locked on an SSD?
Did you check the speed?

I ran iostat -x and I see no worrisome or unusual utilization figures and so on. But this error now happens on almost all nodes, with the filewalker disabled.
The databases are on the internal SSD BTW.

Weird. Very weird. How is it possible that my spinning disks (under Windows and, worse, Docker Desktop for Windows) don't have this issue?
Ok. I would ping the team. But right now it’s very strange.

If all nodes use the same internal SSD, that might slow it down, which is why the database got locked. It's usually seen on slow drives.

Yes, but an SSD is 100x faster than a spinning disk, I hope?
@JWvdV what's the model? Not something comparable to a low-end SD card, I hope?

Agreed. It could be a dying SSD or an overworked SSD, so we need more info.

Well, you have to wonder if there is something in common with @Qwinn, who is having locked-DB errors on a highly performant system as well.
There does seem to be something fishy surfacing with the DBs.

No, it's not the same… We are talking about locks on an SSD…
An SSD is much faster than an HDD, usually…
But I agree, something is not right here.
However, I know you host Raspberry Pi nodes (likely less powerful than any of the setups discussed). Do you see this problem on your nodes?

No, not really.
The Pi5s are dealing perfectly with the workload so far.
I only wish there was a PCI-SATA bridge for the Pi5 so I didn't have to go via USB, but otherwise all my Pi5-based nodes have been performing just fine (as far as I can tell so far).

This is exactly why I loved this one

Very simple and slim design.

It's happening on many nodes, whose databases are backed by a Kingston NV2, two Crucial MX500s, and an unknown-brand N900-512.

They all show no errors in SMART. Looking at iostat, on average they delivered a throughput of 5 MiB/s read and 2 MiB/s write, at about 20% utilization. Temperatures are all around 40 degrees Celsius.

Compared to benchmarks, these loads are still peanuts.
