Strange error concerning graceful exit, while not gracefully exiting

I get this error:

2024-06-15T00:30:28Z    ERROR   services        unexpected shutdown of a runner                  {"Process": "storagenode", "name": "forgetsatellite:chore", "error": "database is locked"}
2024-06-15T00:30:33Z    ERROR   gracefulexit:chore      error retrieving satellites.             {"Process": "storagenode", "error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\*satellitesDB).ListGracefulExits.func1:200\n\*satellitesDB).ListGracefulExits:212\n\*Service).ListPendingExits:59\n\*Chore).AddMissing:55\n\*Cycle).Run:160\n\*Chore).Run:48\n\*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\*Group).Run.func2:86\n\*Group).Go.func1:78"}
2024-06-15T00:30:37Z    ERROR   failure during run      {"Process": "storagenode", "error": "database is locked"}
Error: database is locked
2024-06-15 00:30:37,607 INFO exited: storagenode (exit status 1; not expected)

Why is this task even running?
Checking all databases didn't show any errors.
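For reference, "checking the databases" here presumably means something like SQLite's built-in integrity check. A minimal standalone Python sketch (the filename is purely illustrative, not necessarily one of the actual storagenode database files):

```python
import os
import sqlite3
import tempfile

# Illustrative filename only; a real check would point at the node's db files.
path = os.path.join(tempfile.mkdtemp(), "bandwidth.db")

con = sqlite3.connect(path)
con.execute("CREATE TABLE usage (bytes INTEGER)")  # create something to check
con.commit()

# PRAGMA integrity_check returns a single row "ok" for a healthy database,
# or a list of problems for a corrupted one.
result = con.execute("PRAGMA integrity_check").fetchone()[0]
print(result)
con.close()
```

Note that an integrity check passing only rules out corruption; it says nothing about whether another process currently holds a lock on the file.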

It should update these databases with info from the satellites. If it cannot, it may fail.
Do you have a FATAL error around this time also?

No, I don't have any FATAL error in the same time frame (up to half an hour before this restart). Actually, I see this error on some other nodes as well, but apparently it usually doesn't make the storagenode restart. So the restart and this error appearing together might even be a coincidence.

It feels like some process is locking the database. That process is being killed for an unknown reason. But then the database remains locked forever…?

That should not be the case: if the process is killed, it is no longer running and cannot hold a lock. But a lock can be held by the remaining main process, i.e. the storagenode itself, if you disabled the lazy mode.
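The locking behaviour described here can be reproduced with a small standalone Python sketch (the table and file names are made up; this is not the storagenode code): one connection holding a write transaction makes a second connection fail with exactly this "database is locked" error until the first one commits.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.db")

writer = sqlite3.connect(path, timeout=0)  # timeout=0: fail immediately when busy
writer.execute("CREATE TABLE exits (id INTEGER)")
writer.commit()
writer.execute("BEGIN IMMEDIATE")               # take the write lock and hold it
writer.execute("INSERT INTO exits VALUES (1)")

other = sqlite3.connect(path, timeout=0)
err_msg = ""
try:
    other.execute("INSERT INTO exits VALUES (2)")  # lock is held elsewhere
except sqlite3.OperationalError as e:
    err_msg = str(e)
    other.rollback()

print(err_msg)  # -> "database is locked"

writer.commit()                                 # the "main process" releases the lock
other.execute("INSERT INTO exits VALUES (2)")   # now succeeds
other.commit()
```

Real clients usually set a busy timeout (the `timeout` argument above) so that short-lived locks are retried instead of failing; the error in the logs suggests the lock was held longer than that grace period.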

That's indeed the case, but then it's strange that the system can't cope with this expected situation?

Perhaps there is a relation between the disk speed and the size of the database, see

Yeah, might be. Although I also got this error on an external USB-attached SSD. As long as it doesn't kill my node, I'm inclined to ignore it and wait a bit for STORJ to mature further.

Hm, SSD. Then that must not be the case.
How is it possible that the DB can be locked on an SSD?
Did you check the speed?

I ran iostat -x and I see no worrisome or unusual utilization figures and so on. But this error now happens on almost all nodes, with the filewalker disabled.
The databases are on the internal SSD BTW.

Weird. Very weird. How is it possible that my spinning disks (under Windows and, worse, Docker Desktop for Windows) don't have this issue?
Ok. I would ping the team. But right now it’s very strange.

If all nodes use the same internal SSD, that might slow it down, which is why the database got locked. It's usually seen on slow drives.

Yes, but an SSD is 100x faster than a spinning disk, I hope?
@JWvdV what's the model? Not something comparable to a low-end SD card, I hope?

Agreed. It could be a dying SSD or an overworked SSD, so we need more info.

Well, you have to wonder if there is something in common with @Qwinn, who is having locked-DB errors on a highly performant system as well.
There does seem to be something fishy surfacing with the DBs.

No, it's not the same… We are talking about locks on an SSD…
An SSD is much faster than an HDD, usually…
But I agree, something is not right here.
However, I know you host Raspberry Pi nodes (likely less powerful than any of the setups discussed). Do you see this problem on your nodes?

No, not really.
The Pi5s are dealing perfectly with the workload so far.
I only wish there was a PCI-SATA bridge for the Pi5 so I didn't have to go via USB, but otherwise all my Pi5-based nodes have been performing just fine (as far as I can tell so far).

This is exactly why I loved this one

Very simple and slim design.

It's happening on many nodes, whose databases are backed by a Kingston NV2, two Crucial MX500s, and an unknown-brand N900-512.

They all show no errors in SMART. Looking at iostat, on average they delivered a throughput of 5 MiB/s read and 2 MiB/s write, at about 20% utilization. Temperatures are all around 40 degrees Celsius.

Compared to benchmarks, these loads are still peanuts.
