Node hangs after receiving "WARN retain failed to delete piece" warning message

Hi, I noticed my node hung and was down for more than 5 hours, and based on the log I see this WARN message repeated:

2023-12-21T00:00:49Z WARN retain failed to delete piece {"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "", "error": "pieces error: pieceexpirationdb: context canceled", "errorVerbose": "pieces error: pieceexpirationdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).Trash:112\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:403\n\tstorj.io/storj/storagenode/retain.(*Service).trash:364\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

This never happened to me before; it only started happening recently, so it could be related to the new node version 1.93.2.


I’m seeing the same thing on my node


Does your node also hang?

This warning means that the node cannot access a database file.
I would suggest stopping and removing the container, then checking and fixing the filesystem, and after that checking and fixing the databases.
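For example, a rough sketch for a docker node (the container name storagenode, the mount point and the device name are assumptions, so adjust them to your setup and use the check tool that matches your filesystem):

    docker stop -t 300 storagenode
    docker rm storagenode
    sudo umount /mnt/storj      # the filesystem must not be mounted while checking it
    sudo fsck -f /dev/sdX1      # e.g. e2fsck for ext4
    sudo mount /mnt/storj       # re-mount (assumes an fstab entry)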

If you would prefer not to fix corrupted databases (if you have corruption at all), you may also re-create only the corrupted ones instead, but you will lose the historic and Stat data:

If they are not broken, you do not need to fix them, though.
However, the node will not stop because of a warning. You need to search for a FATAL error in the node's logs or OOM events in the system journals.
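For a docker node that search could look roughly like this (the container name storagenode is an assumption):

    docker logs storagenode 2>&1 | grep FATAL              # fatal errors logged by the node itself
    sudo journalctl -k | grep -i -E 'out of memory|oom'    # OOM kills are recorded by the kernel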

The problem for me is that my node hangs at that point.
Is this also expected behavior?
IMO this should be an error, since the node hangs at that point.

This is usually a hardware issue, unfortunately, so I would suggest checking and fixing the filesystem on the disk first.
Some devices can hang if the disk is not responding, which usually means the disk is dying. The other case is when the device does not have enough free RAM and swap is either disabled or too small; in that case a device like a Raspberry Pi may hang as well.
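A quick way to look at both suspects (the device name /dev/sdX is an assumption, and smartctl comes from the smartmontools package):

    sudo dmesg | grep -i -E 'ata|i/o error|reset'   # kernel complaints about the disk
    sudo smartctl -a /dev/sdX                       # SMART health report of the drive
    free -h                                         # free RAM
    swapon --show                                   # whether swap exists and how big it is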

OK, thanks. I think the problem in my case is related to memory; all disk checks are OK for now. I do see storagenode using a large amount of memory at that point, even though I still have plenty of RAM.

This is an indication of a slow disk subsystem, by the way. It should not consume a lot of RAM (my nodes use 150MB each, up to 300MB).
How is your disk connected?

It depends on the load I get, but it could be anywhere from 100MB to 1.5GB.

I can confirm, I spotted the same yesterday on one of my nodes (21.12.2023, 04:25:02 GMT+1); it got 1.93.2, so maybe it's related.
It's the Windows GUI. I had to restart the computer because the storagenode service hung (stuck in stopping mode) and the log was stuck on "WARN retain failed to delete piece" messages, while the HDD was still working normally. I just restarted Windows and it's working normally again. No hardware problems spotted. I can provide the log later if needed.

This is weird. I have a Windows service as well; it was updated to 1.93.2 recently (several days ago), and I have not restarted this server for at least 1.5 years. It just works… Not a big node though, only 1.4TB.

I'm using Docker on Linux; my node is 4TB.

I have docker nodes as well, and they are running inside Docker Desktop with a WSL2 engine on my Windows server.
However, I used to have a Linux node on a Raspberry Pi 3B+ device with a 2TB disk. It hung when the node used more RAM than was available (this model has only 1GB of RAM, and only about 800MB of it was free), so I was forced to use a memory-limiting option to kill the node instead of the whole system if it consumed more than 800MB of RAM. This was done with the option --memory=800m (see https://support.storj.io/hc/en-us/articles/360026612332-Install-storagenode-on-Raspberry-Pi3-or-higher-version).
You may limit the memory usage for the container if you do not have enough RAM. But I suppose you have much more memory on this host, so perhaps it's not needed at all.

Are you sure that this warning is the last message before the hang?
What do you have in the system logs at that time?
You may use the journalctl command to search for events around that time, for example:
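(the time window below is only an illustration; put in the time of your own hang)

    sudo journalctl --since '2023-12-21 00:00' --until '2023-12-21 06:00' | grep -i -E 'oom|error|storagenode'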

Did you check the databases as well?

Yeah, my server has quite a large amount of RAM, so I don't mind limiting it.
The DB on my node is quite new: I had an issue with a db about 2 months ago, so I already deleted it and re-created it from scratch, so I don't think that newly created db could have an error in it.

I could check it again, but after the hang issue I've had, I don't want to mess around and give my node an even lower uptime rate.

The databases check is quick enough; it will only check consistency and that's all.
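A minimal sketch, assuming the databases live in /mnt/storj/storage and the sqlite3 CLI is installed (stop the node first so the files are not in use); every healthy database should answer ok:

    for db in /mnt/storj/storage/*.db; do
        echo "$db: $(sqlite3 "$db" 'PRAGMA integrity_check;')"
    done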

A common problem I've had with various Linuxes is an extreme slowdown or outright lock-up when running near maximum memory usage, usually affected by how much swap space I have and other issues.

If this is still happening to you, would you be able to share a graph of memory usage on your storage node around the time of the hang? I don't know WSL in Windows very well, but both Windows and Linux should have applications to capture graphs of memory and other resource usage over time.
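On Linux with docker, even something as rough as this (the container name storagenode is an assumption) would record the container's memory usage once a minute into a CSV that is easy to graph later:

    while true; do
        echo "$(date -Is),$(docker stats --no-stream --format '{{.MemUsage}}' storagenode)" >> storagenode-mem.csv
        sleep 60
    done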

I am worried that perhaps our garbage collection process is using more RAM than we expect, or perhaps has a memory leak.

Thank you!


I'm now writing a script to auto-restart the node when it consumes more than 2GB of RAM, because I see the problem happen when it consumes around 2.5GB of RAM.
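Something along these lines (an untested sketch; the container name storagenode is an assumption, and it parses the human-readable output of docker stats, so adjust as needed):

    #!/bin/sh
    NODE=storagenode
    LIMIT_MB=2048
    # docker stats prints usage like "1.84GiB / 31.3GiB"; take the first field and convert it to MiB
    USAGE=$(docker stats --no-stream --format '{{.MemUsage}}' "$NODE" | awk '{print $1}')
    case "$USAGE" in
      *GiB) USED_MB=$(printf '%s' "$USAGE" | sed 's/GiB//' | awk '{printf "%d", $1 * 1024}') ;;
      *MiB) USED_MB=$(printf '%s' "$USAGE" | sed 's/MiB//' | awk '{printf "%d", $1}') ;;
      *)    USED_MB=0 ;;
    esac
    if [ "$USED_MB" -gt "$LIMIT_MB" ]; then
      docker restart -t 300 "$NODE"   # give the node time to shut down cleanly
    fi

I'd run it from cron every few minutes.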

I might check the db next month when my score is better.

You may specify the --memory 2GB option in your docker run command somewhere before the image name, for example as in the sketch below.
This will do exactly the same thing: send a signal to close the node if it consumes more memory than you specified. But the garbage collector should likely not let it grow beyond 2GB in the first place.
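A shortened example (keep all the ports, mounts and environment variables you already use; the only change is adding --memory before the image name):

    docker run -d --restart unless-stopped --stop-timeout 300 \
        --memory 2GB \
        ... your existing -p, -e and --mount options ... \
        --name storagenode storjlabs/storagenode:latest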

I have the same problem during GC. We need better memory management, I think.