18 h uptime in v1.108.3
and the DB looking like this. It's on a cached NVMe disk.
Do I have to worry?
The only thing to worry about seems to be whether the database implementation is correct and efficient.
There is some discussion on GitHub:
You can also try stopping the node and vacuuming the database. You might want to make a backup copy first, though, just to be safe. This was just a test I did on an old copy of some DB that I had.
More talk about vacuuming in: Vacuum databases in a ramdisk to reduce downtime
Edit: Never mind, I just realized you are talking about the WAL file, not the DB itself, but you can still vacuum if you want…
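If you do decide to try it, here is a minimal Python sketch of that idea, assuming the node is stopped first; the path /mnt/storagenode/storage/piece_expiration.db is just a placeholder for wherever your databases live. It makes a backup copy via the SQLite backup API and then runs VACUUM:

```python
import sqlite3

# Placeholder path; adjust to where your node keeps its databases.
DB_PATH = "/mnt/storagenode/storage/piece_expiration.db"
BACKUP_PATH = DB_PATH + ".bak"

src = sqlite3.connect(DB_PATH)
dst = sqlite3.connect(BACKUP_PATH)
try:
    # Consistent backup via the SQLite backup API (also captures pending -wal content).
    src.backup(dst)
    # Rebuild the main file and reclaim free pages.
    src.execute("VACUUM;")
    # Quick sanity check afterwards; should print "ok".
    print(src.execute("PRAGMA integrity_check;").fetchone()[0])
finally:
    dst.close()
    src.close()
```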
I do not think you need to worry; it will be deleted on restart.
Yes, it will be deleted when the node is restarted. But this is a sign that the TTL collector on this node has serious problems: it cannot keep up with deleting expired pieces, and when the node is restarted it will most likely lose all its progress (work already done) and start over, trying to delete files that have long since been deleted.
Details can be found in the description of the problem on GitHub at the link above.
A fix for this issue is already in the works, but it will most likely only be integrated in v109 at best, or even v110.
So it's better not to restart the node yet: there is a chance that sooner or later it will catch up and clear everything, including the -wal file, when that happens. But it really took almost 6 days of work for one of my nodes (5700 + 3000 MB for the piece_expiration DB files), and on the second one (~8500 MB + 3300 MB) it is still in progress: 7 days and counting.
But if you restart it, it will only get worse, because the process will restart from the beginning.
P.S.
But there's really no need to worry too much about it. This situation does not pose any risk to user data or of node disqualification.
In the worst case (if the situation does not normalize itself after a few days or a week), you will have to delete this database file (it will be re-created and begin to fill up again) and then wait a few more weeks until the expired pieces are deleted by the regular garbage collector instead of the TTL collector.
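If you want to keep an eye on the process in the meantime, here is a rough Python sketch that prints the sizes of piece_expiration.db and its -wal/-shm files once an hour; the storage path is just an assumption, adjust it to your node:

```python
from pathlib import Path
import time

# Assumed storage directory; adjust to your node's configuration.
STORAGE_DIR = Path("/mnt/storagenode/storage")
FILES = ("piece_expiration.db", "piece_expiration.db-wal", "piece_expiration.db-shm")

def sizes_mb() -> dict:
    """Current size in MB of the piece_expiration DB and its WAL/SHM files."""
    return {
        name: round((STORAGE_DIR / name).stat().st_size / 1024**2, 1)
        if (STORAGE_DIR / name).exists() else 0.0
        for name in FILES
    }

# Log the sizes once an hour.
while True:
    print(time.strftime("%Y-%m-%d %H:%M:%S"), sizes_mb())
    time.sleep(3600)
```

A steadily shrinking -wal file means the TTL collector is catching up; a growing one means it is still falling behind.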
I do not think so; actually, this only means that your node has problems adding records to the database online, so it is adding them to the journal, which will be replayed on restart.
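For what it's worth, that same replay can be forced manually with a WAL checkpoint while the node is stopped; a minimal Python sketch, with the database path again just an assumed example:

```python
import sqlite3

# Assumed path; point it at the database whose -wal file has grown.
DB_PATH = "/mnt/storagenode/storage/piece_expiration.db"

con = sqlite3.connect(DB_PATH)
try:
    # Replay the -wal file into the main DB and truncate it.
    busy, wal_frames, checkpointed = con.execute(
        "PRAGMA wal_checkpoint(TRUNCATE);"
    ).fetchone()
    print(f"busy={busy} wal_frames={wal_frames} checkpointed={checkpointed}")
finally:
    con.close()
```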
Did you move all databases to an SSD to reduce latency and "locked" issues?
You are right.
No. They have been there since the node started, a year ago.
However, it was resolved today.