ERROR filewalker failed to get progress from database

Yes, you may sort them by filename and see which folder is currently being processed (you likely only need the ones that were read recently). The walker goes from aa to zz inside each satellite's folder.
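If it helps, here's a rough sketch of that idea (not official tooling; the storage path is an assumption, and it relies on your mount updating atime, so with noatime it won't tell you much):

```python
#!/usr/bin/env python3
# Rough sketch: guess used-space walker progress from the two-character
# prefix folders under blobs/<satellite>.
# Assumptions: default layout <storage>/blobs/<satellite id>/<2-char prefix>/
# and a mount that updates atime (relatime/atime); with noatime this is useless.
import os
import sys
import time

storage = sys.argv[1] if len(sys.argv) > 1 else "/mnt/storagenode/storage"
blobs = os.path.join(storage, "blobs")

for satellite in sorted(os.listdir(blobs)):
    sat_dir = os.path.join(blobs, satellite)
    prefixes = sorted(
        d for d in os.listdir(sat_dir)
        if os.path.isdir(os.path.join(sat_dir, d))
    )
    if not prefixes:
        continue
    # The most recently accessed prefix folder is roughly where the walker is now.
    latest = max(prefixes, key=lambda d: os.stat(os.path.join(sat_dir, d)).st_atime)
    done = prefixes.index(latest) + 1
    ts = time.ctime(os.stat(os.path.join(sat_dir, latest)).st_atime)
    print(f"{satellite}: around {latest} ({done}/{len(prefixes)} prefixes), last touched {ts}")
```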

No, you do not need to do so.
If the lazy mode is enabled, you may track filewalkers normally via logs.
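For example, a minimal sketch (it assumes you log to a file, so adjust the path; the exact logger names also differ between versions, so it just matches the "filewalker" substring):

```python
#!/usr/bin/env python3
# Minimal sketch: pull filewalker-related lines out of the node log.
# Assumption: logging to a file (log.output pointed at node.log); adjust the path.
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "node.log"
with open(log_path, "r", errors="replace") as f:
    for line in f:
        # Matches used-space, gc and trash filewalker messages alike.
        if "filewalker" in line.lower():
            print(line.rstrip())
```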

Again, I’ve said this a few times: before I did this, the lazy filewalker would crash right away and never start up again. I waited nearly a week and it didn’t try again. Restarting did nothing; it attempts to run at startup and crashes.

Turning off the lazy filewalker gave me no logs, and after 80 hours I decided that nothing was happening with the non-lazy filewalker.

After what I did, it seems like the lazy filewalker is working again, but it’s only been ~24 hours, so I’ll give it a few more days.

Version 1.107 is going to change this.


Well, that would’ve been lovely, and if this doesn’t work I’ll just have to wait for it to roll out… in like 3 months, I guess. But I’m still hopeful about the lazy filewalker; I’ll be giving it roughly 72 hours total to complete as well (currently at 28h).

Yes. If it has failed, it will not be restarted. There are a few ways to improve the situation:

  • disable the lazy mode and allow it to finish;
  • set the allocated space lower than the used space shown on your dashboard (it will stop ingress and release more IOPS to the lazy filewalker; see the sketch after this list);
  • add a cache (more RAM, or use an SSD if your disk subsystem allows this).
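As a rough aid for the second option, here is a sketch (it assumes the default local dashboard API on 127.0.0.1:14002 and the /api/sno endpoint; field names may differ between versions) that reads the node's own used vs. allocated numbers:

```python
#!/usr/bin/env python3
# Sketch: compare used vs allocated space via the node's local dashboard API.
# Assumptions: default console address 127.0.0.1:14002 and the /api/sno endpoint;
# field names may differ between versions.
import json
import urllib.request

API = "http://127.0.0.1:14002/api/sno"

with urllib.request.urlopen(API, timeout=10) as resp:
    data = json.load(resp)

disk = data.get("diskSpace", {})
used = disk.get("used", 0)
trash = disk.get("trash", 0)
available = disk.get("available", 0)  # the allocated size, as reported by the node

print(f"allocated: {available / 1e12:.2f} TB")
print(f"used:      {used / 1e12:.2f} TB (+ {trash / 1e12:.2f} TB trash)")
if used and used < available:
    # Setting the allocation below what is already used stops new ingress,
    # which frees IOPS for the filewalkers.
    print(f"to stop ingress, set the allocation below ~{used / 1e12:.2f} TB")
```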

This is why I linked alternatives for how to track it. And 80h seems a very short period for this setup, unless you didn’t see a constant load on your disks.

Did this.

I have 32GB of RAM for ~15 TB stored - not an awful amount, but it sure could be better.

That’s the thing, I did not see any load on them beyond ~300 kB/s reads, and given that I had no confirmation the non-lazy walker was working, I cut it off at 80 hours. Yes, I could’ve let it run longer, but I don’t see how it would change the fundamental issue.

From my understanding, both lazy and non-lazy use the cache to resume the filewalker process, right? And lazy was reporting an issue accessing or reading some previous progress (cache?) file, so I am led to believe that non-lazy is running into the same issue.

Either way, it seems like it’s getting close to done (or about halfway, depending on whether we are on zh or ph) for my 16TB node:

The other node (8TB) is further along and doing the numbers. I do see something interesting for the 8TB node in my logs:

2024-06-27T16:11:10-04:00	ERROR	blobscache	trashTotal < 0	{"trashTotal": -162823316946}

→ not sure how much of a problem this is?
But that’s related to the trash walker, not used-space, so I’ll ignore it for now and circle back to it.

They use a cache, yes (if something traversed all the pieces shortly before, the metadata is still cached), but not a resume; that’s not implemented/released yet:

However, the used-space-filewalker should put all the meta-information into the cache, so all other operations should run faster.
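Purely as a toy sketch (this is not how the node does it) to show the difference: a resume would persist the last finished prefix and skip it on the next run, while a warm cache only makes re-walking the same metadata cheaper:

```python
#!/usr/bin/env python3
# Toy sketch only (NOT the node's implementation): "resume" means persisting
# the last finished prefixes and skipping them next run; a warm cache still
# re-walks everything, it is just faster to read.
import json
import os

PROGRESS_FILE = "walk_progress.json"  # hypothetical file name

def scan(prefix):
    # Imagine summing piece sizes for this prefix folder here.
    print(f"scanning {prefix} ...")

def walk_with_resume(prefixes):
    done = set()
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            done = set(json.load(f))
    for prefix in prefixes:
        if prefix in done:
            continue  # resume: skip work that already finished
        scan(prefix)
        done.add(prefix)
        with open(PROGRESS_FILE, "w") as f:
            json.dump(sorted(done), f)

if __name__ == "__main__":
    walk_with_resume(["aa", "ab", "ac"])
```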

This means that this node has a discrepancy between the info in the databases and what was actually found on disk. So it likely has issues with a filewalker and/or the databases.
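To illustrate the kind of thing that produces that negative number (a toy example, not the actual blobscache code): if trash deltas are subtracted from a stale baseline, the cached counter goes below zero:

```python
#!/usr/bin/env python3
# Toy illustration (not the actual blobscache code): a cached running total
# goes negative when deletions are applied against a stale baseline.
cached_trash_total = 50_000          # stale value loaded from the database
actual_trash_on_disk = 200_000_000   # what a correct walk would have recorded instead

# Pieces are emptied from trash; their sizes are subtracted from the cached total.
for emptied_piece_size in (40_000_000, 60_000_000, 120_000_000):
    cached_trash_total -= emptied_piece_size

print(cached_trash_total)  # -219950000 -> the node logs "trashTotal < 0"
```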

So the 8TB and 16TB finally finished, meaning whatever I did to kickstart it (I’ll explain at the bottom) worked.

The 8TB used space graph matches the OS reported usage.

The issue remains with the 16TB:


The filewalkers seem to have completed successfully, but there is still a large discrepancy:
[screenshot of the used-space discrepancy]

There are no errors and no “database locked” messages (the DBs are 100% on a different SSD), so I’m not sure what could be causing this?

P.S.
What I did to kickstart the process and make it ignore the previous progress:

  1. Stop the node
  2. Cut (move) the following DBs to a different folder (see the sketch after this list):
    [screenshot of the .db files]
  3. Start the node (it should immediately crash)
  4. Put the above DBs back
  5. Start the node - it forgets about the previous progress and starts the filewalkers as normal
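For reference, a rough sketch of steps 2 and 4 (the .db filenames below are hypothetical placeholders - use the ones from the screenshot - and the storage path is an assumption; the node must be stopped first):

```python
#!/usr/bin/env python3
# Rough sketch of steps 2 and 4: move selected .db files aside, then back.
# The node must be stopped first. The filenames here are hypothetical
# placeholders -- substitute the ones from the screenshot / your own setup.
import shutil
import sys
from pathlib import Path

STORAGE = Path("/mnt/storagenode/storage")      # assumption: default storage dir
SIDE_DIR = Path("/mnt/storagenode/db_parked")   # temporary parking folder
DB_FILES = ["example_used_space.db"]            # hypothetical, replace with real names

def move(src_dir: Path, dst_dir: Path):
    dst_dir.mkdir(parents=True, exist_ok=True)
    for name in DB_FILES:
        src = src_dir / name
        if src.exists():
            shutil.move(str(src), str(dst_dir / name))
            print(f"moved {src} -> {dst_dir / name}")

if __name__ == "__main__":
    # "out" before step 3 (the node crashes on start), "back" before step 5.
    direction = sys.argv[1] if len(sys.argv) > 1 else "out"
    if direction == "out":
        move(STORAGE, SIDE_DIR)
    else:
        move(SIDE_DIR, STORAGE)
```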

So after a restart (I was adding a new HDD for migrating a small node to a much larger HDD), the 16TB used space got fixed - so maybe it needed to be restarted after the filewalkers finished? I don’t know :person_shrugging:

But they seem to be fine…for now at least


As far as I understand, you need to wait some time before the values are updated on the dashboard. However, the restart forced this.

Not quite,

so my initial issue was that the filewalker was failing to get some file/reference, crashing immediately, and refusing to start again

Something about cachePath??

To get past this I did this:

This then eventually fixed my 8TB node; the 16TB node needed a restart to update the used-space pie chart to the correct values.

You also removed the safety check file storage-dir-verification; this will definitely crash the node, because that file is missing.

I didn’t know it would, but that was the goal. I wanted it to crash while forgetting about any previous progress, which I did accomplish :sweat_smile: