ERROR lazyfilewalker.gc-filewalker.subprocess failed to save progress in the database

What you could try is to check and verify the values directly in the database. When the node is switched off and the value in the database is 11.2 TB, then indeed when the node starts it should show 11.2 TB.
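For example, a minimal sketch with the sqlite3 CLI, run while the node is stopped (the path and the table/column names here assume a typical storagenode layout and may differ on your setup):

# per-satellite and total used-space values as stored in the database
sqlite3 /app/config/storage/piece_space_used.db "SELECT hex(satellite_id), total FROM piece_space_used;"
sqlite3 /app/config/storage/piece_space_used.db "SELECT SUM(total) FROM piece_space_used;"

The sum should match what the node shows right after a start.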

Also, what you could try is to move all databases into a new folder (while the node is offline, of course), let it create everything from scratch, and leave it running to see if the error reproduces.
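Roughly like this (a sketch; the paths shown are the usual in-container ones, so substitute the matching host paths for your mount):

docker stop -t 300 storagenode
# hypothetical holding folder for the old databases
mkdir /app/config/storage/db-backup
mv /app/config/storage/*.db /app/config/storage/db-backup/
docker start storagenode  # fresh, empty databases are created on start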

So, looking for non-lazy filewalker progress, I ran ls -l /proc/STORJPROCESSID/fd through grep, using the main process ID. I found only two lines with “blobs” that seemed to match what I’d look for in the lazy version.
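In other words, something like this (a sketch, assuming pidof can see the main storagenode process):

ls -l /proc/$(pidof storagenode)/fd | grep blobs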

lr-x------ 1 qwinn qwinn 64 Jun 15 02:29 67 -> /app/config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa
lr-x------ 1 qwinn qwinn 64 Jun 15 02:29 68 -> /app/config/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/aa

So. According to this, it would seem it’s still working on aa.

And it’s been on aa since I restarted the node. About 30 minutes now.

It’s even slower than the lazy version.

Utterly doomed.

EDIT: A couple of hours later, it’s on ag. Down to 611 GB free space on the disk.

So so doomed.

EDIT 2: So just as a test, I shut down my other process, to see if this non-lazy walker would get any faster. Nope, at least not in any perceptible way. It’s been stuck on “am” for at least 15 minutes since I did that. I seriously think the lazy one was faster.

Doom, doom doom doom, doooooom.

As for this advice of moving the databases and letting them be recreated… wouldn’t it still need to complete a full filewalker run in time for that to help?

Yes. It would recreate all databases empty and fill them with values from scratch.

Yeah, I don’t think that would help then. We’ve confirmed my databases are fine (at high cost). The issue now is there is no way to get the filewalkers to correct the incorrect stats soon enough to prevent it from trying to write to a completely full disk in, I estimate, around 18 hours.

I’m sorry about that. Then you need to adjust the setup.

This is good; fewer points to check.

So I would assume that CPU usage is not the problem here; the disk alone is slow for some reason.

I’m sorry again, but I have no other way to check the databases’ integrity, because resetting the usage is always related to problems with a database (either malformed, corrupted, or locked).
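(For reference, the built-in SQLite self-check is the one we have been relying on; the path here is illustrative:)

# run with the node stopped; prints "ok" when the file is sound
sqlite3 /app/config/storage/bandwidth.db "PRAGMA integrity_check;"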

But at least it doesn’t generate any “database is locked” errors, right?
Could you please check that? I would like to do it myself, but my very bad Windows setup doesn’t have this issue to check…
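Something like this would do it (a sketch, assuming a docker node named storagenode):

# count "database is locked" errors since the container started
docker logs storagenode 2>&1 | grep -c "database is locked"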

We have precautionary checks there. The node checks the actual free space on every upload. If there is 5 GB or less of free space, either in the allocation or on the disk, the node will report to the satellites that it’s full, so uploads will stop eventually.
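(You can see the same raw number the node looks at with a plain df; the mount point here is illustrative:)

# free space on the storage mount; 5 GB or less counts as full
df -h /mnt/storj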

Likely yes, because it will run with normal priority (affecting uploads, but after the latest changes the satellites will automatically reduce the upload rate if your node’s success rate starts falling).

Of course; it was not able to update the databases last time.

Maybe. But why does only this exact setup out of thousands have this issue? Something is still not revealed.
I guess that for some reason the disk is not able to keep up. What about S.M.A.R.T.? Is it OK?
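(For example; substitute your actual device:)

# full SMART report, including reallocated and pending sector counts
smartctl -a /dev/sdX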

It sounds from your error description that either the filewalker does not correctly update the databases or your databases cannot hold the values over a node restart. I would not call that fine. I have just restarted one of my nodes and could not observe the issue that you have mentioned:

All my values remain exactly as before the restart.

SMART is perfectly fine. I GAVE YOU THE SPEC SHEETS FOR THE DRIVE. It is a high-speed enterprise-class drive that I used for other purposes for 2 years before putting it on Storj. This drive can easily do 240 MiB/s+ 24/7. It is literally one of the best drives in the world for this.
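(A figure like that is easy to reproduce with a quick read-only benchmark; the device name is illustrative:)

# sequential buffered read timing; safe on a live disk
sudo hdparm -t /dev/sdX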

I also noted that the non-lazy walker is NOT running any faster than the lazy one; in fact it seems to be going slower. But, yeah, good idea on checking for database locks.

And look at that! 20 database locks since I restarted with the non-lazy filewalker a few hours ago!

Must be my setup that’s the problem.

It is obvious that there is literally no amount of information I can provide that will convince you that this is a perfectly normal node, running on specs that exceed all recommendations, configured by someone who has an EXTREME amount of experience with storage and has NEVER had anything remotely like this happen in any of half a dozen other storage-based cryptos or other uses.

It can’t be a bug on your end. It has to be my setup. No matter what.

Let it burn.

Yes, you gave them. I do not trust any Seagate, but that’s my personal problem.
However, your case only strengthens my opinion about Seagate HDDs. But perhaps I’m wrong and they actually are good for someone. Just not for you and me, it seems.

This is an interesting result; I wouldn’t call it expected.

Then I think our experiment is over. You shouldn’t have any.
In this exact case, the only solution is to move the databases off this disk.

I have over 20 other drives of this exact same model. I showed you a screenshot of several of them easily running over 240 MB/s, 24/7. They have done so continuously for months. And I have similar drives doing the same in 4 other machines.

Amazing that the one out of 20 or so disks of this make and model, which easily passes SMART and all testing and evaluation under any other circumstance and load, is the only one with any problem. What are the odds, right?

This doesn’t matter. It cannot keep up; you’ve convinced me. So the only solution is to move the databases to some other drive, one that would not get stuck updating them or keep them locked.

Ya know, I almost forgot to mention this.

I have a couple of other friends whom I (now regret having) convinced to join me in running a Storj node.

Earlier today, I asked them to grep for “database locked”. They get those messages. Not as many as me, but quite a few a day.

You certainly seem to suggest that that should never happen. But 3 out of 3 are getting them fairly consistently. I just happen to be the one getting it the worst out of the 3.

Must be my fault.

There is no point in reporting bugs on this forum. I’ve done support forums for projects I’ve been involved in over the decades and I have NEVER been so dismissive of someone who’s posted so much information confirming he has done everything possible to comply. I’m so done.

Yes, and it’s proven on my very suboptimal setup. It doesn’t have them…
Unfortunately I cannot check that on the RPi 3, because the SD card is not bootable and that setup is in a different country…
But I cannot use my case as a template anyway.
The other 20k nodes can.

Unlikely. Most likely this particular setup is just struggling with the current load pattern. But you said this problem existed for a while before the load tests started.
So, just an unlucky case. Some combination of hardware/software.

Could you please try to move the databases to another disk (preferably one on your HBA, or an SSD)?
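A minimal sketch of the docker side (the SSD path is illustrative, and it assumes the storage2.database-dir option in config.yaml):

# 1) add a second bind mount for the databases to the docker run:
#      --mount type=bind,source=/mnt/ssd/storj-dbs,destination=/app/dbs
# 2) point the node at it in config.yaml:
#      storage2.database-dir: dbs
# 3) with the node stopped, move the existing *.db files from the
#    storage folder (use the host-side path) into /mnt/ssd/storj-dbs,
#    then start the node again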

And if I take the trouble to do that (and I’m not even sure how I would rewrite my docker run to split the storage files from the databases; is there a guide for this somewhere?), and the problem continues, is there ANY chance you’ll concede that there’s a bug in play here?

Cause you were sure there was corruption before, I was sure there wasn’t, I did what you said, all it did was prove I was right, and it killed my node.

If there is the remotest possibility you’ll acknowledge there’s a bug, then okay.

If there isn’t, then nah.

(BTW, it amazes me, reading through these forums and seeing tons of people struggling with similar issues, that you can act like all 20k other nodes are running just dandy. I just told you 3 out of 3 users are seeing these database locks, and you didn’t even blink.)

Thanks.

By the way… I know I seem very sure that it’s a bug and you guys may not understand why. Not sure if I mentioned it in my Database Locking Madness thread, but I just remembered why I’ve been so sure it’s a bug for months, and not a performance issue, all this time.

Because, randomly, each time I restart, it changes which database gets ALL the locking.

One restart, EVERY SINGLE database lock error is on piece_expiration.db.

Another restart, EVERY SINGLE LOCK is on bandwidth.db.

I’ve been a programmer for the last 42 years. I know how to tell software bugs from performance issues.

If it were a performance issue, both databases would be affected on any given run.

But it never is. It’s always one of those databases, or the other. Every single lock affects only one on any given restart of the node.
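(You can see the split with a quick tally; a sketch, assuming the error lines contain the database file name:)

# locked-database errors per database since the last restart
docker logs storagenode 2>&1 | grep "database is locked" | grep -oE '[a-z_]+\.db' | sort | uniq -c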

You tell me. Is that consistent with a performance issue? Or a software issue?

Feel free to blame it on a sqlite3 bug. But I know how to vet shit and my shit is vetted.

I do not see a bug, only a problem with this particular hardware, sorry.
I would say that losing the collected stats while the DB is unable to store them is something I would personally consider not-great behavior. But the bug here is the hardware, which we cannot fix. However, a parameter to let the node know that we are OK with growing in RAM indefinitely, in the hope that at some point the disk won’t be too busy to do the work it was designed for, could be a temporary solution (RAM is not infinite anyway).

Which, again, proves that there is a bug in this particular setup. Otherwise almost everyone would have to have this “bug”, which is not the case, I’m sorry. Even a Raspberry Pi doesn’t have it; @ACarneiro can confirm.

I see. So it’s a HARDWARE bug on my end that causes every single lock to be on one database, or the other database, on any given run.

HARDWARE can do that. That doesn’t sound like software. It must be HARDWARE that reflexively chooses only one sqlite3 database to lock up on, randomly, each time the node is started.

THAT makes even an iota of frikkin’ sense.

Holy crap, this is hopeless.

This is not a problem with you. Please try to understand: some HW works way outside of standard expectations. It’s not your fault. The HW has some issues. We cannot fix them (at least, I do not have enough knowledge of how to fix a disk that is too slow on writes without adding an additional layer of complexity, like an SSD layer).

Please, try to move the databases to another disk. Please. If it’s too hard, OK, leave it as is. But the problem is not likely to solve itself.

Explain to me how a disk being slow on writes could possibly be slow on only one database during one run, and only slow on a different database the next run.