Database locking madness

So my node has been a nightmare for several weeks now. Trash lingers endlessly. Used space accrues but then disappears without going to trash. And when I check logs, I nearly always have a locked database. But which db gets locked changes on restarts! Sometimes it’s bandwidthdb, and sometimes it’s pieceexpirationdb. These locks can happen moments after recreating the container, but once it starts locking, it’s as if nothing in the universe can unlock it. WHY are they locked??! I don’t accept that the disk being slow or being overburdened can result in an endlessly locked database. SQL doesn’t do that.

My cancel rate has been over 75% during this. It generally takes at least 10 minutes for anything in the dashboard, aside from the used space/trash pie chart, to fill in. Sometimes it never gets there.

Why are these databases relentlessly locking? Every thread I pull up says “check this thread for how to fix corrupt databases”, but I don’t get any of the associated corrupt database messages. It’s JUST locked, not corrupt. Endlessly. Forever.

I’ve just tried recreating the container with the switches to turn off the filewalker. I don’t know what else to do. I’ve read through a dozen threads, and this database locking issue seems to be afflicting dozens of people. Oh, and the advice to “Move the dbs to an SSD” - no. Sorry. I am frankly convinced that these database locking issues are just a bug. I’m not devoting additional valuable hardware to compensate for a bug. And my node will just have to perform horribly until this bug is fixed, or an actual solution is proffered.

Sorry, but I’ve been trying to deal with this for weeks now and my patience is done. It doesn’t seem to be hurting my reputation, at least, but it has effectively stopped ALL ingress. If it were hurting my reputation or audit scores, I’d frankly just be wiping the drive and finding another purpose for it. And I really like Storj as a project! So, yeah, the last few weeks, and particularly the last 4 days of trying to deal with this, have been that bad.

Please give me something other than “move your databases to an ssd that will just add another point of failure” or “follow these directions to fix corrupt databases even though you have not received any error messages that would confirm they are corrupt at all, just endlessly locked”.

1 Like

Actually, it is my understanding that a locked database means the disk is too busy.
That’s why the advice is to move it to a faster disk.

My suggestion was to implement a way to run them in memory. That would be the fastest way.

That will be another “nightmare” - you will need several GB of RAM to keep them in memory.

Hello @Qwinn,
Welcome to the forum!
To avoid “database is locked” errors you need to move them to a faster drive or increase IOPS on the currently used drive; unfortunately, there are no other alternatives. And this is not a bug, it’s a simple limitation of your current setup.

If you use any kind of RAID, an SMR drive, exFAT under any OS or NTFS under Linux, or even worse - a network filesystem like SMB/CIFS or NFS to connect your storage - that is the culprit, and you can likely fix it.
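
For example, a quick way to check what filesystem the storage path is actually on (the mount path here is only a placeholder for your own data directory):

df -T /mnt/storj        # prints the filesystem type - it should say ext4, not ntfs/exfat/nfs/cifs
findmnt /mnt/storj      # shows the source device and mount options for that path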

You’re providing your storage and bandwidth to the Storj network, so you should expect it to be used, and used hard. If your setup is not capable, or you do not want to improve it, then perhaps this is not the project for you.

4 Likes

Hi Alexey,

I do not use RAID or a network drive. The drive is formatted ext4, and it is connected directly to the motherboard. OS is Linux Mint 21.3. Regular docker install, no docker compose. The drive is a Seagate Exos X18 16TB, helium-sealed CMR, advertised as enterprise class, on which I allocated 14.5TB to Storj; nothing else occupies the disk. Its rated sustained transfer rate is 270MB/s, which is actually better than my Western Digital Golds. Used space has sat at about 5.1TB for a month now, since due to these issues it doesn’t appear to be able to retain any ingress. Normal operating temperature of the drive is 35°C. LAN speed is 2.5Gbit, and the WAN connection is 2Gbit fiber.

I think my hardware and setup are about as ideal for Storj as it’s possible to set up, but if you see any errors on my part in the paragraph above, do please let me know. I tend to be a bit OCD about this sort of thing. I submit that if I’m having this much trouble with the specs above, it’s probably pretty common.

I am and have been willing to invest quite a bit of time and effort maintaining my node at maximum uptime. I have even been willing to ignore that extra 1.5TB (10%) of empty space we’re told to maintain on the drive when doing my $/TB calculations. But additionally requiring those with fairly ideal setups to dedicate space on an SSD to highly write-intensive databases, wearing it out in the process, isn’t what I signed up for.

All that said - after shutting off the filewalker completely last night, matters seem to have improved a bit. My cancel rate is down to “only” about 20% now, the lock messages have mostly stopped, and I’ve seen a small reduction to my trash for the first time in about a month (down to about 760GB, first time it’s been less than 910GB in weeks). I don’t know how big a deal it is to run constantly without the filewalker, but I guess I can try to turn it back on when the ingress settles back down to normal levels.

1 Like

Indeed this looks like a set up that should definitely work. Could you please run iostat like this, including all devices relevant for your storage stack (drive itself, any LVM/LUKS/whatever intermediate layers if you have any) and collect IOPS data?
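
For example, something like this (a sketch - iostat comes from the sysstat package, and you can append specific device names if you only want the drives in your storage stack):

iostat -x 5      # extended per-device stats every 5 seconds; watch r/s, w/s, the await columns and %util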

This is normal. Consider that out of the 110 connections an uploader opens, only 80 need to finish, so 30/110 ≈ 27% is cancelled simply because your node might have been too far from the uploader, or there was a bottleneck between them and your node. 20% is better than average.

Speaking as a node operator, this is indeed a bit silly.

2 Likes

Another factor that could affect the load on the disk (and therefore the locking of DBs) is the amount of RAM available to the system.

As the OS will cache some files from the HDD in memory, if memory is a bottleneck it can result in increased IOPS and therefore higher load on the drive, resulting in lockups.
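
A quick way to see whether memory is actually tight (standard Linux tools, nothing Storj-specific):

free -h     # the buff/cache column shows how much RAM is currently being used for file caching
vmstat 5    # persistently low free/cache together with high bi/bo suggests the cache is being squeezed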

From your setup I believe you are not using a hypervisor (e.g. Proxmox), but if you were, you could also enable caching in Proxmox to further help with the load on the drive.

Lastly, if you don’t already, it’s good to keep an eye on the SMART attributes of the drive in case it is failing. Since SMART warnings can potentially predict a failing drive, they help diagnose some drive issues. I have some drives with slight IOPS issues that also throw SMART warnings, so those issues are most probably due to the drive starting to fail/degrade.
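
For example, with smartmontools installed (the device name is a placeholder for your actual drive):

sudo smartctl -a /dev/sdX         # dump all SMART attributes and the self-test log
sudo smartctl -t short /dev/sdX   # start a short self-test; check the result with -a afterwards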

1 Like

IOPS generally seem to be around 150/s… 250 at most. Disk util% is constantly at 92-98 because of the endless filewalker. If I don’t run the filewalker, it sits at 20-30. More on the filewalker below. I’ll note that the %rrqm and %wrqm figures in iostat are generally red and in the 80-90% range most of the time.

I must be spoiled then; for the first few months I ran my nodes I was accustomed to 98%+, often 99%+. I considered anything less than 95% worrisome. Right now, I’m at 92%.

64GB.

SMART checks out fine.

Okay, so it looks like my horrible time was mostly a result of my node being restarted for a version update during one of the biggest ingress spikes I’ve seen (over 160GB of ingress alone), which triggered this endless filewalker on top of it. Once ingress died down, I restarted with the filewalker allowed in lazy mode. The trash filewalkers finished in reasonable time, but the used-space filewalker took 6 hours to do just the Salt Lake satellite, and has now been working on just the US satellite for over 21 hours. Otherwise, the node is behaving mostly normally now (as I said, 92% success rate, but I normally get 98-99%).

I have to admit, I can’t understand how this “lazy” filewalker can single-handedly keep %utilization near 100% constantly for literal days, on only a 5.2TB / 14.5TB node. That’s just crazy. And I read other posts where people say “it should be finished in half an hour, you’re doing something wrong!” Except the people who say that never, ever identify what could be wrong. At least never in any way that applies to my setup.

I’ve described my system in detail. I am overprovisioned on practically every level. If there is some aspect of a proper node setup that I have somehow missed, I haven’t seen it in countless hours of reading this forum. WHY does my filewalker take days? I could and have filled an empty 18TB from start to finish with sequential writes in a quarter of the time this filewalker is taking on a 5.2TB node.

Honestly this filewalker takes so long that I don’t see how it could ever finish. My node will almost certainly be auto-restarted to update version or something before it could happen, and then it’ll just start all over again.

CPU in question is an i9 10900 (10 core), btw.

I have seen some posts by people saying that filewalker doesn’t really need to be run more than once a year? Is that really true? If so, why would it be set to run every single time the node is restarted for a version update? Could it at least be set to only trigger a filewalker run if the node was restarted manually, and not for a version update?

1 Like

Iterating over every file not just once in a while but frequently - maybe that was an idea that was easy and cheap to implement when nodes were at 500GB.

But now, with nodes at 15TB, this is pure madness. And on top of that come the frequent restarts after updates, and everything starts all over from the beginning - except when data was kept only in RAM, like with the Bloom filter until that was recently fixed. Plus, independent of the load, i.e. customer usage of the node, it gets hammered by filewalkers, database accesses and all.

And the thing is, you should be able to use what you have. So it should basically run without issues on anything, without the need to invest in SSDs, RAM or whatever.

I agree, all of that sounds totally crazy.

But this (the red %rrqm/%wrqm figures) seems to be a good thing:
1556872 – iostat %wrqm coloring misleading

The attention-marking is misleading: a number approaching 100% is not at all a bad thing - in fact it’s good that the system can reduce the number of IOPS by merging sequences. It’s far from clear that any range of values in this field can be considered as either acceptable or bad.

Edit: We are very much looking forward to having this stop-and-resume feature ready and deployed:
save-state-resume feature for used space filewalker
Can’t stress enough how much this is needed!

2 Likes

Dude, of course it should be turned off at startup: storage2.piece-scan-on-startup: false
It doesn’t matter for the payout; it’s just for the local system to know whether the free space left is correct.
If You have plenty of free space, You should NOT bother.
Are You using those 18 disks in 1 PC machine?
The bottleneck for counting small files can be in the motherboard and the controllers’ capability.
It reminds me of my setup, but I use VMware. It’s fine when I run 1 instance - the filewalker is fast, 30 min - but when I run 13 instances of VMs x 16TB, then the filewalker on every machine takes 60-76h+, for what once took under 1h.
So yeah, many bottlenecks just in the speed of RAM, a chip design inadequate for such a setup, and the controllers.

1 Like

Alright, thank you for explicitly confirming this. That did seem to me to be the case - as long as it’s not getting close to running out of allocated space, not worth the extreme pain. So I am going to try to let this current run finish (30 hours into trying to get it to finish just the US satellite, after 6 hours for Salt Lake!), then turn it off until I’m within a TB of filling my node.

Interesting re: the mobo/controller possibly being the issue; I had not considered that. I have 10 HDDs in this machine: 6 connected directly to the mobo (including the Storj node drive, of course) and 4 on a cheap SATA expansion card, with all but the Storj drive being Chia drives. The amount of reads on those 9 other drives is so trivial that it didn’t occur to me it could actually be relevant.

Is it possible that this is a CPU-bound process? Considering the constant 95% util in iostat -x, I wouldn’t have thought so, as that implies it is I/O-bound. I am running other processes on the box though, totally unrelated to Storj. CPU usage is generally around 90%, but I did make sure to leave a couple of threads free for Storj to use. When I stop the process causing that usage and run s-tui, I see virtually no CPU utilization whatsoever on even a single thread, even with the Storj node and filewalker running.

For the record, the disk I/O required by those other processes I’m running is seriously minimal. There are NFS reads on the Chia drives, and a process that writes about 1 gig every 3m30s, and that’s it. All to drives other than the Storj drive, of course.

Other processes don’t use that much RAM either, btw. At least 50GB of the 64GB is continuously available.

I have stopped that process and am willing to keep it off for a few hours just to see if that actually allows the filewalker to finish, but I’m not hopeful.

The recent updates brought us some problems with CPU usage.
I saw some posts about it, and 2 of my nodes were affected as well: no abnormal traffic, but the CPU usage for Storj was 2 times higher. Restarting the node solved that for me, I guess.
You gotta scan the threads about good node practices here, and @Ottetal’s posts as well.
Being a SNO is like being a mad scientist - You gotta do experiments to find the optimal settings :man_shrugging:

1 Like

I read these forums frequently. Finding well-laid-out best practices for a Linux node has actually proven quite difficult - I see several threads for Windows, but not really any dedicated to Linux, except for specific setups like virtualization that I’m not using.

Having a direct mobo connection and leaving 10% free space - that’s about all I’ve read in these posts that is applicable to me. When searching for such advice, I wind up spending a great deal of time on posts talking about things like needing a decent amount of RAM, or not having the node on a network share - which isn’t applicable to me, since I do (have plenty of RAM), and I don’t (use a network share).

I gave a pretty detailed description of my setup. If you can think of anything I’ve missed or failed to do (aside from this new advice you’ve given me that I didn’t find explicitly stated before, that it really is okay to leave the filewalker turned off very long term), please do share.

Anyway - I guess it WAS the CPU processes I had running, because after shutting down that other process, the US one quickly finished (after having been running for 36 hours), and then the EU satellite, with about half of the US’s storage, finished in just 10 minutes! Wow. So. On the one hand, great to find out what was causing the problem. OTOH, if the Storj node requires the entire CPU to itself to work properly, not so great.

I will try testing with the “lazy” feature turned off, see if that gets a more balanced workload distribution.

Frankly, I had hoped you’d paste a screenshot or something, so that all the numbers across the whole stack would be visible.

“lazy” is a bit of a misnomer. It is deprioritized, i.e. the operating system is told to run it only when there are no other activities. But it will still fill up any free IOPS.
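
Conceptually it’s similar to running a command under Linux’s idle I/O scheduling class - it only gets the disk when nobody else is asking for it, yet on an otherwise idle disk it will still push utilization to 100%. Just an analogy (the path is a placeholder), not a claim about how storagenode implements it internally:

ionice -c 3 du -sh /mnt/storj/storage/blobs   # idle-class I/O: yields to other disk users, but still saturates a quiet disk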

This, on the other hand, is a problem. At 5.2 TB and with so much RAM the filewalker should take tens of minutes, not days.

Each setup is different, and we can only offer guesses as to what happens in your system. And, frankly speaking, I myself am too tired of trying to do the intelligent-telnet method of debugging other people’s problems, so I’m not attempting it here, sorry.

Compute capacity should not matter. I’m running >50 TB on an N36L, which is worse than an RPi4. However, there are I/O bottlenecks that masquerade as high CPU use, and these are ugly.

1 Like

It can be true for the used-space filewalker, if you did not have a discrepancy between the actual used space and what was reported as used before, and if you do not have a side load (Chia mining, your own usage, etc.) on the same disk.

Yes. Set this option to true (or remove it):

If you use a config.yaml, this option would be

storage2.piece-scan-on-startup: false

so you need to either set it to true or comment it out, then save the config and restart the node.
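
For a Docker node you can alternatively pass the same option as a flag at the end of the docker run command instead of editing the config.yaml (a sketch only - mounts, ports and environment variables are omitted, use your usual ones; removing the flag restores the default, i.e. the scan enabled):

docker run -d --name storagenode \
  ... your usual mounts, ports and environment variables ... \
  storjlabs/storagenode:latest \
  --storage2.piece-scan-on-startup=false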

This is how disk load is displayed. The process may use only a few KiB/s and a few IOPS, but if it keeps the disk busy the whole time, utilization will show as 100%.

Oops. This could be a root cause of all your problems. The cheap cards do not have separate controllers and share the same bus, so everything can be (at least) 4 times slower, and buggy.

This seems… odd. Wouldn’t disk utilization always then be either 0% or 100%? Because I see numbers in between all the time, except when the filewalker is running.

Wait, are you suggesting that the mere existence of one of those cards in the PC will negatively affect the Storj node that is on a direct connection to the motherboard, as noted in the sentence prior to what you quoted?

About that. I disabled the used-space filewalker quite a few days ago with the parameter you described (thank you). Since then, if I grep the logs for “walk”, I don’t see anything. I seem to remember the trash filewalker (which generally always ran fairly quickly, unlike the used-space one) running something like daily. I wouldn’t mind if the trash one kept running. Is there a way to turn off the used-space one without turning off the trash one?