Unfortunately it’s related only to the disk’s ability to add bytes to the SQLite databases quickly enough when it’s struggling with other load. The CPU usually doesn’t matter much, unless your motherboard relies entirely on the CPU for all I/O operations, i.e. a “Windows only” board or software emulation… Usually the read-write cache should help here. If you use Linux and can add more RAM, the issue will likely resolve itself. The alternative is to move the databases to another disk, an SSD for example (or even a USB stick, as suggested by several Community members, if you don’t mind losing the history and stats should it die), or to add an SSD as a cache layer (a special device in the case of ZFS, an SSD layer in the case of LVM, or Primocache in the case of Windows; you may also use tiered Storage Spaces on Windows). On Windows you may also enable the write cache in the disk properties if you have a UPS (you need to select both checkboxes there):
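A rough sketch of the database move for a docker setup on Linux, assuming a container named storagenode, the usual /app/config layout, the storage2.database-dir option, and example paths you would replace with your own:
# Stop the node gracefully and copy the existing *.db files to the SSD first
docker stop -t 300 storagenode
mkdir -p /mnt/ssd/storj-dbs
cp -a /path/to/storagenode/storage/*.db /mnt/ssd/storj-dbs/
# In config.yaml point the node at the new location:
#   storage2.database-dir: /app/dbs
# Then recreate the container, adding this bind mount to your usual docker run flags:
#   --mount type=bind,source=/mnt/ssd/storj-dbs,destination=/app/dbs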
Hi Alexey,
Thank you for attempting to explain in more detail.
I am running my node on Linux Mint 21.3. Disk was formatted as ext4 with default settings. The machine has 64GB of RAM and to my knowledge has never come even close to using more than 25GB of that.
I continued to leave my other job off to give the node a chance to survive. It actually managed to complete the Salt Lake satellite around 6 hours ago. And, astonishingly, all this did was decrease the amount of used space being reported on the dashboard. Prior to it completing when I went to sleep last night, it was saying 12.06TB used and 0.96TB trash. I woke up to the Salt Lake satellite being done and dashboard saying 11.96TB used and 1.46TB trash. It still claims 1.08TB free.
In actual fact, per df --block-size=1, the node is using 14586947227648 bytes on disk with only about 480g space remaining. (I have 14.5TB allocated, so it’s now grown well past the amount of space I allocated)
In fact, I can’t even understand this df result, now that I really look at it:
/dev/sdf 15873633447936 14586947227648 486624411648 97% /mnt/storj/storj1
Those don’t add up at all. 15.87TB of space, 14.58 used, only 0.48 available? I should still have like 1.43TB available. As of around 6 hours ago, before I went to bed, it was at roughly 600GB available, which has now dropped to 480GB. I doubt this node has a day left before it totally fills the drive, but why is it missing over a TB of available space?
As far as I understood, ext4 didn’t suffer that much from fragmentation. I’ve never seen this kind of discrepancy before on a disk in ext4/linux. What’s going on?
UPDATE: Current status of the filewalker is on the US satellite at directory vq.
Thing is, if I could get df to recognize the actual amount of space remaining, to actually show 1.43TB free instead of 480GB free, that should save the node, as it would at least have more actual free space than the dashboard is showing (1.08TB free). How can I fix this?
Okay, apparently I didn’t set -m 0 when initially formatting it. I think I left it that way because I’d heard somewhere that ext4 could use that reserved space for defragmentation purposes and I didn’t want to risk exactly what seems to be going on. But, at this point, I need that space.
So, I ran tune2fs to set it to 0, and I do indeed show a bunch more of the space as available now.
/dev/sdf 15873633447936 14599517745152 1274098925568 92% /mnt/storj/storj1
Certainly helped. At least disk available per df now exceeds dashboard free space (though not by much, forget the 10% buffer I tried to leave). Should at least have time to finish the filewalkers now.
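For anyone else hitting the same wall, this is roughly what I ran; tune2fs can change the reserved-block percentage on a mounted ext4 filesystem (device name from the df output above):
# Show the current reservation (the ext4 default keeps 5% back for root)
sudo tune2fs -l /dev/sdf | grep -i 'reserved block count'
# Drop the reservation to 0 so that space becomes available to the node
sudo tune2fs -m 0 /dev/sdf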
On virtually every other HDD I work with (over 60), I format them with -m 0 and -T largefile4, which greatly reduces the inode count, because the files I typically work with are huge. I didn’t do that for Storj because I knew it could host many much smaller files and I thought a standard count of inodes would be required.
I know from experience that reducing inodes greatly speeds up du, which would probably really help with filewalker runs. In hindsight, and reading forum posts that seem to suggest an inode per 4k is fine, I wish I had formatted with -T largefile4.
Unfortunately, it’s too late, unless I shut down to copy the whole storj node to another, properly formatted disk (which would be every single other disk I own, sigh)
I would recommend that proper ext4 formatting options should be added to the documentation. I have a feeling if I’d known to format my storj node like I do every other disk, I wouldn’t have had half these problems. Expecting people to dig through the forums to find them before formatting the disk for their very first node isn’t reasonable, IMO, not when there seems to be a commonly accepted standard of at least -m 0 and -T largefile4, and when they need to know about these options AHEAD of time, during initial setup, as the setting of inodes on format is irreversible. If the standard documentation doesn’t mention any specific options for a process, particularly an irreversible one, at the initial setup stage, then I’m going to assume defaults are recommended and not even know to keep looking, and I think most people would assume the same.
EDIT: Never mind the above information, made a mistake. Leaving it so following conversation makes at least some sense, but no, apparently -T largefile4 will not work for Storj.
I still have a feeling this is a special, isolated use case, because nobody else has claimed that it helps and thus must be added to the documentation.
By the way, the documentation is on GitHub; you may submit suggestions as a PR, and our team would be happy to accept a Community contribution!
Back to the topic: it’s not confirmed that special formatting is required for ext4. If it helps for your setup, that’s great; however, it will remain here as a Community suggestion until properly tested by the other 20k setups. I’m sorry…
EDIT: Never mind, made an error and don’t want an old post lingering with incorrect info.
I thought Storj nodes typically had 3-4mil files per TB?
Do they? I have found other posts indicating that it doesn’t reach that. If that’s the case, then never mind, I guess full inodes IS required.
Yeah, just checked my documentation and I had it in my head that -T largefile4 reserved one inode per 4k. Nope, my bad. It’s one inode per 4 megs. Which is fine for chia and a lot of other storage based cryptos but not for Storj.
4 million files per TB would suggest an average file size of 250k.
I guess I misinterpreted some of what this post suggests, or else the proper setup requires tuning other parameters, not just raw inode count.
I wouldn’t start tuning the FS until you actually hit a limit. At the moment you may address the load in multiple ways, depending on what’s available to you.
According to the information provided, you are not limited in hardware, it’s “free”, and you are willing to dedicate it to Storj (despite our recommendation to use only what would be online with or without Storj, but well).
From that point I would recommend not playing with the ext4 settings (please keep them at defaults, so nobody can claim the bug is in your setup) and adding more RAM. That’s it.
Please forgive me if I’m wrong in my assumptions; however, your setup is not standard and not recommended, I’m sorry.
Wait, what’s not standard in my setup?
Linux
ext4 on default settings (aside from me just now setting -m to 0, because I would’ve imploded otherwise)
64GB of RAM
docker settings pretty much straight out of the documentation
Please be more specific as to what you think is not standard or recommended. Because as far as I can tell, that’s as recommended as it gets.
As I’ve said, I never exceed 25G of RAM usage out of the 64GB even when I am running other things, so I really don’t get why you’re suggesting I add even more.
(If you were referring to the fs tuning recommendation I just made, yeah, I realized my mistake and edited to correct that in all relevant posts. What I’ve described here, though, are settings that, as far as I can tell, are entirely the recommended ones (aside from “you can’t run anything else on the CPU”), and those have led to my current database locking and other issues. If you actually have a concrete suggestion to make as to what I should change, except for the seemingly nonsensical “add more RAM even though I never have less than 32GB completely free”, I’m afraid I’ve missed it.)
EDIT: Oh yes, it’s also been recommended that I put the database on an NVME. That wasn’t recommended or stated as required in the original docs I read, and if I’d known that was a requirement, I would not have signed up I’m afraid.
So, boy, was I wrong.
Despite having turned off all other processes over a day ago, to give the filewalker 100% resources to finish… and it so far having completed approximately 90% of the filewalk just for the US satellite in a practically BLAZING fast 14 hours compared to its normal run time…
I am still receiving many, many, many database lock errors. Well, previously I wouldn’t have called it many, but earlier I was advised that 7 over a period of several days was too much. Which apparently means that when the node tries to write something to the database and finds it “locked”, it doesn’t hold on to the information and retry once the database becomes unlocked; it just throws it away. Because that’s the only way that “7 database locks in several days is too many and can result in your issues” can make any sense.
But I just ran a grep, and in the last 18 hours, while my CPU usage has been roughly 0-1% the entire time, and while I’ve been running nothing but Storj, following every recommendation possible except putting the db on an NVMe (or adding RAM to a system whose memory usage hasn’t exceeded 9GB out of 64GB in those 18 hours), I’m still getting 78 database lock errors.
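For reference, the 78 comes from a count along these lines (container name as in my setup; “database is locked” is the exact string in the error lines):
# Count database-lock errors in the node log over the last 18 hours
docker logs storagenode --since 18h 2>&1 | grep -c "database is locked"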
And I am repeatedly told it has to be my fault.
I quit.
So, yeah, I’ll run your database checker thing now. Might as well, though even if it shows a problem, it’s unlikely I’m going to restart this. Cause I did nothing wrong to create these issues in the first place. The system’s on a UPS and hasn’t had a single unintended shutdown in the last 8 months.
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/bandwidth.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/garbage_collection_filewalker_progress.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/heldamount.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/info.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/notifications.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/orders.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/piece_expiration.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/pieceinfo.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/piece_space_used.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/pricing.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/reputation.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/satellites.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/secret.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/storage_usage.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/used_serial.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$ sqlite3 /mnt/storj/storj1/storj1/storage/used_space_per_prefix.db "PRAGMA integrity_check;"
ok
qwinn@Gungnir:~/storj/dbbackup$
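For anyone repeating this later: the same check can be done in one pass over every database file (node stopped, same path as above):
for db in /mnt/storj/storj1/storj1/storage/*.db; do
    printf '%s: ' "$db"
    sqlite3 "$db" "PRAGMA integrity_check;"
done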
It is hard for me to express how annoyed I am right now.
BTW, here’s a snapshot from my other process that I’ve just restarted, and will never again stop for Storj’s sake, that is currently working on 10 other hard drives in this system. All the databases it uses are also on the hard drives.
The ones reading less than 240 MiB/s average are only doing so because they are working on much smaller data sets than the others. Every disk there would read at 240 MiB/s+ if called upon to do so. Consistently. 24/7. And several of these disks are the exact same model and batch that my storj disk came from. All enterprise HDDs, btw.
And this only uses 40% of the CPU.
And somehow it manages to do all this using less than 20G of my 64GB of RAM.
But one disk from Storj needs more than 64GB of RAM. I need to add more.
It’s because I’m doing something wrong, or my system sucks, or I don’t know how to mount a disk, or I don’t know how to set up a standard Linux install even tho it’s my actual frikkin’ day job that I’ve been doing for decades, or something.
Yeah, I still quit.
EDIT:
AND - LMAO - I just restarted the node, out of curiosity. Just to see what the effect was of me stopping and restarting the node to do your PRAGMA integrity_checks that revealed absolutely nothing wrong with my databases, which is exactly what I predicted would be the case.
Used space dropped by 1.5TB. My node’s free space is now once again higher than the free space per df. A LOT higher.
Thanks for the help!
What my post suggests is -i 65536, which effectively means expecting files that are on average larger than 64 KB. This has so far been the case, though indeed the average piece size has decreased since I wrote it. On the other hand, with the parameters @littleskunk is testing now, I expect the average piece size to increase.
In any case, my nodes are hosted on disks formatted exactly this way and I still see >50% of inodes free.
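Roughly, that means something like this at format time, plus a quick way to check inode usage later (a sketch; the mkfs options are irreversible, so double-check the device name, and the paths are placeholders):
# One inode per 64 KiB of capacity, no root reservation
sudo mkfs.ext4 -m 0 -i 65536 /dev/sdX1
# Afterwards, watch how many inodes are actually used on the mounted filesystem
df -i /mnt/your-storj-mount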
Also, I know this won’t help you much, but I’ve learned not to care much about what the dashboard is stating. It is unreliable. I’ve got my own tooling to collect useful metrics. As such, the only “database”-ish problem I care about is whether the orders are collected correctly. That kinda matters a lot (-:
The reason I care is because it is my understanding that the dashboard used space thing doesn’t really matter much, EXCEPT that as long as it thinks free space is available, it will keep requesting more data.
Right now, because of this used-space-drops-by-terabytes-on-restart bug, my node thinks there are 2.36TB free, when in fact there are only 700g left on the disk. And my node will keep requesting data from the satellites until it literally crashes when it tries to write to a completely full disk.
I really don’t see any way to prevent the implosion at this point.
Edit the numbers directly in the database. Dirty, but works.
Though, there’s a safety check that is independent of what the node thinks the space is: it checks free space on the file system directly. It should prevent catastrophic failures as long as you can trust your file system to meaningfully report free disk space. With -m 0 this should work (though it should work just as well with -m 5, so I’d kinda be careful about that…).
Is there a guide somewhere on how, which, and where to edit the numbers? I do have SQL skills, tho generally only through tools like SSMS, which isn’t available here, I don’t think.
As far as a “safety check”, it was my understanding that it can’t do a filesystem check when it’s a docker container, because it can’t see the device’s context outside of the container. If it can do filesystem checks, one has to wonder why it doesn’t always do an instantaneous df and make sure free space as reported on the dashboard never exceeds what’s actually left on the disk.
You can see the recommendations there:
At the bottom of the article we add suggestions when they become common, thanks to the Community.
If you used default settings for everything and didn’t overprovision or overcomplicate the setup, it should work well.
This is usually not needed, unless your disk subsystem is struggling to add data to the databases. Not all setups perform equally, so moving the databases to an SSD was requested by the Community and implemented; however, it is not added to the article above because it’s one more point of failure.
Unfortunately I couldn’t find your setup described in full detail, except that you use a custom ext4 and Linux Mint 21.3, that you use docker, and that you have 64GB of RAM. Also probably an SSD, since you mentioned it.
Is your disk SMR? Do you use any RAID, zfs, or BTRFS under the hood and then format the volume/virtual disk as ext4?
How is your disk connected? Is it a network drive (NFS/SMB/CIFS/etc.)?
Do you have more than one node on the disk/pool?
Does this job also use the disks’ I/O? Perhaps the controller is a bottleneck?
If “database is locked” is somehow related only to CPU usage in your setup, could you please bind the storagenode processes to one core and not allow this other job to use it? See bash - Assigning a cpu core to a process - Linux - Stack Overflow.
If “database is locked” happens for the filewalkers only, perhaps disabling the lazy mode could help, and you could still run the other job?
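A sketch of both ideas for a docker setup (container name assumed; the lazy-filewalker option name is taken from the node’s config file and may differ between versions):
# Pin the running container to a single CPU core (core 0 here)
docker update --cpuset-cpus="0" storagenode
# To disable the lazy filewalker instead, add this flag to the storagenode
# arguments (or set the matching option in config.yaml) and restart:
#   --pieces.enable-lazy-filewalker=false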
We must exclude the possibility of database corruption, which you did. Thank you!
So you have 10 other processes using disks. Do you use some special controller? Or does your mobo have more than 10 SATA/SAS ports?
Could you please show the result of docker stats?
If the process requires so much RAM, the disk is very slow for some reason. But if it’s not a VM, it should use the system cache for operations.
Then in the current situation I do not see another option except to move the databases to another disk with less load, or to an SSD. While you have database locks, the accumulated usage data will be lost. It is kept in memory for some time, but not forever.
However, I will ask the team whether we can avoid wiping it in the case of a “database is locked” issue. I think we should have a limit here, but it could be hardcoded, because not limiting it could trigger an OOM.
No guide, sorry. I’d just follow the code, searching for UPDATE piece_space_used. I guess setting all the numbers in piece_spaced_used.db to exabytes should stop traffic, while giving some nice vibes on the dashboard.
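If anyone tries this, it’s safer to confirm the schema first than to guess column names; a sketch with the node stopped (file name as above; verify the table and column names from the .schema output before writing any UPDATE):
docker stop -t 300 storagenode
# Show the actual table layout before editing anything
sqlite3 /mnt/storj/storj1/storj1/storage/piece_spaced_used.db ".schema"
# Inspect the current totals the node believes in
sqlite3 /mnt/storj/storj1/storj1/storage/piece_spaced_used.db "SELECT * FROM piece_space_used;"
# Craft the UPDATE against the columns shown above, then restart the node
docker start storagenode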
The container has access to rudimentary information like free space as long as it’s a proper mount, not some weird overlay.
Beats me. A single syscall fetching data directly from RAM cache wouldn’t cost much performance.
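A quick way to see what the container itself can observe (assuming the standard /app/config bind mount, a container named storagenode, and df being present in the image):
# With a plain bind mount this reports the real numbers of the underlying ext4 filesystem
docker exec storagenode df -B1 /app/config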
If you used default settings for everything and didn’t overprovision or overcomplicate the setup, it should work well.
I did, and it should, but it does not.
“Unfortunately I couldn’t find your setup described in full detail, except that you use a custom ext4 and Linux Mint 21.3, that you use docker, and that you have 64GB of RAM. Also probably an SSD, since you mentioned it.”
I’ve reviewed that prerequisites page. Let me list my setup in terms of everything I see on that page:
Linux Mint 21.3
Node is ext4, not custom in any way, completely default format except for the -m 0 reserved space I changed it to earlier today
Storj drive is connected to the motherboard directly via SATA port
The other 10 drives you saw in my screenshot are connected either through motherboard SATA ports or through an LSI 9207-8i HBA.
The disk is device model ST16000NM000J-2TW103. This is a Seagate 16TB Exos x18 Enterprise drive. Data sheet here. It is literally probably among the top 3 best and fastest hard drives one could possibly buy for this purpose: https://www.seagate.com/files/www-content/datasheets/pdfs/exos-x18-channel-DS2045-1-2007US-en_US.pdf
I do not and have not used RAID, zfs, btrfs, or any sort of logical or virtual volumes. It is a straight up vanilla ext4 format. No tricks, no special sauce. As simple and vanilla as it can possibly be.
I only have this one node.
The disk is not connecting over a network.
The LAN connection is 2.5g.
The WAN connection is symmetric 2g fiber.
“If “database is locked” is somehow related only to CPU usage in your setup, could you please bind the storagenode processes to one core and not allow this other job to use it?”
It’s not a bad idea, but the fact is, I still got 78 database lock errors over an 18 hour period when that other process wasn’t even running. When Storj was literally the only thing running besides the OS.
We must exclude the possibility of database corruption, which you did. Thank you!
Well, the act of excluding that possibility, which required me restarting the node and yet again triggering the bug in question, which has now increased the reported free space FAR beyond what the disk actually has available, which is the explicit reason I initially gave you for not wanting to do this test when I felt fairly confident the databases were not corrupt, has completely doomed this node. Do you have a way to save it now that we’ve excluded database corruption as the cause? Cause I don’t.
If “database is locked” happens for the filewalkers only, perhaps disabling the lazy mode could help, and you could still run the other job?
This would be a far more palatable option if the non-lazy filewalker produced even a SINGLE line entry in the log to show me it was doing anything. Yes, I know, it’s an issue on Github. Great.
At any rate, my restart to eliminate database corruption as a culprit has exacerbated the crisis to the point that I highly doubt the non-lazy filewalker could possibly correct any of the relevant issues before my node implodes by attempting to write to a completely full filesystem. 644g left, and the node thinks I have 2.29TB left. Do you think the non-lazy filewalker could finish all 4 satellites in the time it takes the current crazy level of ingress to add 644g to my node? I don’t.
But. What the hell. Doomed anyway. Might as well try it. Restarting now disabling lazy.
Hey look, node thinks it’s 2.62TB free space now!
Could you please show the result of docker stats
Then in the current situation I do not see another option except to move the databases to another disk with less load, or to an SSD. While you have database locks, the accumulated usage data will be lost. It is kept in memory for some time, but not forever.
But since I have hopefully by now demonstrated that I literally have an OPTIMAL setup per ALL of your recommendations, that I in fact exceed recommended specs in virtually every regard, and that I am still having this issue, could you possibly stop treating it as something I need to fix, and consider that maybe it’s something you guys need to fix?
Cause it’s not like I’m the only person reporting these issues. I didn’t even start this thread.
Thanks man. Hey, is there any process to check the progress of the non-lazy filewalker, like the one you gave for the lazy version? I’ve tried grepping processes for used, walk, storj, app, blobs, etc., and I’m not finding a process that seems to match any kind of filewalker.
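For reference, this is the kind of broad sweep I’ve been trying, over both the process list and the log (container name as in my setup; exact log wording varies between versions):
# The lazy filewalker runs as a separate low-priority subprocess, so it can show up here
ps aux | grep -i -e filewalker -e used_space | grep -v grep
# And a deliberately broad pattern over the node log, in case anything related is mentioned
docker logs storagenode --since 1h 2>&1 | grep -iE 'filewalker|used.space'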