ERROR blobscache

I'm seeing this issue in my logs: for 25 minutes it logged ERROR blobscache at around 958 entries a SECOND!!

That's roughly 1,437,000 entries in the log over 25 minutes?! It's absolutely hammering the SSD.

And according to my logs it only happens in the morning, between 05:00 and 08:30.

entire log: storagenode.log is available for download

2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -363264}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -362752}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -2319872}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -2319360}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -580608}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -580096}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -5376}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -4864}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -2319872}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -2319360}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -4096}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -3584}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -16128}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -15616}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -10752}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -10240}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -1280}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -768}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -580608}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -580096}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -182016}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -181504}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -1280}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -768}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -2319872}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -2319360}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -2319872}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -2319360}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesTotal < 0	{"satPiecesTotal": -2319872}
2024-02-25T07:51:58+01:00	ERROR	blobscache	satPiecesContentSize < 0	{"satPiecesContentSize": -2319360}

I'm thinking this is one reason why the filewalker has not finished on this node; the filewalker has finished on all the other nodes that don't show this issue. But I have this exact same problem on probably 10+ nodes.

What's going on?

How many days has this been happening? It could be a correction to the SQLite cache that will fix itself, or a lot of deletes happening, but if it's ongoing day after day, you may have a more serious issue.

These errors mean that the cached values in the database don't match the piece values on the disk. This can happen if you re-created a database or if your blobs folder is corrupted.
I would suggest stopping the node, checking your disk for errors and fixing them, then checking and fixing the databases, and after that trying to start the node again. The used-space-filewalker should be enabled on start; it will traverse all pieces in the blobs folder and should update the database.
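
A minimal sketch of the database check itself, assuming the sqlite3 CLI is installed and the databases live in the storage location (the path is an example, adjust it to your setup):

  # stop the node first, then run an integrity check on every database file
  for db in /mnt/storj/storagenode/storage/*.db; do
    echo "$db"; sqlite3 "$db" "PRAGMA integrity_check;"
  done
  # each database should print "ok"; anything else calls for the repair guide linked below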

The disks have 0 errors; they're new SAS disks. 10 nodes on different disks.

What's the command to check the databases, and which database should I check?

If it continues after that, what do I do?

See the docs: there's an article on how to repair malformed databases.
A quick guide I also put here:
https://forum.storj.io/t/database-maintenance-integrity-check-vacuum/25047

Enable the filewalker on start (it's enabled by default, so if you did not disable it, it should be enabled already) and make sure that it has successfully finished for all trusted satellites (search for used-space-filewalker and finished in your logs).
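
A hedged example of that log search, assuming the node writes to a log file as in the original post (for a container you would pipe docker logs instead):

  grep "used-space-filewalker" /path/to/storagenode.log | grep "finished"
  # expect a "finished" line for each trusted satellite
  # docker variant: docker logs storagenode 2>&1 | grep used-space-filewalker | grep finished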

If you see errors during processing of the used-space-filewalker (search for used-space-filewalker and failed|error|ERROR), you would need to disable the lazy mode, restart the node, and then monitor it again.
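
For reference, a sketch of what disabling the lazy mode can look like, assuming a recent node version where the option is named pieces.enable-lazy-filewalker (verify the exact key in your config.yaml or the config documentation):

  # config.yaml
  pieces.enable-lazy-filewalker: false
  # or as a run-time flag, e.g. in the docker run command:
  # --pieces.enable-lazy-filewalker=false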

Hi,

Just wanted to post an update on this. These are old nodes that have been migrated to a new storage solution which does not have the IOPS/latency issues the old storage had. I'm guessing the old nodes never did trash cleanup or never finished the filewalkers, hence all the mismatch problems.

After a few solid days of blobscache errors, everything is now back to normal: all nodes are coming up with corrected data, the gc-filewalker is moving a lot of data to trash, and the filewalkers are finishing successfully without any issues.

So everything is fine and working.

However, I would like to expand on this point: the level of stress placed on disks may not be widely recognized by the general populace (myself included, until I began researching storage, IOPS, and latency issues), who might not understand what is necessary to maintain nodes in an optimal state. Perhaps it would be beneficial to develop more comprehensive guides on this subject. I am skeptical that a large node, with over 15TB on a single disk, could manage the read/write IOPS without experiencing significant latency and congestion. Currently, I am employing extensive caching in both RAM and high-speed enterprise-grade NVMe drives, in addition to writing smaller files to NVMe storage.


Thanks for the follow-up and suggestions. We have a couple of changes coming to help reduce I/O. Hopefully this will reduce the need for SNOs to intervene with configurations like yours.

In my view, your primary emphasis should indeed be on optimizing IOPS and ensuring the Filewalker resumes from where it halted due to a crash.

For us, as Storj operators, coping with such disk stress at the current payout rate is challenging to justify.

Right now I’ve got an 18 TB drive with 17 TB worth of nodes (multiple of them, one of them open for egress). I do use an SSD cache (nothing fancy, 110 GB worth of lvmcache on a consumer SSD, so barely covers file system metadata). The drive itself is bored. iostat says around 30 IOPS. So, I guess, it’s possible. I did spend quite a lot of time refining the setup though, YMMV.
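
For anyone wanting to reproduce that measurement, a quick sketch using iostat from the sysstat package:

  iostat -x 5
  # r/s and w/s together give the IOPS actually hitting each drive; %util shows how busy it is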

I do agree with you that a naïve setup of this size will likely be troublesome.

Yes, it's great that your setup is currently performing well, and the addition of a cache certainly contributes to that efficiency. I'm curious: do you run your databases and operating system on the same disk, or have you separated them? It's likely that your filewalkers aren't running at the moment. Compared to my own setup, which hosts 3 nodes (on that dataset) that have completed their filewalkers, I observe the following:

  • Disk utilization is at 10% (spinning disk).
  • Disk read latency is at 6ms.
  • Disk write latency is at 0.1ms (using NVMe for small files).

However, when the filewalkers activate, disk utilization spikes to 50%, and read latency increases to 10ms. For the average Storj operator, perhaps using a Western Digital Green 5400 RPM disk or similar, this scenario presents a challenge, especially if they haven’t separated their databases and operating system onto SSDs. Running everything on the same disk will inevitably lead to latency issues and node restarts, a common problem discussed in the forums.
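
As an aside, moving the databases off the data drive is a supported setup; a hedged sketch assuming the option is storage2.database-dir and that an SSD is mounted at an example path (check the official docs for the full procedure):

  # config.yaml: keep blobs on the HDD, put the SQLite databases on the SSD
  storage2.database-dir: /mnt/ssd/storagenode-dbs
  # stop the node, copy the existing *.db files to that directory, then start it again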

The box has 2 SSDs. The OS and databases are there on LVM RAID1; the cache is just linear (losing the cache is not a big deal).
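
For the curious, a rough sketch of how an lvmcache layer like that is typically attached, with illustrative names (a volume group vg, an HDD-backed data LV storj, and a spare SSD partition) rather than the poster's actual commands:

  # carve a ~110G cache pool out of the SSD and attach it to the data LV
  lvcreate --type cache-pool -L 110G -n storjcache vg /dev/sdb1
  lvconvert --type cache --cachepool vg/storjcache vg/storj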

File walkers barely touch HDDs. Even before I added SSDs, my setup was capable of file walking at 8 mins/TB of pieces, so it was rare that file walkers from multiple nodes were running together. Now it’s faster.

The HDD itself is approximately of WD Green quality anyway, being a shucked WD Elements drive.

My nodes restart only when I need them to; they don't restart on their own. This includes node upgrades. Staged restarts prevent the initial file walkers from running concurrently.

Yes, it took a bit of time to figure out all the details…


That’s not what I’m observing on my end; the filewalkers consume an incredible amount of read capacity in my setup. Writing, not so much, but I’m running Windows, not Linux, which I assume you might be using? The idea that ext4 is so much more capable of handling read operations than NTFS seems a bit too good to be true to me, but I’m definitely no storage expert. However, the efficiency you’re describing doesn’t match the data I’m seeing.

The garbage collection filewalker (gc-filewalker) runs quite quickly for me, maybe 15-20 minutes for 12 TB, but the total used space calculation takes longer. However, it might only take that long the first time and not on subsequent runs; I'm not sure. It would be great to have some better performance monitoring tools from the Storj devs.
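
In the meantime, the start/finish lines in the log can serve as a crude timer; a sketch assuming the node logs to a file as earlier in the thread (exact message wording can vary between versions):

  grep -E "used-space-filewalker|gc-filewalker" /path/to/storagenode.log | grep -E "started|finished"
  # comparing the "started" and "finished" timestamps per satellite gives a rough duration for each run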

@Knowledge, any comment on this and the discussion above?

NTFS spends 1 KiB for each file on just inode metadata. My ext4 filesystem was formatted with 128-byte inodes, so each 4 KiB block holds 8 times as many inodes. This alone is a huge difference for file walkers, because they need to read all that data. You can find my measurements regarding the impact of ext4 inode size on file walking somewhere on the forum.
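
For context, the inode size is fixed at format time; a minimal sketch of creating such a filesystem (device name is an example, and note that 128-byte inodes give up room for things like extended timestamps and inline xattrs):

  # format with 128-byte inodes instead of the ext4 default of 256 bytes
  mkfs.ext4 -I 128 /dev/sdc1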

In addition to that, ext4 allocates space for inodes at the time of file system creation. This has drawbacks (you can't increase the number of inodes later), but also a benefit: inodes of files from a single directory are placed close to each other, even if the files were created at different times. (Though storage nodes somewhat break this scheme by using a temp directory for writes, so you need to do some more work afterwards to keep it this way.) Fewer seeks mean a faster file walker.

ext4 is not more capable; you can argue it's less capable than NTFS, given things like the lack of file-system-level compression or encryption. But if there is one thing it is better at, it is holding tons of medium-sized files.


Interesting. As I mentioned, I'm not particularly well-versed in storage, but in my case, yes, I'm running NTFS, but that's just the overlay; the underlay is actually ZFS, where the special vdev runs with a 16 KiB record size and the storage data uses 256 KiB records because I have compression enabled on ZFS. Given this setup, wouldn't the impact of NTFS acting as it does as the overlay be mitigated? Or am I thinking about this incorrectly? Thanks for an engaging discussion.
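
A loose sketch of the kind of ZFS properties being described, assuming a pool named tank and a dataset named storj; the actual names and values in the poster's setup are not shown in the thread:

  zfs set recordsize=256K tank/storj          # large records for the piece data
  zfs set special_small_blocks=16K tank/storj # blocks of 16K and below land on the special vdev
  zfs set compression=lz4 tank/storj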

Ah, my bad, you're talking about 1 KiB and bytes there, hehe. Yeah, then never mind what I said.

Hard to say. Block device compression might be helping here indeed in terms of bandwidth, but it may also make random access more costly. If you experience long file walkers, this probably means the latter wins.

Indeed, I’ve just migrated from a suboptimal storage solution to a better one. At that time, all dashboards indicated 0 GB of total space used until the filewalker had finished, which means it had to traverse the entire filesystem. Once that process is complete, when the total used space filewalker runs the next time, does it start from scratch again, or does it proceed in combination with the gc-filewalker? Do you have any insights on this?

Separate things with no connection. Both are called file walkers simply because both call the same subroutine from a source file named filewalker.go. Right now neither keeps progress across restarts, but work on this is being done for both. I disabled the used-space filewalker on my nodes; I don't see the purpose of running it. If there's a bad estimate, I just adjust the allocated space accordingly, looking at the amount of space the file system reports as free.
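
For reference, the startup scan can be switched off with a config option; a sketch assuming a recent node version where it is named storage2.piece-scan-on-startup (verify against your config.yaml):

  # config.yaml
  storage2.piece-scan-on-startup: false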

The node does not actually use these numbers for anything meaningful except to compare them to the allocated size; the node still accepts uploads and downloads even if the estimate is totally wrong, as long as the estimate is below the allocated size. And the payouts are computed by the satellite, which does not care about the correctness of the estimate either. A wrong estimate only results in the node showing wrong figures in its UI.