
Ah, okay, I get it now, so that’s clear. But then my question is: when you refer to your filewalker taking 8 minutes per TB, are you specifically talking about the gc-filewalker? Because if that’s the case, then it seems we have similar figures. It takes me about 15-20 minutes for 10-12 TB to complete.

This was indeed the used space file walker. It was the easiest to measure, since I didn’t have to wait for the next bloom filter. I also confirmed it by running du on the storage directory, which performs effectively the same I/O as the file walker.
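If anyone wants to reproduce that kind of measurement, here is a minimal sketch (in Python, not the actual storagenode code) of the metadata I/O the used space file walker effectively does: stat every blob under the storage directory, much like du. The /storage/blobs path is only an example, adjust it to your node.

```python
import os
import time

def walk_used_space(root):
    """Walk all blobs under root and return (file_count, total_bytes)."""
    files, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # a blob may disappear while we walk
            files += 1
            total += st.st_size
    return files, total

if __name__ == "__main__":
    start = time.monotonic()
    files, total = walk_used_space("/storage/blobs")  # example path
    elapsed = time.monotonic() - start
    print(f"{files} files, {total / 1e12:.2f} TB, walked in {elapsed:.0f} s")
```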

Garbage collection is faster now, though: it no longer needs to read all inodes, thanks to this commit: https://review.dev.storj.io/c/storj/storj/+/9423. My measurements were done before this commit, and at that time both types of file walkers took roughly the same amount of time.

2 Likes

Ah, damn, I was hoping you’d say yes :slight_smile: Anyway, it seems like ext4 is the way to go if I’m setting up more nodes in the future. Thanks for a great conversation, but we can both admit that what we’ve discussed now has nothing to do with virtualization, but rather with file systems and storage design.

I’ll take some responsibility now and deepen my knowledge about inodes and sector sizes :slight_smile: Good night from Sweden :sweden:

1 Like

The main factor for speeding up the walkers is the amount of RAM or SSD cache available for the filesystem and its metadata. I don’t have an SSD cache, and the system only uses RAM for caching. Every GB of RAM added brings a dramatic increase in speed for the walkers. I’m on Linux ext4. Of course you need to optimise the drive too: use the native FS for the OS, deactivate indexing and access-time recording, use the native sector size (4 KiB for modern drives), use enterprise-grade drives (bigger and smarter cache, better hardware), use SATA or SAS connections, and stay away from USB.
Check the “Tuning the filewalker” thread for the newest FW tests, and Toyoo’s thread on “tuning ext4 fs for storagenodes”.
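As a quick illustration of the access-time point above, here is a small check, assuming a hypothetical mount point /mnt/storj, that reads /proc/mounts on Linux and warns if noatime is not set on the node’s storage filesystem:

```python
# Assumed mount point for the node's storage; adjust to your setup.
MOUNT_POINT = "/mnt/storj"

with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, options, *_ = line.split()
        if mountpoint == MOUNT_POINT:
            print(f"{device} ({fstype}): {options}")
            if "noatime" not in options.split(","):
                print("warning: access-time updates are still enabled")
            break
    else:
        print(f"{MOUNT_POINT} not found in /proc/mounts")
```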

That’s the culprit: if you used bare-metal NTFS, you would likely never have had any issues. Do not overcomplicate things. If you are forced to use ZFS, then you are already running Linux/FreeBSD, so why bother with a Windows VM guest and all of these layers of complexity (ZFS → vdev → passthrough → NTFS)?

Alexey, to address the points raised and clarify the situation from a factual and professional perspective:

The transition to a ZFS-backed storage system via iSCSI, interfacing with VMware ESXi, and then utilizing VMs formatted with NTFS, represents a strategic and well-considered infrastructure choice. This setup aligns with industry standards for managing scalable, resilient storage solutions.

  • Block vs. File Storage: Utilizing iSCSI facilitates block-level storage access over the network, allowing ESXi and its VMs to treat the storage as though it were locally attached, providing the flexibility and performance necessary for demanding applications. Conversely, file-level storage solutions like NFS or SMB/CIFS serve different use cases, often prioritizing ease of management over raw performance.

  • TrueNAS Core with ZFS and iSCSI: Implementing ZFS through TrueNAS Core offers efficiency through compression. Exposing ZFS storage pools to ESXi via iSCSI allows the VMs to use these pools as raw block devices, which can then be formatted with any filesystem, including NTFS for Windows VMs.

  • Performance and Configuration: While ZFS’s advanced features can enhance data integrity and storage efficiency, performance depends on proper configuration of ZFS pools, cache settings, and the types of vdevs used. Ensuring these configurations are optimized is crucial for maintaining high performance.

  • Compatibility and Administration: The setup does not inherently introduce compatibility issues; VMware handles the block storage communication, making the ZFS filesystem transparent to Windows VMs. This setup necessitates an additional layer of administration, emphasizing the importance of detailed documentation and consistent configuration practices to prevent issues.

In summary, leveraging TrueNAS Core with ZFS to provide iSCSI-based storage to ESXi, where VMs then utilize an NTFS filesystem, is a technically sound and effective approach. It is essential to ensure that the ZFS and iSCSI configurations are finely tuned to meet performance and availability requirements.

VMware’s handling of iSCSI storage abstracts the complexity, allowing Windows VMs to interact with NTFS volumes as if they were local disks, without introducing special problems.

Given the scale at which I operate, with over 60 nodes and over 1 PB of storage, relying solely on bare metal configurations is neither feasible nor efficient for my requirements.

If I would change anything, it would be using ext4 on Linux instead of Windows. But it’s too late now.

1 Like

“Industry standards” is a marketing term with no meaning. Note how “industry standards” vary from big-iron hardware with redundant RAM (courtesy of IBM) to massive, cheap clouds with little reliability. As engineers we should be designing technical solutions based on actual requirements, not on what salesmen of hardware and software tell us. Hardware or virtualization reliability is nice if you need to run a legacy solution that does not tolerate failures. Storj node requirements do not include that kind of resiliency: downtime is fine, and many types of hardware failures are easily recoverable. Yet you pay for that resiliency with reduced performance and increased cost.

This is sort of a personal battle here, sorry. I’ve faced this problem in many professional situations: management, or often even technical people parroting what industry magazines (or reddit!) claimed as “the” way of solving problems. No. Faced with requirements, let’s decide how to build a computing stack to meet those requirements, while optimizing for value brought.

Otherwise we are not engineers.

If you have a TrueNAS core setup already and hosting a node is just a side hustle (which I believe it is for almost all of us here), it’s fine to claim that you are just reusing the same setup for Storj nodes and that reduces your workload, because optimizing your workload is of value here. But it’s not fine to claim that industry standards tell you storage nodes have to be hosted on NTFS on Windows virtualized in TrueNAS on iSCSI, because this is a lie.

My bare metal setup could be scaled to a petabyte if I cared enough to steal hundreds of /24 IP blocks, and I probably wouldn’t notice a bigger workload. I’ve been operating my nodes for 4 years now, and save for power outages (for which I probably should indeed get a UPS…) and a bad memory chip (which would be a problem even for TrueNAS!), I have not experienced a hardware failure so far. The annualized failure rate for the data I store is so far pretty much zero.

1 Like

I agree with @Toyoo: if you already have this complex setup, it’s fine to reuse it for the storage node too.
But if you built it only for Storj (which we do not recommend doing anyway), then it is not a performant but a resource-hungry solution (I mean the performance/cost ratio here). Running the node even as a pod on TrueNAS Scale, using a zfs dataset directly, would be more efficient than the current chain: a zvol (which, in my opinion, effectively disables the cache), then iSCSI as a transport, then NTFS and a Windows VM (high resource consumption) on ESXi (another server). That is just a heavy waste of resources, especially when you know that the node can run even on an OpenWRT router: Running node on OpenWRT router?

Alexey, I’m unsure about your background in infrastructure, but mounting disks over block storage to ESXi is an industry standard. Yes, I regret using Windows as the OS for my nodes, primarily because of NTFS. They consume 300 MHz of CPU and 500 MB of RAM, which, while more than Linux, isn’t disastrously high. The existing environment is already in place. Why on earth would I run a node on TrueNAS/OpenWRT, which isn’t even a hypervisor? You’re off base here.

No, I have never said that. What I’m trying to get at is that block storage over iSCSI (or any other protocol) is an industry standard.

Using Windows and NTFS was indeed a bad move. I should have gone with Docker.

I know that mounting disks over block storage to ANY hypervisor is considered an industry “standard”. There are a lot of “standards”, depending on your vendor.
My background is 35 years in IT (almost all areas), so I have worked with multiple standards over the years, and I know what you are talking about. However, running a node is not a professional job, so you need to implement efficient configurations rather than the beliefs of the industry: there is no requirement to separate the storage from the compute, and there is also no need for high availability or seamless migration in the event of hardware failures. That is nice to have, but not more. It starts from the premise that you are actually invited to run the node on existing, already-online hardware, not to build something exclusive for it.

Then it’s a different story. If you use what you already have, there is no problem.

1 Like