300 fault files in /storage/blobs folder

Running a node on TrueNAS. I noticed that the node stopped growing: it has stayed steady at 3.4 TB out of 12 TB for about a month. Then I noticed that ZFS reported 300 files with permanent errors in the data/storage/blobs folder. I’m running with the full-disk option.

How bad is it to have those 300 faulty files, and could that be the reason my node is not growing anymore?

300 broken files is below the noise level. But you need to figure out why files were allowed to get corrupted in the first place. Do you not have healing?

No, it can’t be the reason. Growth and shrinkage are dictated by customer activity and your node’s performance. You can search for “success rate” on this forum to find a script to check that you don’t have an excessive number of lost races, which may point to high latency somewhere in your network or storage path.
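For reference, a rough sketch of what those “success rate” scripts measure: the share of uploads your node wins versus loses. The log path and the exact phrases (“uploaded”, “upload canceled”) are assumptions about the storagenode log format; adjust them, or feed the script `docker logs storagenode 2>&1` instead.

```shell
LOG="${LOG:-/var/log/storagenode.log}"   # hypothetical path; adjust to your setup
[ -r "$LOG" ] && awk '
  /uploaded/        { ok++ }     # transfers this node completed
  /upload canceled/ { lost++ }   # races lost to faster nodes
  END {
    total = ok + lost
    if (total > 0)
      printf "upload success rate: %.1f%% (%d of %d)\n", 100 * ok / total, ok, total
  }' "$LOG"
```

A success rate that sits well below your neighbors’ is the symptom of the latency problem described above.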

Also, do you still have free space, and is the node updated to at least the current minimum version?
You may use a multinode dashboard, it shows the free space, which the node would report to the satellites.
The current minimum version you can see there: https://version.storj.io/

Hi,

I seem to be having the same issue; the last scrub task reported that I have ~400 checksum errors. I am also running Storj as a TrueNAS app, but I am pretty new to Storj. Is there something I should do to get rid of these errors?

The drives themselves do not seem to be failing, all SMART tests come back fine.

I think the only way is to remove the affected files and then reset the errors in the terminal with the zpool clear command.
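A hedged sketch of that cleanup flow (the pool name “tank” is a placeholder; note that ZFS has no “clean” subcommand, the error-reset command is `zpool clear`):

```shell
zpool status -v tank   # lists the files with permanent errors
# Delete (or restore from backup) each file listed, then:
zpool scrub tank       # re-verify the whole pool
zpool clear tank       # reset the pool's error counters
```

Keep in mind that deleted blobs are customer pieces, so the node will lose any audits against them; only remove files that are already unreadable.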

Running v1.126.2.
I have 6.5 TB of free space. No Multinode Dashboard, but Prometheus - good enough?
Switched off full-disk mode - will observe now how it goes. The disk now has almost 100% usage for hours and hours, crazy.

It’s likely a file walker. You either need more RAM, or L2ARC, or a special device: accelerating access to (a lot of) metadata makes all the difference in the viability of the node. Your HDD shall be seeing 20-40 IOPS from the node; all the metadata-related IO (which can reach several thousand IOPS) shall be served from SSD.
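A hedged sketch of the special-device route (pool, dataset, and device names are placeholders; a special vdev is pool-critical, so it must be mirrored):

```shell
# Add a mirrored special vdev for metadata. Losing this vdev loses the
# entire pool, hence the mirror of two SSDs.
zpool add tank special mirror /dev/sdb /dev/sdc

# Optionally also store small files (blocks <= 16K) on the SSDs:
zfs set special_small_blocks=16K tank/storagenode
```

Only blocks written after the change land on the special vdev; existing metadata migrates as data is rewritten.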

The corruption, however, is still an issue that needs addressing. I would start by replacing the SATA cables (buy some used ones pulled from enterprise equipment; otherwise I found no way for an average person to source them without ending up with utter junk, regardless of price).


Thanks for the advice, I might consider it in the future. For now I’m waiting for the startup scan to complete. I was wrong that the node is not growing - technically it is, but by a very small amount: from 3.28 TB at the beginning of the month to 3.46 TB today. I have a 250 Mbit internet connection, both uplink and downlink.

DON’T. USE. ZFS.

That’ll save your node.

Seems that everything is fine with the node; it’s just that the network is not growing at the moment. I will keep it running for a while longer.

Thank you for the advice. In my case, I think the problem dates from the moment when the server froze and I had to hard-reset it.

Horrible advice. Does it stem from ignorance or malice?


That’s a problem. Why did it freeze? Configure it to capture a kernel core dump and reboot next time, so you can investigate.
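On FreeBSD-based TrueNAS CORE, that configuration can be sketched roughly as below (TrueNAS SCALE is Linux-based and would use kdump instead; the paths are defaults, not requirements):

```shell
sysrc dumpdev=AUTO         # write a kernel dump to swap on panic
sysrc dumpdir=/var/crash   # where savecore stores recovered dumps
# After the next reboot, savecore(8) extracts the dump into /var/crash,
# where it can be examined with kgdb.
```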

You seem to have hardware issues; please fix them first. A hard reset shall not result in data corruption either. But if you have, say, bad RAM - it could result in both the freezes and the corruption.

Run a memory test (like memtest86). Replace all SATA cables, as suggested above. Maybe even replace the power supply.

It does not matter. Node traffic is much lower. But your node can still be deprioritized if it is slow, and you will be getting progressively less data. The advice above stands - accelerate your metadata access. You can even send small files to the special device. You have the best filesystem there is for this use case - use its capabilities to your advantage. Optimizing for storagenode will also accelerate your other workloads significantly.

Will dive deeper into that, thank you. The goal of the topic was not to get troubleshooting advice but to understand whether the filling of the node has stopped.
Regarding the L2ARC for metadata - I ran it for a while, and it wore out my consumer-grade SSD within months. So you have to use an enterprise-grade SSD, which is quite pricey, and buying one removes any potential profit. In this regard, yes - ZFS might not be the right file system, because if you want to do it right, you do it expensively…


100% disagree.

  • used enterprise SSDs on eBay are very cheap. You can get a 100 GB enterprise SSD with PLP for $10. It will be enough for metadata. If you get a larger one for slightly more money - you can fit small files too. Your users will benefit from that as well; if they wouldn’t, I don’t know what the purpose of your server is.
  • L2ARC is less helpful than a special device, and if you have that much write traffic on L2ARC, it means it’s too small for your use case. Writes to L2ARC are rate-limited anyway. There are many other things you could optimize; search this forum for “zfs”. Premature wear is also likely caused by a wrong ashift.
  • don’t expect any profit. Storj helps to offset the costs of running the server by making use of already-online, underutilized storage. If running the hardware were profitable - why would Storj need you? They would run datacenters themselves.
  • yes, you have to do it right or not do it at all. If your current setup works fine for your use case but does not with Storj - don’t run Storj on it. But I doubt your users are happy with your array without any acceleration, and with hardware issues that cause hangs. So my advice was to fix it for your users; as a side effect, not as a primary goal, it makes it great for Storj too. Storj’s load is very minor; any well-configured hardware with ZFS shall be able to handle it effortlessly.
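On the ashift point: a quick way to check what a pool was created with (pool name is a placeholder; ashift is fixed at vdev creation, so a wrong value means recreating the vdev):

```shell
zpool get ashift tank    # 12 = 4K sectors; 9 = 512B, often wrong for modern SSDs
# On older OpenZFS versions without the pool property:
zdb -C tank | grep ashift
```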

It’s from experience. Trust me. Keep your setup as simple as possible, as “default” as possible. Don’t try to use ZFS if you don’t have hardcore hardware; you’ll regret it.

Respectfully, I don’t trust you, and my experience differs drastically.

If you want, I can offer some guidance on configuring your FreeBSD/ZFS system properly, in addition to what was already posted on this forum, but claiming that ZFS somehow requires “hardcore hardware”, whatever that means, is at the least disingenuous. Obviously, we are not talking about a 512 MB RAM ARM SoC here; if you have that - sure, don’t use ZFS. Simpler filesystems (ext4, xfs, fat, etc.) will work better on such systems, and you will be limited to only light workloads anyway, so the “performance” discussion is moot.

But once you move to something more modern, you will find ZFS scales much better, and those older filesystems become a bottleneck. For example, a Raspberry Pi 4 with 8 GB of RAM, an HDD for storage, and a small SSD as a special device can be made into a very lightweight and snappy home server.

I agree. Instead of concocting complex solutions on top of legacy filesystems, such as volume managers and add-on cache layers, use ZFS, which gives you a direct way to accelerate metadata access, is simple to configure, and works well in its default configuration out of the box.

And this is even before we discuss snapshots, replication, healing, and other perks modern filesystems provide. Remember, you are not building a Storj node; you are letting Storj use space on your allegedly already-running home server. And if your home server is not running ZFS right now - you are doing it wrong, a disservice to yourself and your users. Heck, even Unraid caved and added ZFS support. Think about it.
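To illustrate the perks mentioned (pool, dataset, and host names below are placeholders):

```shell
zfs snapshot tank/home@nightly        # instant point-in-time snapshot
zfs send -i @last tank/home@nightly | ssh backuphost zfs recv pool/home
                                      # incremental replication to another machine
zpool scrub tank                      # healing: verify checksums, repair from redundancy
```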

For my understanding - metadata caching would make the filewalker finish quicker, is that it?

The rest of the time my HDD has low IOPS, so I don’t think it’s getting bottlenecked anywhere while the filewalker is not running.

A customer requests a piece. Your node scrambles to find the location of that piece - a 20-30 ms seek. Then it fetches the data from disk - another 20-30 ms delay. The customer cancels the transfer because they have already received that piece from another node that did not waste time on the metadata fetch (see long-tail cancellation). Your node did not get paid for the egress.

The same scenario happens when pieces are uploaded: transfers get cancelled, and your node does not get to store the piece.

Over time, the satellite notices your node is slower to respond than its neighbors and sends you less data.

Forget Storj - your server can be much more responsive with fast access to metadata. Why would you not want that for your users, especially when the solution is a $10 SSD away?


Can you suggest some specific models? I cannot find anything at this kind of price.