Need some tricks to survive with BTRFS

My node is offline and I need some urgent help to get it working again.

Last Thursday my node began thrashing. I have seen this behaviour before when there is significant free space and the system tries to service many simultaneous upload requests (sometimes more than 23), managing to finish only very few successfully and ending most of the uploads with “upload canceled (race lost or node shutdown)”. This is a very inefficient mode of operation, as very few uploads are actually committed while disk IO is very high. FYI, some bad-looking statistics below (the numbers are counts of log lines):

     372 INFO    piecestore      upload canceled
   47356 INFO    piecestore      upload canceled (race lost or node shutdown)
   66487 INFO    piecestore      upload started
   17526 INFO    piecestore      uploaded

I am fully aware that this is likely due to the BTRFS file system I am using. Hopefully, some day I will find a great deal on an EXOS 20TB drive, format it with EXT4 and dedicate it to my node. Until then, I am stuck with BTRFS.

Question 1: Can I set some parameter to limit the number of uploads at any time?

The problem really started when, for some reason, the node restarted and the insane file walker kicked in. The file walker causes high read IO which, when it coincides with the aforementioned thrashing mode, puts major stress on the system. I have the feeling that write operations are queued in memory because the disk is serving read IO most of the time, causing memory to fill up until the system starts to swap. At this point my NAS (Synology DS720+) becomes unresponsive and I cannot even log in with SSH. After some time, something panics, the container is killed and restarted, and another cycle of the same begins.

I let this go on for some time until the node finally did not come back online :-(. The last words were:

2023-10-22T06:33:54Z    INFO    piecestore      download started        {"process": "storagenode", "Piece ID": "OB2V4XME7HYBZ7LFLDBUXRPBCQY6GPSFC5TNFQAZ45XLWUMIU2DA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 6912, "Remote Address": "172.17.0.1:55164"}
2023-10-22T06:33:58Z    ERROR   piecestore      download failed {"process": "storagenode", "Piece ID": "P4TMAT7ZDKMDIJCZAN3FYDLZHWVOYWU2UJVOK4XXUHYHZVCBCARA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Remote Address": "172.17.0.1:55144", "error": "order created too long ago: OrderCreation 2023-10-22 05:00:12.052186436 +0000 UTC < SystemClock 2023-10-22 06:33:51.18744009 +0000 UTC m=+12311.012677805", "errorVerbose": "order created too long ago: OrderCreation 2023-10-22 05:00:12.052186436 +0000 UTC < SystemClock 2023-10-22 06:33:51.18744009 +0000 UTC m=+12311.012677805\n\tstorj.io/common/rpc/rpcstatus.Errorf:89\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:45\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:616\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-10-22T06:33:56Z    ERROR   contact:service ping satellite failed   {"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "attempts": 2, "error": "ping satellite: failed to ping storage node, your node indicated error code: 4, rpc: tcp connector failed: rpc: context deadline exceeded", "errorVerbose": "ping satellite: failed to ping storage node, your node indicated error code: 4, rpc: tcp connector failed: rpc: context deadline exceeded\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:209\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:157\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-10-22T06:34:45Z    ERROR   piecestore      download failed {"process": "storagenode", "Piece ID": "OB2V4XME7HYBZ7LFLDBUXRPBCQY6GPSFC5TNFQAZ45XLWUMIU2DA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Remote Address": "172.17.0.1:55164", "error": "order created too long ago: OrderCreation 2023-10-22 05:00:12.978152262 +0000 UTC < SystemClock 2023-10-22 06:34:21.775732359 +0000 UTC m=+12341.600970074", "errorVerbose": "order created too long ago: OrderCreation 2023-10-22 05:00:12.978152262 +0000 UTC < SystemClock 2023-10-22 06:34:21.775732359 +0000 UTC m=+12341.600970074\n\tstorj.io/common/rpc/rpcstatus.Errorf:89\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:45\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:616\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}
2023-10-22T06:35:47Z    ERROR   piecestore      upload failed   {"process": "storagenode", "Piece ID": "WBHA4C4OO24DIRVJHDZ2X3RQOK3MP2NBJI47NYGXDLAAJHAUAGCQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "order: grace period passed for order limit", "errorVerbose": "order: grace period passed for order limit\n\tstorj.io/storj/storagenode/orders.(*FileStore).BeginEnqueue:86\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).beginSaveOrder:872\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:407\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35", "Size": 0, "Remote Address": "172.17.0.1:57802"}
2023-10-22T07:33:38Z    INFO    Configuration loaded    {"process": "storagenode", "Location": "/app/config/config.yaml"}
2023-10-22T07:33:38Z    INFO    Anonymized tracing enabled      {"process": "storagenode"}
2023-10-22T07:33:38Z    INFO    Operator email  {"process": "storagenode", "Address": "marc.vanwijk@gmail.com"}
2023-10-22T07:33:38Z    INFO    Operator wallet {"process": "storagenode", "Address": "0x82ECd6c0EF426Bb65094D4A1b65B2Be275E2F3a7"}
2023-10-22T07:33:40Z    INFO    Got a signal from the OS: "terminated"  {"process": "storagenode"}
2023-10-22T07:33:40Z    INFO    server  kernel support for tcp fast open unknown        {"process": "storagenode"}
2023-10-22T07:33:40Z    INFO    Telemetry enabled       {"process": "storagenode", "instance ID": "1v1o2XhkVMq7tzMCLoRVKBHVVW8LxniHfP26Jz6mgy39QjdS41"}
2023-10-22T07:33:40Z    INFO    Event collection enabled        {"process": "storagenode", "instance ID": "1v1o2XhkVMq7tzMCLoRVKBHVVW8LxniHfP26Jz6mgy39QjdS41"}
2023-10-22T07:33:41Z    FATAL   Unrecoverable error     {"process": "storagenode", "error": "Error creating tables for master database on storagenode: migrate: database: info opening file \"config/storage/info.db\" failed: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:347\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Migration.func1:657\n\tstorj.io/storj/private/migrate.(*Migration).Run:164\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:370\n\tmain.cmdRun:95\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:250", "errorVerbose": "Error creating tables for master database on storagenode: migrate: database: info opening file \"config/storage/info.db\" failed: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:347\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).Migration.func1:657\n\tstorj.io/storj/private/migrate.(*Migration).Run:164\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).MigrateToLatest:370\n\tmain.cmdRun:95\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:250\n\tmain.cmdRun:97\n\tmain.newRunCmd.func1:32\n\tstorj.io/private/process.cleanup.func1.4:402\n\tstorj.io/private/process.cleanup.func1:420\n\tgithub.com/spf13/cobra.(*Command).execute:852\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:960\n\tgithub.com/spf13/cobra.(*Command).Execute:897\n\tstorj.io/private/process.ExecWithCustomOptions:113\n\tmain.main:30\n\truntime.main:250"}

Question 2: From another post I understood I could start the node without the databases, let the node create them, stop the node and copy piece_spaced_used.db back, then restart the node again. Is this accurate? If not, what would be the way to repair this?
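For reference, here is roughly what I understand that procedure to be as commands (only a sketch, assuming a Docker setup with the data at /volume1/storj and a container named storagenode; both are placeholders):

    # stop the node and move the existing databases out of the way
    docker stop -t 300 storagenode
    mkdir -p /volume1/storj/db-backup
    mv /volume1/storj/storage/*.db /volume1/storj/db-backup/

    # start the node so it creates fresh, empty databases, then stop it again
    docker start storagenode
    docker stop -t 300 storagenode

    # put the used-space database back and restart
    cp -p /volume1/storj/db-backup/piece_spaced_used.db /volume1/storj/storage/
    docker start storagenode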

This will not be sufficient though, as it will land me back in the thrashing file walker scenario. So I was thinking to turn off the file walker, let the node fill up the free space, stop the node, start it again with the file walker enabled, let the file walker complete (more than a full day), then stop the node and restart it without the file walker. I would then repeat this every month or so. This brings me to:

Question 3: How can I completely disable the file walker?

Your help is much appreciated.

Marc

How is the array itself configured? RAID5 or RAID6?

How is the subvolume that contains the node data configured? You want to at least disable access time updates (a mount-time parameter) and disable data (and metadata!) checksums. That can be applied retroactively, but Synology does not support it, so you might want to create a new subvolume (share) with the correct settings and then btrfs send | btrfs receive your node data there. The node can continue running while you do so; btrfs supports sending differential snapshots, just like ZFS. See my earlier post on how to do that with ZFS; the same approach applies.
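A rough sketch of that migration (the share names /volume1/storj and /volume1/storj_new are placeholders, and the exact paths may need adjusting for DSM):

    # take a read-only snapshot of the current node data and send it to the new share
    btrfs subvolume snapshot -r /volume1/storj /volume1/storj_snap1
    btrfs send /volume1/storj_snap1 | btrfs receive /volume1/storj_new

    # later, stop the node briefly, snapshot again and send only the differences
    btrfs subvolume snapshot -r /volume1/storj /volume1/storj_snap2
    btrfs send -p /volume1/storj_snap1 /volume1/storj_snap2 | btrfs receive /volume1/storj_new

Note that btrfs receive places the received data as a subvolume inside the target directory, so the node's paths have to be adjusted accordingly before pointing it at the new share.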

To further reduce disk IO you can add a read-only SSD cache, large enough to fit the metadata and then some.

I would not. BTRFS is far superior to ext4; you just need to configure it differently for the storage node use case. The default configuration is tailored for data protection at a great expense of performance. There are quite a few posts on this forum with advice on configuring ZFS; most of that is general and applies to BTRFS too.

Enabling the lazy file walker should also help.

Thanks Arrogantrabbit!

I just had some TB spare on my 6TB drive, so I built my node on that: a single 6TB SHR volume without any redundancy, shared with all my other data and using stock-standard parameters, so I guess access time updates and checksums are all enabled. Thanks for the hint, I will have to look at that.

For my humble 2TB node, BTRFS is good enough under normal operation, even with the standard parameters. Normally I see around 30% usage when the node is full. Garbage collection drives this up to 100% for a few hours, but that is manageable. The filewalker used to take 30 hours, and now with the lazy filewalker (it enabled itself some time ago) it takes a lot longer. The problem is when the filewalker and heavy uploads coincide.

I looked at SSDs too; Synology allows you to “pin” metadata on SSD, which looks like a compelling option. I would be interested to see how much this would accelerate the filewalker and garbage collector.

Any suggestions about disabling the filewalker and fixing the databases?

Best regards, Marc


I think this already enabled itself:

2023-10-14T22:03:24Z    INFO    lazyfilewalker.gc-filewalker    starting subprocess     {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-10-14T22:03:24Z    INFO    lazyfilewalker.gc-filewalker    subprocess started      {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-10-14T22:03:25Z    INFO    lazyfilewalker.gc-filewalker.subprocess Database started        {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode"}
2023-10-14T22:03:25Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode", "createdBefore": "2023-10-10T17:59:59Z", "bloomFilterSize": 647948}
2023-10-14T22:25:24Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "piecesCount": 1096414, "piecesSkippedCount": 0, "process": "storagenode"}
2023-10-14T22:25:27Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}

But even the lazy filewalker combined with the massive parallel uploads will thrash the system. I can set the node size to 1TB to limit uploads, which will probably get me through the filewalker, but when I change the node size back to 2TB, the filewalker will run again.

I need to turn that thing off.

Best regards, Marc

Perfect. An SSD cache will solve this. The filewalker just reads metadata, so if that is served from the SSD there will be no additional IO to the pool. You will likely see the filewalker become CPU limited, sending 4-5k IOPS to the SSD for a few minutes.

Nice!

My experience with Synology ends before DSM 7, so I don’t know: is the SSD cache persistent? If so, I would keep the filewalker enabled; it will force the cache to populate once, and then everything will be fast from there on.

If the read-only cache is not enough, you can look at a read/write cache later too, to buffer writes to some degree.

Basically the whole idea is to reduce the IOPS hitting the disks. The budget is very tight, about 200 IOPS total per disk, and that includes both reads and writes. Disabling checksums and atime updates does precisely that: it gets rid of the extra IO associated with validating and updating that extra metadata.

Having a few GB of unused RAM helps too: the filesystem cache lives there and helps offload some IO as well.

Thanks for the good ideas for the long term.

First I have to get my node back online. I was just able to start the node without fatal errors, but it immediately goes into thrashing mode:

     86 ERROR   piecestore      upload failed
    129 INFO    piecestore      upload canceled
   5341 INFO    piecestore      upload canceled (race lost or node shutdown)
      6 INFO    piecestore      uploaded
   6053 INFO    piecestore      upload started

6053 uploads started, only 6 successfully committed. This is after disabling “Record File Access Time”. I cannot change the checksum settings, as that requires rebuilding the volume from scratch.

When I pick some day in the past and look at the same counts, I see a bad, but infinitely better, rate:

     43 ERROR   piecestore      upload failed
    146 INFO    piecestore      upload canceled
  52910 INFO    piecestore      upload canceled (race lost or node shutdown)
  94044 INFO    piecestore      upload started
  40980 INFO    piecestore      uploaded

The difference is the lazy filewalker doing its thing. I need to get rid of it. Being able to reject excessive incoming parallel upload connections would be nice too.

Brgds, Marc

Did you try defragmentation?

You can set max-concurrent-requests to limit the number (but not the throughput) of concurrent connections. It is not a perfect solution, but as a temporary measure it might be good enough.
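As a minimal sketch, assuming the data directory is /volume1/storj and the container is named storagenode, the option can be set in config.yaml and the container restarted:

    # default is 0 (unlimited); 20 is only an illustrative value, not a recommendation
    echo 'storage2.max-concurrent-requests: 20' >> /volume1/storj/config.yaml
    docker restart storagenode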

You can disable the filewalker on start with this option:

$ storagenode setup --help | grep scan

      --storage2.piece-scan-on-startup                           if set to true, all pieces disk usage is recalculated on startup (default true)
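For example, with a Docker node the flag can be appended after the image name in the existing run command (illustrative only; everything before the image name stays exactly as you already run it):

    # disable the startup piece scan; [your existing mounts and options] is a placeholder
    docker run -d --name storagenode [your existing mounts and options] \
        storjlabs/storagenode:latest --storage2.piece-scan-on-startup=false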

see

I don’t see what that thread illustrates. Are you implying that it is evidence that BTRFS is 10x slower than ext4? How does that make sense?

BTRFS defaults to data protection. This consumes extra IO. If data protection is not needed, the relevant features, such as data/metadata DUP, checksumming and CoW, can be turned off to get the associated performance back. However, the remaining features, such as SSD caching, differential snapshots and better scalability, are still an improvement over ext4 and are relevant to storage node operators. Those features, by the way, can be set with per-file granularity, which is quite impressive.
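As an illustration of the per-file granularity (a sketch; the path is a placeholder, and whether DSM exposes this is worth checking): new files in a directory can be created without CoW, and therefore without data checksums, by setting the NOCOW attribute on that directory:

    # new files created under blobs/ will inherit NOCOW; existing files are not converted
    chattr +C /volume1/storj/storage/blobs
    # verify the attribute on the directory
    lsattr -d /volume1/storj/storage/blobs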

In other words, I don’t buy the 20 IOPS figure presented in that thread as “that’s the best you can get with btrfs”. It makes no sense. (When I used Synology, running Minio, I was definitely getting adequate performance, definitely more than 20 IOPS from the array in mixed use, but this is anecdotal and beside the point.)

Instead of curing the headache with the guillotine, I would have spent more time finding and fixing the bottleneck. I don’t know what was wrong with the configuration there, but swapping a modern FS for an ancient one does not yield an order-of-magnitude performance increase unless there are some glaring underlying mismatches. The whole point of BTRFS is to fix the shortcomings of EXT4. It would be really silly if it were not faster, let alone dramatically slower, in a comparable use case.

One possibility here could be a sub-optimal array configuration coupled with a memory-constrained host, along with the observation that on small datasets ext4 will likely win due to its smaller memory requirements, while btrfs/zfs would choke. But that is not a reason to recommend ext4 going forward on anything but IoT hardware. Node sizes only increase, as does traffic, and modern filesystems are significantly more scalable and overtake ext4 very quickly. If you are running a node on a Raspberry Pi, yeah, it will choke anyway, so use whatever.

I have also read opinions on this forum that ZFS is just as inadequate for a storage node, and that baffled me just the same. I observe the complete opposite: performance is great, I don’t notice the node running at all, and the server hosts quite a bit of other disk-intensive stuff. But yes, some tweaking was required (nothing extraordinary that is not described in the manual); the default configuration prioritizes data safety, and that comes at a price. I can hardly blame that design choice; it would be weird if it were any other way. These filesystems are versatile and powerful, and with power comes responsibility (to configure them for your needs, until filesystems come with built-in AI to tweak everything automatically).

Unfortunately, SNOs here on the forum just state what they experienced using BTRFS.
I can only share what other Community members have shared.

You may just check all Topics tagged btrfs to see that.

My simple tests do show that BTRFS is slower than ext4, especially for a storage node, but you just need to test it yourself.


Hi Daki82,

I never defragged the disk; that would take ages, and I fear what would happen if it coincides with a file walker. Thanks for the “max-concurrent-requests” hint, I will look into that.

Brgds, Marc

Thanks! This will be very useful if I have to restart my node for maintenance without having to go through the pain of the file walker.

I have actually found a workaround: I closed port 28967 on my router, which blocks all up- and downloads while the file walker completes its task. Once it is done, I will open the port again and hopefully my node will resume normal operation.

Brgds, Marc

Some short notes from the top of my head, as I don’t have time to elaborate.

  • The -t 300 thread was about a default non-Synology setup on a single HDD. As you cheerfully agreed, it’s not optimized for speed, having duplicated metadata and such.
  • To support all these fancy new features, btrfs metadata is, by necessity, larger. This results in bigger RAM usage for caching metadata, and as such, when you reach the threshold where you can no longer fit metadata in the RAM cache, you fall off a performance cliff. I don’t know what the RAM-to-storage ratio is in the OP’s case, nor do I know where that threshold lies, sorry.
  • There is no SSD caching in kernel btrfs. There is in Synology’s implementation, probably based, to some degree, on device mapper/LVM. I did not use Synology in that thread. This is still a choice to OP though.
  • It really doesn’t help that OP runs it on SHR. Parity raids are slow to write. If this was btrfs’ implementation of a parity scheme, it would be faster—but we know btrfs parity modes cannot be trusted.
  • btrfs is crazy slow on synchronous writes, and there is no SLOG to help it, like with ZFS. I measured it; a quick way to reproduce that kind of measurement is sketched after this list. There is a pending experiment on disabling some synchronous writes in storage nodes.
  • File system design is full of trade-offs. You can support more features, but if these aren’t used by software (like storage nodes), you will still pay their price. ext4 is optimized for simple fs operations with things like collocated inode tables. This is a feature btrfs does not have, so you can’t claim btrfs is strictly better than ext4.
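An illustrative way to compare synchronous write behaviour between filesystems (not the exact benchmark referenced above; the directory is a placeholder) is a small fio job that fsyncs after every write:

    # 4k random writes with an fsync after each write, run against the filesystem under test
    fio --name=syncwrite --directory=/volume1/test \
        --rw=randwrite --bs=4k --size=256m \
        --ioengine=psync --fsync=1 --numjobs=1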

Judging from NTFS, defrag will run in the background at low priority, without harming any data.
But IO overload is IO overload.

How much data does the node have? Is the 10% leftover still free?

I temporarily set mine to full when I defrag them while running, so they only perform reads and deletes. That took about a week for 6TB of data.

Maybe you can offload the databases to another physical drive.

I only know how to do it on Windows.

Even some USB sticks are enough for that. Mine have been running on a Samsung BAR Plus USB stick (Type-A, 128 GB) since August, because I did not want to bother the system SSD.
(Just format one with a suitable filesystem and a low cluster size: for me NTFS with 4k clusters, for you ext4.)
Even a smaller one will do temporarily. You can move them back later if you want.
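On Linux the formatting and relocation could look roughly like this (a sketch only; /dev/sdb1, the mount point and the use of the storage2.database-dir option are assumptions, and for a Docker node the new location would also need to be mounted into the container, with the config pointing at the in-container path):

    # format the stick with ext4 and 4k blocks, then mount it
    mkfs.ext4 -b 4096 /dev/sdb1
    mkdir -p /volume1/storj-db
    mount /dev/sdb1 /volume1/storj-db

    # with the node stopped, copy the existing databases over
    cp -p /volume1/storj/storage/*.db /volume1/storj-db/

    # point the node at the new location via config.yaml and start it again
    echo 'storage2.database-dir: /volume1/storj-db' >> /volume1/storj/config.yaml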

Today I saw a 128GB SSD for 9.49€ (it was not an option for me since I have a BRIX).

That will reduce your online score. Maybe not the best idea.


Hi Daki82,

The node is only 2TB and I have several TB left on the volume. I do have another drive in my NAS, but it is very old and I don’t think it will pay off moving the databases to that one.

Maybe I will expand my NAS with an additional 16GB memory or 1TB SSD.

Best regards, Marc

Hi Littleskunk,

That is the hit I was willing to take. The filewalker finished and the node is stable again. I will recover from this.

Best regards, Marc

How old? (I have an SSD from 2009 in the basement… 256GB.)
Even if it is only on par with the node drive, it will reduce seek time on the node drive.
Of course a (256GB+) SSD/M.2 is perfect for the job, holding the databases and orders.
Even a USB stick will do for the DBs if correctly formatted.

If the NAS supports some kind of read caching, the SSD would be nice.

With a UPS, the RAM upgrade would be my second choice, but only if the RAM is full to the brim.