For the last few days the node has been busy deleting files from trash. But it seems to be doing it quite slowly – about 200–300 files per second.
Looking at how much time is spent in sys_unlink, it’s clear it’s the bottleneck here – taking about 900 ms out of each second:
Counting deletes
Storagenode(s) deleted 219 files per second
Deletions took 911 ms per second
Storagenode(s) deleted 223 files per second
Deletions took 922 ms per second
Storagenode(s) deleted 229 files per second
Deletions took 923 ms per second
dtrace script
#!/usr/sbin/dtrace -qs

BEGIN
{
    printf("Counting deletes\n");
    tm = 0;    /* total time spent in sys_unlink, in microseconds */
    c = 0;     /* number of completed unlinks */
}

/* probe the kernel function itself, to exclude the syscall round-trip */
fbt:kernel:sys_unlink:entry
/execname == "storagenode"/
{
    self->t = timestamp;
}

fbt:kernel:sys_unlink:return
/self->t != 0/
{
    tm += (timestamp - self->t) / 1000;
    ++c;
    self->t = 0;
}

tick-10sec
{
    printf("Storagenode(s) deleted %d files per second\n", c / 10);
    printf("Deletions took %d ms per second\n", tm / 1000 / 10);
    tm = 0;
    c = 0;
}
I can’t upload the interactive flame graph on this forum, but it looks like this:
If I’m reading the flame graph correctly (note, it’s not in chronological order), a lot of time is spent synchronously reclaiming resources.
I’m not sure whether this is a fundamental issue, whether I have something misconfigured, or whether unlinking files can be made more performant; nor whether parallelizing it would help in any way (I’m going to experiment with this).
In the meantime I’m putting this out here in case someone with ZFS experience has some ideas…
It’s like that on Linux and Windows too.
It seems all filesystems struggle with deleting a large number of small files.
I do not have any suggestions for improving that.
We need to stat each piece to include its size in the amount of deleted data, which is then used to update the databases… that’s why it’s so slow.
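As a rough Go sketch of that per-piece accounting (hypothetical names, not the actual storagenode code), each delete is at minimum one metadata read plus one unlink:

package trash

import "os"

// deletePiece removes a piece file and returns how many bytes it freed,
// so the caller can subtract them from the used-space databases.
// Hypothetical sketch, not the actual storagenode code.
func deletePiece(path string) (freed int64, err error) {
    info, err := os.Stat(path) // one metadata read per piece...
    if err != nil {
        return 0, err
    }
    if err := os.Remove(path); err != nil { // ...plus one unlink
        return 0, err
    }
    return info.Size(), nil
}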
We have discussed several ideas internally, like not performing any stat under any condition (but that would require allocating the whole disk/partition), or going with the On a statistical approach for the used space file walker idea (the team is excited about it, by the way; thanks @Toyoo, your ideas are very much appreciated, and we did not forget about your other nice idea, Design draft: a low I/O piece storage!)
No, no – that part is not a problem, it’s fast. Storagenode is running at full speed, consuming 100–120% of a CPU, but 90% of the time is spent inside unlink syscalls. (To eliminate the time of the round-trip to the kernel from the measurement, the probe is set on the kernel unlink function; see above.)
I.e., storagenode would delete much faster if unlink syscalls completed faster. I’m not sure if parallelizing the calls would help – there are 48 cores that are pretty much idle – it would depend on whether the slow bits are under locks and have to be serialized.
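The experiment I have in mind is roughly this: fan the unlinks out over a bounded pool of goroutines and see if throughput scales (a sketch assuming a flat list of paths, using golang.org/x/sync/errgroup):

package trash

import (
    "os"

    "golang.org/x/sync/errgroup"
)

// parallelUnlink issues unlinks from up to `workers` goroutines at once.
// If the slow path in the kernel is serialized under a lock, this will
// not help; if it is synchronous per-call work, it should.
func parallelUnlink(paths []string, workers int) error {
    var g errgroup.Group
    g.SetLimit(workers)
    for _, p := range paths {
        p := p // capture for the closure (needed before Go 1.22)
        g.Go(func() error { return os.Remove(p) })
    }
    return g.Wait()
}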
The main reason we avoided parallelizing the trashing or walking over the pieces is that it can easily end up saturating the disk IOPS, which in turn can make the node worse with regard to uploads and downloads.
This is also the reason that file walking was moved to a separate process, which can run at lower priority.
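Purely to illustrate that trade-off, a throttled delete loop could also cap unlink IOPS explicitly with a token bucket (a sketch using golang.org/x/time/rate; the real node relies on a separate lower-priority process instead, as described above):

package trash

import (
    "context"
    "os"

    "golang.org/x/time/rate"
)

// throttledUnlink deletes paths at no more than opsPerSec unlinks per
// second, leaving IOPS headroom for uploads and downloads. Illustration
// only; not the storagenode's actual approach.
func throttledUnlink(ctx context.Context, paths []string, opsPerSec float64) error {
    lim := rate.NewLimiter(rate.Limit(opsPerSec), 1)
    for _, p := range paths {
        if err := lim.Wait(ctx); err != nil {
            return err
        }
        if err := os.Remove(p); err != nil {
            return err
        }
    }
    return nil
}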
This makes sense, and is the right thing to do. Deleting a few TB per day is good enough; there is no hurry.
I’m just trying to understand why unlink takes so much time even though there are plenty of resources on the system and both the disks and the CPU are nowhere near capacity:
It’s not CPU bound:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
6571 root 57 43 0 2116M 680M uwait 1 292:33 8.20% storagenode
It’s not IO bound (barely 20 IOPS are hitting the disks):