For the last few days the node has been busy deleting files from trash. But it seems to be doing it quite slowly – about 200–300 files per second.
Looking at how much time is spent in sys_unlink, it’s clear it’s the bottleneck here – taking about 900 ms out of each second:
Counting deletes
Storagenode(s) deleted 219 files per second
Deletions took 911 ms per second
Storagenode(s) deleted 223 files per second
Deletions took 922 ms per second
Storagenode(s) deleted 229 files per second
Deletions took 923 ms per second
dtrace script
#!/usr/sbin/dtrace -qs

BEGIN
{
    printf("Counting deletes\n");
    tm = 0;    /* total time spent in sys_unlink, in microseconds */
    c = 0;     /* number of completed unlinks */
}

/* probe the kernel function itself, to exclude the syscall round-trip */
fbt:kernel:sys_unlink:entry
/execname == "storagenode"/
{
    self->t = timestamp;
}

fbt:kernel:sys_unlink:return
/self->t != 0/
{
    tm += (timestamp - self->t) / 1000;
    ++c;
    self->t = 0;
}

tick-10sec
{
    printf("Storagenode(s) deleted %d files per second\n", c / 10);
    printf("Deletions took %d ms per second\n", tm / 1000 / 10);
    tm = 0;
    c = 0;
}
I can’t upload the interactive flame graph on this forum, but it looks like this:
If I’m reading the flame graph correctly (note, it’s not in chronological order), a lot of time is spent synchronously reclaiming resources.
I’m not sure whether this is a fundamental issue, whether I have something misconfigured, or whether unlinking files can be made more performant; nor whether parallelizing it would help in any way (I’m going to experiment with this).
In the meantime I’m putting this out here in case someone with ZFS experience has some ideas…
It’s like that on Linux and Windows too.
It seems all filesystems struggle with deleting a large number of small files.
I do not have any suggestions for improving that.
We need to stat each piece to include its size in the amount of deleted data, which is then used to update the databases… that’s why it’s so slow.
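As a rough Go sketch of that per-piece accounting (hypothetical names, not the actual storagenode code), each delete is at minimum one metadata read plus one unlink:

package trash

import "os"

// deletePiece removes a piece file and returns how many bytes it freed,
// so the caller can subtract them from the used-space databases.
// Hypothetical sketch, not the actual storagenode code.
func deletePiece(path string) (freed int64, err error) {
    info, err := os.Stat(path) // one metadata read per piece...
    if err != nil {
        return 0, err
    }
    if err := os.Remove(path); err != nil { // ...plus one unlink
        return 0, err
    }
    return info.Size(), nil
}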
We have discussed several ideas internally, like not performing any stat under any condition (but that would require allocating the whole disk/partition), or going with the On a statistical approach for the used space file walker idea (the team is excited about it, by the way; thanks @Toyoo, your ideas are very much appreciated, and we did not forget about your other nice idea, Design draft: a low I/O piece storage!)
No, no – that part is not a problem, it’s fast. Storagenode is running at full speed, consuming 100–120% of a CPU, but 90% of the time is spent inside unlink syscalls. (To eliminate the time of the round-trip to the kernel from the measurement, the probe is set on the kernel unlink function; see above.)
I.e., storagenode would delete much faster if unlink syscalls completed faster. I’m not sure if parallelizing the calls would help – there are 48 cores that are pretty much idle – it would depend on whether the slow bits are under locks and have to be serialized.
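The experiment I have in mind is roughly this: fan the unlinks out over a bounded pool of goroutines and see if throughput scales (a sketch assuming a flat list of paths, using golang.org/x/sync/errgroup):

package trash

import (
    "os"

    "golang.org/x/sync/errgroup"
)

// parallelUnlink issues unlinks from up to `workers` goroutines at once.
// If the slow path in the kernel is serialized under a lock, this will
// not help; if it is synchronous per-call work, it should.
func parallelUnlink(paths []string, workers int) error {
    var g errgroup.Group
    g.SetLimit(workers)
    for _, p := range paths {
        p := p // capture for the closure (needed before Go 1.22)
        g.Go(func() error { return os.Remove(p) })
    }
    return g.Wait()
}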
The main reason we avoided parallelizing the trashing or walking over the pieces is that it can easily end up saturating the disk IOPS, which in turn can make the node worse with regard to uploads and downloads.
This is also the reason that file walking was moved to a separate process, which can run at lower priority.
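Purely to illustrate that trade-off, a throttled delete loop could also cap unlink IOPS explicitly with a token bucket (a sketch using golang.org/x/time/rate; the real node relies on a separate lower-priority process instead, as described above):

package trash

import (
    "context"
    "os"

    "golang.org/x/time/rate"
)

// throttledUnlink deletes paths at no more than opsPerSec unlinks per
// second, leaving IOPS headroom for uploads and downloads. Illustration
// only; not the storagenode's actual approach.
func throttledUnlink(ctx context.Context, paths []string, opsPerSec float64) error {
    lim := rate.NewLimiter(rate.Limit(opsPerSec), 1)
    for _, p := range paths {
        if err := lim.Wait(ctx); err != nil {
            return err
        }
        if err := os.Remove(p); err != nil {
            return err
        }
    }
    return nil
}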
This makes sense, and is the right thing to do. Deleting a few TB per day is good enough; there is no hurry.
I’m just trying to understand why unlink takes so much time even though there are plenty of resources on the system and both the disks and the CPU are nowhere near capacity:
It’s not CPU bound:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
6571 root 57 43 0 2116M 680M uwait 1 292:33 8.20% storagenode
It’s not IO bound (barely 20 IOPS are hitting the disks):