Node using a lot of extra diskspace (ZFS)

So I was cleaning my server a bit, and noticed that the ZFS tank is using around 2TB more than reported.

df -h reports:

Filesystem                 Size  Used Avail Use% Mounted on
storage/STORJ              8.8T  4.8T  4.1T  55% /storage/STORJ

Exporter Data:
Total: 4.000 TB Used: 2.811 TB Trash: 49.02 GB

I’m running a 32TB 4-disk raidz1. Recordsize is 1M and compression is off.

Does anyone know the reason why I have around 2TB of missing disk space? Or have advice on how to cut it down a bit? :grin:

Please post the output of:

zfs list -t all

zfs list -t all
NAME                                                     USED  AVAIL     REFER  MOUNTPOINT
storage                                                 12.7T  8.37T      186K  /storage
storage/B                                               12.5G  8.37T     12.5G  /storage/B
storage/M                                               7.53T  2.47T     7.53T  /storage/M
storage/n                                                167G  8.37T      167G  /storage/M
storage/STORJ                                           4.76T  4.04T     4.76T  /storage/STORJ
storage/n                                                199G  8.37T      172G  /storage/n
storage/n@autosnap_2022-01-06_10:28:26_yearly           5.41G      -      194G  -


and also:

zfs get all storage/STORJ

and

zpool get all storage

You need to use df --si to get SI units for the usage (the Storj software uses SI units by default).

Since you are using a 1M record size, small chunks of data may be taking more space than they need. You should enable compression to reclaim the space wasted by such a large record size.
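For example, enabling it is a one-liner (a minimal sketch; note that compression only applies to data written after the change):

zfs set compression=lz4 storage/STORJ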

You can check the actual file sizes versus the usage on disk with

ls -ls --si /storage/STORJ/storage/
wim@Atlas:~$ df --si 
storage/STORJ      9.7T  5.3T  4.5T  55% /storage/STORJ

I’ve since enabled compression, but of course this is useless for the already stored data :neutral_face:
It is something that I do need to address though, as this node is filling up at a reasonably fast rate (+350GB/month for the last 2 months).

root@Atlas:~# zfs get all storage/STORJ
NAME           PROPERTY              VALUE                  SOURCE
storage/STORJ  type                  filesystem             -
storage/STORJ  creation              Tue Mar 23 15:18 2021  -
storage/STORJ  used                  4.75T                  -
storage/STORJ  available             4.05T                  -
storage/STORJ  referenced            4.75T                  -
storage/STORJ  compressratio         1.00x                  -
storage/STORJ  mounted               yes                    -
storage/STORJ  quota                 8.80T                  local
storage/STORJ  reservation           none                   default
storage/STORJ  recordsize            1M                     local
storage/STORJ  mountpoint            /storage/STORJ         default
storage/STORJ  sharenfs              off                    default
storage/STORJ  checksum              on                     default
storage/STORJ  compression           lz4                    local
storage/STORJ  atime                 off                    inherited from storage
storage/STORJ  devices               on                     default
storage/STORJ  exec                  on                     default
storage/STORJ  setuid                on                     default
storage/STORJ  readonly              off                    default
storage/STORJ  zoned                 off                    default
storage/STORJ  snapdir               hidden                 default
storage/STORJ  aclmode               discard                default
storage/STORJ  aclinherit            restricted             default
storage/STORJ  createtxg             5770656                -
storage/STORJ  canmount              on                     default
storage/STORJ  xattr                 sa                     inherited from storage
storage/STORJ  copies                1                      default
storage/STORJ  version               5                      -
storage/STORJ  utf8only              off                    -
storage/STORJ  normalization         none                   -
storage/STORJ  casesensitivity       sensitive              -
storage/STORJ  vscan                 off                    default
storage/STORJ  nbmand                off                    default
storage/STORJ  sharesmb              off                    default
storage/STORJ  refquota              none                   default
storage/STORJ  refreservation        none                   default
storage/STORJ  guid                  12066305309182303613   -
storage/STORJ  primarycache          all                    default
storage/STORJ  secondarycache        all                    default
storage/STORJ  usedbysnapshots       0B                     -
storage/STORJ  usedbydataset         4.75T                  -
storage/STORJ  usedbychildren        0B                     -
storage/STORJ  usedbyrefreservation  0B                     -
storage/STORJ  logbias               latency                default
storage/STORJ  objsetid              1965                   -
storage/STORJ  dedup                 off                    default
storage/STORJ  mlslabel              none                   default
storage/STORJ  sync                  standard               default
storage/STORJ  dnodesize             legacy                 default
storage/STORJ  refcompressratio      1.00x                  -
storage/STORJ  written               4.75T                  -
storage/STORJ  logicalused           4.84T                  -
storage/STORJ  logicalreferenced     4.84T                  -
storage/STORJ  volmode               default                default
storage/STORJ  filesystem_limit      none                   default
storage/STORJ  snapshot_limit        none                   default
storage/STORJ  filesystem_count      none                   default
storage/STORJ  snapshot_count        none                   default
storage/STORJ  snapdev               hidden                 default
storage/STORJ  acltype               off                    default
storage/STORJ  context               none                   default
storage/STORJ  fscontext             none                   default
storage/STORJ  defcontext            none                   default
storage/STORJ  rootcontext           none                   default
storage/STORJ  relatime              off                    default
storage/STORJ  redundant_metadata    all                    default
storage/STORJ  overlay               on                     default
storage/STORJ  encryption            off                    default
storage/STORJ  keylocation           none                   default
storage/STORJ  keyformat             none                   default
storage/STORJ  pbkdf2iters           0                      default
storage/STORJ  special_small_blocks  0                      default
root@Atlas:~# zpool get all storage
NAME     PROPERTY                       VALUE                          SOURCE
storage  size                           29.1T                          -
storage  capacity                       59%                            -
storage  altroot                        -                              default
storage  health                         ONLINE                         -
storage  guid                           11950497713507518777           -
storage  version                        -                              default
storage  bootfs                         -                              default
storage  delegation                     on                             default
storage  autoreplace                    off                            default
storage  cachefile                      -                              default
storage  failmode                       wait                           default
storage  listsnapshots                  off                            default
storage  autoexpand                     off                            default
storage  dedupratio                     1.00x                          -
storage  free                           11.7T                          -
storage  allocated                      17.4T                          -
storage  readonly                       off                            -
storage  ashift                         0                              default
storage  comment                        -                              default
storage  expandsize                     -                              -
storage  freeing                        0                              -
storage  fragmentation                  29%                            -
storage  leaked                         0                              -
storage  multihost                      off                            default
storage  checkpoint                     -                              -
storage  load_guid                      4180890661359379171            -
storage  autotrim                       off                            default
storage  compatibility                  off                            default
storage  feature@async_destroy          enabled                        local
storage  feature@empty_bpobj            active                         local
storage  feature@lz4_compress           active                         local
storage  feature@multi_vdev_crash_dump  enabled                        local
storage  feature@spacemap_histogram     active                         local
storage  feature@enabled_txg            active                         local
storage  feature@hole_birth             active                         local
storage  feature@extensible_dataset     active                         local
storage  feature@embedded_data          active                         local
storage  feature@bookmarks              enabled                        local
storage  feature@filesystem_limits      enabled                        local
storage  feature@large_blocks           active                         local
storage  feature@large_dnode            enabled                        local
storage  feature@sha512                 enabled                        local
storage  feature@skein                  enabled                        local
storage  feature@edonr                  enabled                        local
storage  feature@userobj_accounting     active                         local
storage  feature@encryption             enabled                        local
storage  feature@project_quota          active                         local
storage  feature@device_removal         enabled                        local
storage  feature@obsolete_counts        enabled                        local
storage  feature@zpool_checkpoint       enabled                        local
storage  feature@spacemap_v2            active                         local
storage  feature@allocation_classes     enabled                        local
storage  feature@resilver_defer         enabled                        local
storage  feature@bookmark_v2            enabled                        local
storage  feature@redaction_bookmarks    enabled                        local
storage  feature@redacted_datasets      enabled                        local
storage  feature@bookmark_written       enabled                        local
storage  feature@log_spacemap           active                         local
storage  feature@livelist               enabled                        local
storage  feature@device_rebuild         enabled                        local
storage  feature@zstd_compress          enabled                        local
storage  feature@draid                  enabled                        local

Although zpool get all shows ashift=0, zdb -C shows ashift=12 (I know a wrong ashift can cause these issues).

If my guess is correct (that the record size is involved), and since you cannot force compression onto already stored data, you can only try to migrate it to another location, using this guide: How do I migrate my node to a new device? - Storj Docs
It should store the data compressed in the new location.
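A minimal sketch of that approach (the dataset name storage/storjnew is just an example; the properties match the settings discussed in this thread):

zfs create -o recordsize=1M -o compression=lz4 storage/storjnew

Then copy the data over, with rsync or zfs send | zfs recv (both come up further down in this thread).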

But I would recommend first confirming my guess with the command

root@Atlas:~# ls -ls --si /storage/STORJ/storage/
total 69k
 13k -rw------- 1 systemd-coredump root             8.6k Dec 12 15:40 config.yaml
 512 drwx------ 4 systemd-coredump root                4 Mar 23  2021 orders
 37k -rw------- 1 systemd-coredump root              33k Apr 28 16:22 revocations.db
 13k drwx------ 6 systemd-coredump root               40 May  3 09:37 storage
6.7k -rw------- 1 systemd-coredump systemd-coredump 2.8k May  3 03:12 trust-cache.json

Not really sure what I’m looking for?

Seems you have another level, so the command should be

ls -ls --si /storage/STORJ/storage/storage/

The first column shows how much space is used on the disk; the size column in the middle shows how much space the file should actually need. You can already see a difference of almost 2x.
You can also check the files inside the /storage/STORJ/storage/storage/blobs/ folder - this is the storage for pieces.
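To get a feel for how many tiny pieces are involved, something like this should work (a sketch; the 4k size threshold is an arbitrary example for catching small pieces):

find /storage/STORJ/storage/storage/blobs/ -type f -size -4k | wc -l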

Storj blobs come encrypted (or at least they should!), so there’s no point in trying to compress them.

No offense, but I’ve done my research as well, and there’s a lot of debate suggesting that if you do not enable compression, ZFS will always use the max record size. Also, the current advice I’ve gathered from several sources is to always enable compression, except for specific workloads.

Edit: I lied, apparently I rebuilt the pool in 2020 :slight_smile:


Yes indeed, my mistake. I split storage and identity in the pool.
If I’m interpreting correctly, the left is the real used disk space and the right is the actual file size.

Checked one of the blobs. That’s painful: hundreds of 1.3k files taking up 1M or 512k each.
Example:

  1M -rw------- 1 systemd-coredump systemd-coredump 1.3k Nov 30 23:18 3v24cno63spty73iwafjs524erfehth3ml74hz5rklzdla6bzq.sj1
513k -rw------- 1 systemd-coredump systemd-coredump 1.3k Mar 28 04:34 4gpwx23xxp2tnueydrxhgnca7fg35cf2iluulo55ef45qpvyka.sj1

Guess I’ll make a new dataset, and rsync it over. Ty for the efforts!


Ah, so apparently zfs is even weirder than I thought!

Compression is useful - also for encrypted data. You should achieve a 1.2-1.3 ratio. Modern CPUs are fast at lz4 compression/decompression, and you get to store less data on your rust devices. So enable it.

Regarding your used space/configuration everything looks fine and as expected. See also this thread: disk usage wrong when using larger recordsize, raidz and ashift=12 · Issue #4599 · openzfs/zfs · GitHub

TL;DR: if you use RAIDZ with a lot of small files, the extra space for padding/parity can eat up more space than expected.
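A rough sketch of the arithmetic behind that (assuming ashift=12, i.e. 4k sectors, on a 4-disk raidz1, both taken from this thread): raidz pads every allocation to a multiple of (parity + 1) sectors, so even the smallest piece costs at least two sectors:

1.3k piece -> 1 data sector (4k) + 1 parity sector (4k) = 8k allocated
8k / 1.3k ≈ 6x inflation per small piece

And per the linked issue, the raidz deflate ratio used for space reporting assumes 128k blocks, which can inflate what df and ls -ls show even further with a 1M recordsize.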


I’m running mirrored VDEVs with the standard 128K recordsize - everything is fine (space- and performance-wise). See ZFS: You should use mirror vdevs, not RAIDZ. – JRS Systems: the blog.

Furthermore, I added a mirrored special device on fast SSDs for metadata only (small files stay on the rust devices) - see ZFS Metadata Special Device: Z - Wikis & How-to Guides - Level1Techs Forums - this guarantees fast access times to win the race for storing data. My observation: for small files you need around 5 GB of metadata space per 1 TB of actual data.
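A minimal sketch of adding such a special vdev (device paths are hypothetical; with the default special_small_blocks=0, only metadata lands on it):

zpool add storage special mirror /dev/disk/by-id/ssd-A /dev/disk/by-id/ssd-B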

You didn’t use something like ashift=13 rather than ashift=12, did you?

You would have to have a very large minimum sector/block size to get that amount of wasted space.

Let’s see the output of this command:

zpool get ashift

12 is 4k blocks; every increment halves or doubles the size:
ashift 9 = 512
ashift 10 = 1024
ashift 11 = 2048
ashift 12 = 4096 (4k, which works with any HDDs today)
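And since zpool get may report ashift=0 (meaning auto-detected), the actual per-vdev value can be confirmed with zdb, as was done earlier in this thread:

zdb -C storage | grep ashift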

Also running ZFS here, no issues… capacity used and stored data align just fine.
Using 256k recordsizes… but you can use any recordsize you want if you don’t mind the memory and cache issues from using the higher end…
My recommended recordsize for Storj is 256k, but I have tried all of them for months…
It doesn’t really change much.
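For the record, changing it is simple (a sketch; like compression, it only affects newly written files):

zfs set recordsize=256k storage/STORJ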

Recordsizes are adaptive… it is a maximum possible size, at least when using any compression type, even ZLE, which would be the minimum recommended…
ZLE being Zero Length Encoding… basically reducing empty space to simple codes…

Without that, I’m not 100% sure what happens with compression=none… it could mean records are written at full size…

The only way to apply “compression” to all the existing records would be to do zfs send | zfs recv
and basically send all the data to a new dataset…
On the upside, zfs send | zfs recv is rather good at working with Storj data…
It moves it at something like 10x of what rsync can do at the best of times.
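A minimal sketch of that (snapshot name assumed for illustration; a non-raw send is rewritten on receive with the target dataset’s compression settings):

zfs snapshot storage/STORJ@migrate
zfs send storage/STORJ@migrate | zfs recv -o compression=lz4 storage/storjnew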

@idfxken I see that thread was marked as Solved but it was not the answer that I had expected. Can you show the new results of your first post if you have since migrated to a different dataset?

@skookum
Quite a simple test: I enabled compression yesterday; check the dates, then check the disk vs real size.

513k -rw------- 1 systemd-coredump systemd-coredump 1.3k Apr 15 01:40 zfwyncydxgcxfpx4remq7hwezawoccjd6ukvblqkdwml4rww5a.sj1
6.7k -rw------- 1 systemd-coredump systemd-coredump 1.3k May  3 05:59 z5nxy5ys3loxudijtfzfy63ud25ezj4uyhvo4ahpzzbzbywwra.sj1

In this case I’m losing 512kb-1M for a 1.3k file (no idea why it’s not always 1M or 513k?)
After a quick search, there are thousands and thousands of 1.3k files in my dataset with these results…

I also tested with a batch from one blob folder, around 1GB in size.
New dataset: recordsize 1M, lz4 compression enabled

root@Atlas:~# cd /storage/STORJ/storage/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/
root@Atlas:/storage/STORJ/storage/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/64# cp -r * /storage/storjnew/
root@Atlas:/storage/STORJ/storage/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/64# du --si
966M	.
root@Atlas:/storage/STORJ/storage/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/64# du --si /storage/storjnew/
881M	/storage/storjnew/
root@Atlas:/storage/STORJ/storage/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/64# zfs get compressratio storage/STORJ
NAME           PROPERTY       VALUE  SOURCE
storage/STORJ  compressratio  1.01x  -
root@Atlas:/storage/STORJ/storage/storage/blobs/pmw6tvzmf2jv6giyybmmvl4o2ahqlaldsaeha4yx74n5aaaaaaaa/64# zfs get compressratio storage/storjnew
NAME              PROPERTY       VALUE  SOURCE
storage/storjnew  compressratio  1.32x  -

On this folder we’re talking savings of only about 10%, which is not 40%, but this is only a 1GB sample. And both tests indicate @Alexey is very probably on the right path.

@SGC as I mentioned above

As said above: LZ4 is enabled now (no idea why I disabled it on Storj :face_with_peeking_eye:)

Jup!

Best not to do zfs send | zfs receive inside the same pool. This can make a server die or crawl to a stop; I’ve seen load averages above 250 in these situations. rsync might be slow, but it’s pretty safe, and I can just keep the node running.
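For reference, a minimal sketch of the rsync approach (paths taken from this thread; the final pass is run with the node stopped, as in the migration guide linked above):

rsync -a /storage/STORJ/ /storage/storjnew/
rsync -a --delete /storage/STORJ/ /storage/storjnew/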

I’ll report back when it’s done, might be a while though.

Encrypted data is not very compressible (it can even increase in size), but the ZFS implementation compresses away the wasted space in a max-size record.
However, I expected to see more than a 1.3 ratio.

And this only reinforced my opinion that zfs is not the best solution for Storj, despite the abundance of configuration options. However, if you decide to use RAID, there is no alternative with today’s disks and the high probability of bitrot during a rebuild after losing one of the disks.
But RAID0 (striped) will not survive losing a disk, even with zfs.

I’ve used zfs send | zfs recv a few times on my server, never had a problem with it.
Of course it takes a long time, but rsync is just so slow I can’t spend 10x the time waiting for something that copies each file individually.