[Tech Preview] Hashstore backend for storage nodes

Maybe this will interest you.

Please check, maybe your node has it too:

The 122 version didn't solve the problem. The ingress is still down. The dashboard shows overused, but the disk has a lot of free space, and I know for sure the allocated space is not filled. When I started the migration, both nodes on the same machine had around 5 TB of data out of 7 TB allocated. Now, after 16 days, the migrated node shows overused?!?!, but the other, non-migrated node shows 5 TB as expected.
And I can't enable the full disk allocation feature, because it's Synology.

This is weird. I was not able to reproduce it in storj-up.
I'll share it with the team.

Why? And you can likely set a quota.

However, it's not a solution, of course.

Could you please try to restart this node?

I have to agree with snorkel. The numbers shown in the web interface are very wrong. 12 of my nodes have finished migrating; they are all on 'Dedicated disk'. 6 of them have been updated to 122.1.
The numbers are wrong both in 121.2 and 122.1 (all nodes have been restarted). The nodes are working fine, but the web interface is broken.

This is one of the 122.1 nodes:

[EDIT]
du -sh on the hashstore directory shows 2.8 TB used (Debian 12 with Docker).

My single hashstore node looks like @seanr22a's. The node is running version 122.1.

du -sh on the hashstore directory shows 961 GB used (Debian 12 with Docker). The huge difference between the avg disk space used numbers and the actual used space has been like this for weeks.

I've restarted like 4 times: node upgrade, DSM upgrade, simple restarts. All walkers finish their jobs. No errors. It's just a buggy space calculation in the software.
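
For anyone who wants to cross-check the dashboard against the disk themselves, a minimal sketch (the paths are placeholders for wherever your node's storage lives; adjust to your setup):

# Per-satellite store usage, then the hashstore total.
du -sh /path/to/storagenode/storage/hashstore/*/s*
du -sh /path/to/storagenode/storage/hashstore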

Just the log entries for hashstore compaction, for curious SNOs:

2025-02-16T05:36:44Z    INFO    hashstore       beginning compaction    {"Process": "storagenode", "satellite": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "store": "s1", "stats": {"NumLogs":867,"LenLogs":"0.8 TiB","NumLogsTTL":0,"LenLogsTTL":"0 B","SetPercent":1.0001442678378505,"TrashPercent":0.006859394085655081,"Compacting":false,"Compactions":0,"TableFull":0,"Today":20135,"LastCompact":0,"LogsRewritten":0,"DataRewritten":"0 B","Table":{"NumSet":2197504,"LenSet":"0.8 TiB","AvgSet":421631.6421248835,"NumTrash":25600,"LenTrash":"5.9 GiB","AvgTrash":248225.28,"NumSlots":8388608,"TableSize":"512.0 MiB","Load":0.261962890625,"Created":20132},"Compaction":{"Elapsed":0,"Remaining":0,"TotalRecords":0,"ProcessedRecords":0}}}
2025-02-16T05:36:45Z    INFO    hashstore       compact once started    {"Process": "storagenode", "satellite": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "store": "s1", "today": 20135}
2025-02-16T05:36:57Z    INFO    hashstore       compaction computed details     {"Process": "storagenode", "satellite": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "store": "s1", "nset": 2170579, "nexist": 2170579, "modifications": true, "curr logSlots": 23, "next logSlots": 23, "candidates": [], "rewrite": [], "duration": "11.870277823s"}
2025-02-16T05:37:08Z    INFO    hashstore       hashtbl rewritten       {"Process": "storagenode", "satellite": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "store": "s1", "total records": 2170579, "total bytes": "0.8 TiB", "rewritten records": 0, "rewritten bytes": "0 B", "trashed records": 75390, "trashed bytes": "29.6 GiB", "restored records": 27244, "restored bytes": "8.2 GiB", "expired records": 0, "expired bytes": "0 B"}
2025-02-16T05:37:09Z    INFO    hashstore       compact once finished   {"Process": "storagenode", "satellite": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "store": "s1", "duration": "24.105551293s", "completed": true}
2025-02-16T05:37:09Z    INFO    hashstore       finished compaction     {"Process": "storagenode", "satellite": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "store": "s1", "duration": "24.465421226s", "stats": {"NumLogs":867,"LenLogs":"0.8 TiB","NumLogsTTL":0,"LenLogsTTL":"0 B","SetPercent":0.9999980899555468,"TrashPercent":0.03432310241738654,"Compacting":false,"Compactions":0,"TableFull":0,"Today":20135,"LastCompact":20135,"LogsRewritten":0,"DataRewritten":"0 B","Table":{"NumSet":2170579,"LenSet":"0.8 TiB","AvgSet":426799.393334221,"NumTrash":75390,"LenTrash":"29.6 GiB","AvgTrash":421767.40427112347,"NumSlots":8388608,"TableSize":"512.0 MiB","Load":0.2587531805038452,"Created":20135},"Compaction":{"Elapsed":0,"Remaining":0,"TotalRecords":0,"ProcessedRecords":0}}}
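
If you want to keep an eye on these numbers over time, a rough one-liner to pull the summary fields out of the compaction lines (it assumes a docker container named storagenode and that jq is installed):

# Extract satellite, store, trash ratio, and hashtable load from each
# "finished compaction" summary; the grep -o pulls out the JSON payload.
docker logs storagenode 2>&1 | grep 'finished compaction' \
  | grep -o '{"Process.*}' \
  | jq '{satellite, store, trash: .stats.TrashPercent, load: .stats.Table.Load}'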

I'm not yet running Storj, just doing my research.

With this approach, nodes may do better with XFS than ext4. Ext4 can lock up momentarily when deleting large files as it deallocates space. I run Kafka, Cassandra, and MySQL in a production environment with gigabyte+ files, and to avoid this issue I use XFS.

With regards to fragmentation, ext4 uses several tricks to avoid it; particularly relevant for large files are the 128 MB block groups. Ext4 will try to put all of a file's data in the same block group, and when a block group is full it moves on to the next. Writing a few large files will largely avoid fragmentation.

XFS uses allocation groups (Chapter 13). More AGs allow more concurrent writes: if files being written in parallel live in different allocation groups, they won't fragment each other. XFS was designed for parallel reads and writes and excels at log storage (Kafka logs, Cassandra SSTables, and so on), so it will likely perform very well with hashstore. XFS performance can suffer when little free space remains and files get spread across multiple allocation groups. Allocation groups can be up to 1 TB in size, which is the default for filesystems over 4 TB (between 128 MB and 4 TB, 4 AGs are used). Hashstore may benefit from more AGs if more than 4 files are being written concurrently.
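
If anyone wants to try it, the AG count is fixed when the filesystem is created, so it has to be chosen up front. A sketch (the device is a placeholder, and agcount=8 is only an illustration; size it to your expected write parallelism):

# Create an XFS filesystem with 8 allocation groups instead of the default.
mkfs.xfs -d agcount=8 /dev/sdX
# After mounting, verify the geometry (agcount= shows in the first line):
xfs_info /path/to/mountpoint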

I don't know anyone running XFS here; most are running NTFS, ext4, or ZFS. But if anyone is skilled enough, they could try it and make a tuning guide, like the "on tuning ext4 fs for storagenodes" thread.

I saw that most people were running ext4 or ZFS, which have their strong points, like the fast commit option I was unaware of. The difference here is the deletion of large files causing an extended write lock; it's not a concern when deleting small files. It may also not be a problem if ext4 is mounted with data=journal, but that creates extra write load in general.
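
For reference, data=journal has to be in effect from mount time (it can't be switched on with a remount). A sketch of an /etc/fstab entry, with the UUID and mount point as placeholders:

# Full data journaling: data blocks pass through the journal as well as
# metadata, smoothing deletes at the cost of roughly doubled write volume.
UUID=xxxx-xxxx  /path/to/mountpoint  ext4  data=journal,noatime  0  2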

Another thought occurred to me: hashstore should play well with SMR drives. Furthermore, it should work with zoned storage drives if the hashstore files are stored in zones instead, which are typically 256 MB or 1 GB. I don't want to experiment badly enough to spend the cash on one off eBay, though.

Less than a TiB in s1.
Do you have any compaction logs for your s0 store?

Migration "finished" after something like 13 days, but there appear to be some dangling pieces that cannot be migrated:

2025-02-17T07:27:32+01:00	INFO	piecemigrate:chore	couldn't migrate	{"Process": "storagenode", "error": "opening the old reader: pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "opening the old reader: pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).migrateOne:318\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).processQueue:260\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).Run.func2:167\n\tstorj.io/common/errs2.(*Group).Go.func1:23", "sat": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "id": "WEOQ6T5BYKKB4AZDK23NQNTUURDOP7GHY64NLDVO37C2A4PL7RMQ"}
2025-02-17T07:27:33+01:00	INFO	piecemigrate:chore	couldn't migrate	{"Process": "storagenode", "error": "opening the old reader: pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)", "errorVerbose": "opening the old reader: pieces error: invalid piece file for storage format version 1: too small for header (0 < 512)\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).migrateOne:318\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).processQueue:260\n\tstorj.io/storj/storagenode/piecemigrate.(*Chore).Run.func2:167\n\tstorj.io/common/errs2.(*Group).Go.func1:23", "sat": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "id": "7FY52IOAKIO4WGT56OPC33TQ5JFLKJSU3IGZYUQPUM2EOONJKPVQ"}

The blobs directory has 3.5 GB of files remaining, but they just look like empty two-letter folders that for some reason report a size of 2.7 MB per folder (presumably directory metadata left over from when they held many files).

ncdu 1.18 ~ Use the arrow keys to navigate, press ? for help
--- /StoragePool/Storj/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa -------------
e   2.7 MiB [  0.1% ############# ]        /yt
e   2.7 MiB [  0.1% ############# ]        /yd
e   2.7 MiB [  0.1% ############# ]        /yc
e   2.7 MiB [  0.1% ############# ]        /xv
e   2.7 MiB [  0.1% ############# ]        /wy
e   2.7 MiB [  0.1% ############# ]        /wj
e   2.7 MiB [  0.1% ############# ]        /wb
e   2.7 MiB [  0.1% ############# ]        /w3
e   2.7 MiB [  0.1% ############# ]        /vh
e   2.7 MiB [  0.1% ############# ]        /vd
e   2.7 MiB [  0.1% ############# ]        /v3
e   2.7 MiB [  0.1% ############# ]        /uy
e   2.7 MiB [  0.1% ############# ]        /ul
e   2.7 MiB [  0.1% ############# ]        /u6
e   2.7 MiB [  0.1% ############# ]        /u3
e   2.7 MiB [  0.1% ############# ]        /tw
e   2.7 MiB [  0.1% ############# ]        /tn
e   2.7 MiB [  0.1% ############# ]        /sz
e   2.7 MiB [  0.1% ############# ]        /sr
e   2.7 MiB [  0.1% ############# ]        /sk
e   2.7 MiB [  0.1% ############# ]        /sf
e   2.7 MiB [  0.1% ############# ]        /s3
e   2.7 MiB [  0.1% ############# ]        /rt
e   2.7 MiB [  0.1% ############# ]        /rk
e   2.7 MiB [  0.1% ############# ]        /ri
e   2.7 MiB [  0.1% ############# ]        /rb
e   2.7 MiB [  0.1% ############# ]        /qy
 Total disk usage:   2.7 GiB  Apparent size:   3.0 KiB  Items: 1026

In case you're interested, here are the logs of my oldest node (started 02-2020), which currently holds roughly 2.5 TB and is fully migrated to hashstore:

... docker logs storagenode1 2>&1 | grep "compact"
2025-02-16T15:58:31Z    INFO    hashstore       beginning compaction    {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "stats": {"NumLogs":36,"LenLogs":"31.5 GiB","NumLogsTTL":2,"LenLogsTTL":"1.3 MiB","SetPercent":0.9845225368642283,"TrashPercent":0,"Compacting":false,"Compactions":0,"TableFull":0,"Today":20135,"LastCompact":0,"LogsRewritten":0,"DataRewritten":"0 B","Table":{"NumSet":195008,"LenSet":"31.0 GiB","AvgSet":170498.31046931408,"NumTrash":0,"LenTrash":"0 B","AvgTrash":0,"NumSlots":524288,"TableSize":"32.0 MiB","Load":0.3719482421875,"Created":20132},"Compaction":{"Elapsed":0,"Remaining":0,"TotalRecords":0,"ProcessedRecords":0}}}
2025-02-16T15:59:15Z    INFO    hashstore       compaction acquired locks       {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "duration": "44.907143996s"}
2025-02-16T15:59:15Z    INFO    hashstore       compact once started    {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "today": 20135}
2025-02-16T15:59:17Z    INFO    hashstore       compaction computed details     {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "nset": 196709, "nexist": 196714, "modifications": true, "curr logSlots": 19, "next logSlots": 19, "candidates": [58], "rewrite": [58], "duration": "1.46140901s"}
2025-02-16T15:59:18Z    INFO    hashstore       compact once finished   {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "duration": "2.973751187s", "completed": true}
2025-02-16T15:59:18Z    INFO    hashstore       finished compaction     {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "duration": "47.920332943s", "stats": {"NumLogs":35,"LenLogs":"31.5 GiB","NumLogsTTL":1,"LenLogsTTL":"0.7 MiB","SetPercent":1,"TrashPercent":0.13433983664011503,"Compacting":false,"Compactions":0,"TableFull":0,"Today":20135,"LastCompact":20135,"LogsRewritten":1,"DataRewritten":"0 B","Table":{"NumSet":196709,"LenSet":"31.5 GiB","AvgSet":171689.8792022734,"NumTrash":55899,"LenTrash":"4.2 GiB","AvgTrash":81165.17003882001,"NumSlots":524288,"TableSize":"32.0 MiB","Load":0.37519264221191406,"Created":20135},"Compaction":{"Elapsed":0,"Remaining":0,"TotalRecords":0,"ProcessedRecords":0}}}

Here:

2025-02-16T22:44:28Z    INFO    hashstore       beginning compaction    {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "stats": {"NumLogs":84,"LenLogs":"83.6 GiB","NumLogsTTL":0,"LenLogsTTL":"0 B","SetPercent":1.0446324824690063,"TrashPercent":0,"Compacting":false,"Compactions":0,"TableFull":0,"Today":20135,"LastCompact":0,"LogsRewritten":0,"DataRewritten":"0 B","Table":{"NumSet":399552,"LenSet":"87.3 GiB","AvgSet":234551.10171391958,"NumTrash":0,"LenTrash":"0 B","AvgTrash":0,"NumSlots":1048576,"TableSize":"64.0 MiB","Load":0.38104248046875,"Created":20131},"Compaction":{"Elapsed":0,"Remaining":0,"TotalRecords":0,"ProcessedRecords":0}}}
2025-02-16T22:44:28Z    INFO    hashstore       compact once started    {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "today": 20135}
2025-02-16T22:44:29Z    INFO    hashstore       compaction computed details     {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "nset": 391274, "nexist": 391274, "modifications": true, "curr logSlots": 20, "next logSlots": 20, "candidates": [], "rewrite": [], "duration": "1.400793419s"}
2025-02-16T22:44:30Z    INFO    hashstore       hashtbl rewritten       {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "total records": 391274, "total bytes": "83.5 GiB", "rewritten records": 0, "rewritten bytes": "0 B", "trashed records": 22358, "trashed bytes": "6.4 GiB", "restored records": 0, "restored bytes": "0 B", "expired records": 0, "expired bytes": "0 B"}
2025-02-16T22:44:30Z    INFO    hashstore       compact once finished   {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "duration": "2.432951267s", "completed": true}
2025-02-16T22:44:30Z    INFO    hashstore       finished compaction     {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "store": "s0", "duration": "2.673125788s", "stats": {"NumLogs":84,"LenLogs":"83.6 GiB","NumLogsTTL":0,"LenLogsTTL":"0 B","SetPercent":1,"TrashPercent":0.0766574561720715,"Compacting":false,"Compactions":0,"TableFull":0,"Today":20135,"LastCompact":20135,"LogsRewritten":0,"DataRewritten":"0 B","Table":{"NumSet":391274,"LenSet":"83.6 GiB","AvgSet":229280.05119685948,"NumTrash":22358,"LenTrash":"6.4 GiB","AvgTrash":307587.52088737814,"NumSlots":1048576,"TableSize":"64.0 MiB","Load":0.37314796447753906,"Created":20135},"Compaction":{"Elapsed":0,"Remaining":0,"TotalRecords":0,"ProcessedRecords":0}}}

So… s0, s1, s2 are the satellites?

I don't think so; the s0, s1… directories are inside the satellite folders.

For each satellite there is an s0 and an s1 hashtable. One is the active one that gets all the writes, and one is the passive one that can get compacted. We don't want writes and compaction happening at the same time; that's why there are two of them. And every now and then they switch roles, so that compaction gets executed on both, one by one.
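
If you want to see the pair on your own node, a quick look (the path is a placeholder; the layout follows what's described above):

# Each satellite folder under hashstore contains the two stores.
ls /path/to/storagenode/storage/hashstore/*/
# expected per satellite: s0  s1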

My solution for these pieces:

find blobs/* -type f -empty -delete
find blobs/* -type d -empty -delete
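
A cautious variant: stop the node first and dry-run with -print, so you can review what would be removed before repeating the commands with -delete as above.

# List the empty files and folders without deleting anything yet.
find blobs/* -type f -empty -print
find blobs/* -type d -empty -print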