[Tech Preview] Hashstore backend for storage nodes

Thanks, but I mean: in my case, something is causing compaction on some nodes to take an extremely long time. 54 hours seems like an extreme value, and it has happened to me on multiple nodes. I don’t know if it’s due to a loop related to 0-byte files or something else. I just wanted to highlight my situation and let the team know that other nodes might face similar issues, especially nodes run by people who don’t even read the forum. I wasn’t looking for a specific fix for my case but for a fix in the node software.

The longest compaction I found on my nodes is 24 s, with 3 TB stored on that node.

… causing compaction on a FEW nodes to take an extremely long time

The plan is to start by using hashstore for new writes (WriteToNew). Active migration will be started later.

The rollout is driven by the satellites (during check-ins). Current config:

Write new data to hashstore:

SLC - 5%
AP1 - 0%
EU1 - 0%
US1 - 0%

Active migration:
ALL - 0%

It likely doesn’t make any difference, as 37% of the nodes are already on hashstore (a manual setting by the node operators).

You can monitor your current status by checking the .migrate file.

For example (WriteToNew is the important part):

cat config/storage/hashstore/meta/12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.migrate

{
  "PassiveMigrate": false,
  "WriteToNew": true,
  "ReadNewFirst": true,
  "TTLToNew": true
}

For active migration, check the migrate_chore file
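If you run several satellites and want the status at a glance, a small shell helper can print the WriteToNew flag from every .migrate file. This is just a sketch: the `migrate_status` function name is mine, and the default meta path mirrors the layout shown above; adjust it to your node's setup.

```shell
# migrate_status: print "<satellite>: <WriteToNew flag>" for each .migrate
# file under the given meta directory.
# The default path is an assumption based on the example above.
migrate_status() {
  dir="${1:-config/storage/hashstore/meta}"
  for f in "$dir"/*.migrate; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    sat=$(basename "$f" .migrate)
    # extract the boolean after "WriteToNew" without needing jq
    flag=$(sed -En 's/.*"WriteToNew": *(true|false).*/\1/p' "$f")
    echo "$sat: $flag"
  done
}

# Usage:
#   migrate_status                                   # default layout
#   migrate_status /mnt/node/config/storage/hashstore/meta
```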


.. of which 25% are @Vadim :wink:

But seriously, I am surprised that so many early adopters have made the manual switch. :slight_smile:


as 37% of the nodes are already on hashstore

Very interesting figure! Is that 37% of nodes that are completely migrated, that have ongoing migrations, or that have merely enabled the WriteToNew flag?


Just the WriteToNew flag. And the 37% includes the Select network as well (which was migrated months ago…)


Ah, that makes more sense although I have no idea how Select’s size compares to Public.

In all honesty, Select seems so different from Public in so many ways that quite frankly I just pretend it doesn’t even exist.

The downside of short compaction is that the amount of reclaimed space is relatively small. On my node that is not a problem: I still have a lot of free space and I don’t mind trading it for a short compaction runtime. There might be a point in the future at which my node is almost full. At that point I would change my node’s config to reclaim more space, with the downside of spending more time running compaction. But hey, at that point we would be talking about maximum payout and just optimising for some extra cents. It’s like the final challenge that I would love to get to.


I have free space. Any advice on how to reduce compaction times at the expense of a bit of wasted space? On 10 TB nodes it’s starting to become a heavy task.

      --hashstore.compaction.alive-fraction float                if the log file is not this alive, compact it (default 0.25)
      --hashstore.compaction.probability-power float             power to raise the rewrite probability to. >1 means must be closer to the alive fraction to be compacted, <1 means the opposite (default 2)

Decreasing the first one would queue up fewer log files for compaction, but it also means more dead space kept on disk. I would also increase the second value to decrease the probability that compaction picks up a log file that is close to the alive threshold.
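As an illustration of that advice, the two flags could be set like this in config.yaml. The values below are made up for the example, not a tested recommendation; the right trade-off depends on your free space.

```yaml
# Illustrative values only: compact less aggressively, keep more dead space.
hashstore.compaction.alive-fraction: 0.15     # default 0.25; lower means fewer log files qualify
hashstore.compaction.probability-power: 3     # default 2; higher means logs near the threshold are less likely to be rewritten
```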


Well, the first step is to look at system behaviour during this long compaction: disk queues, disk IO, memory pressure, wait times, compute pressure, etc. Otherwise it’s unproductive guesswork.

Repeatedly parsing zero-sized files contributes to IO and needs to be fixed regardless.

What’s the difference between “finished compaction” and “compact once finished”?

There also appears to be some shenanigans with time formatting:

storj-five# grep "compact once finished" /var/log/storagenode.log | grep -o '{.*}' | jq '[.stats.LogsRewritten, .duration] | join(" ")'  
" 426.025834ms"
" 1.759233558s"
" 13.31169607s"
" 16.526177553s"
" 14.228507974s"
" 14.692018117s"
" 19.417338659s"
" 1.840648903s"
" 19.196724614s"
" 33.832226ms"
# grep "finished compaction" /var/log/storagenode.log | grep -o '{.*}' | jq '[.stats.LogsRewritten, .duration] | join(" ")' 
"3 426.204768ms"
"8 1m0.518026162s"
"9 19.417862644s"
"6 21.037603812s"
"0 34.087853ms"

What is 1m0.518026162s?

Compact once finished
One compaction pass; many can run if the store requires it, for each satellite’s s0/s1 store.

Finished compaction
All compaction passes have completed for that satellite’s s0/s1 store.

You will likely see multiple “compact once finished” entries before a “finished compaction” is logged.


Is “compact once” a loop per satellite or a single compaction pass, and do these passes keep being scheduled until no more logs satisfy the compaction criteria?

A few days ago (when the latest deletion batch was purged) I had some fairly long compaction sessions as well. I think they took 6-8 hours, up from the normal range of milliseconds to a few minutes. They also ran many compact passes before completing.

Are you back to normal compaction times again?


It’s one pass of compaction per store per satellite (two stores for each satellite). It will continue its passes until compaction is satisfied.

I think we are actually trying to say the same thing; I’m just less fluent in English :wink:


This is JSON and supposed to be machine-readable; there is no need for acrobatics with units. Always write the duration in seconds, without units. That would remove all of these issues.
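Until the log format changes, the Go-style duration strings can be normalized on the fly. A sketch of a helper (the `to_seconds` name is mine; it only handles the h/m/s/ms units seen in these logs, not µs or ns):

```shell
# to_seconds: convert a Go duration string (e.g. "1m0.518026162s",
# "426.025834ms") to plain seconds. Handles h, m, s and ms only.
to_seconds() {
  echo "$1" | awk '{
    s = 0; d = $0
    # hours component, e.g. "2h..."
    if (match(d, /[0-9]+h/)) { s += substr(d, RSTART, RLENGTH - 1) * 3600; d = substr(d, RSTART + RLENGTH) }
    # minutes component, e.g. "1m0.5..." (digit after the m distinguishes it from "ms")
    if (match(d, /[0-9]+m[0-9]/)) { s += substr(d, RSTART, RLENGTH - 2) * 60; d = substr(d, RSTART + RLENGTH - 1) }
    # trailing milliseconds or seconds
    if (match(d, /[0-9.]+ms$/)) { s += substr(d, RSTART, RLENGTH - 2) / 1000 } else if (match(d, /[0-9.]+s$/)) { s += substr(d, RSTART, RLENGTH - 1) }
    printf "%.6f\n", s
  }'
}

# Example: to_seconds 1m0.518026162s   -> 60.518026
```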

Thank you for the insight and clarification. 37% is by no means an insignificant number of people. It could point towards the forum having many, many more lurkers than I anticipated. Reddit published numbers in the mid-2010s stating that ~95% of users did not participate in comments and ~98% did not post. With the huge push to mobile devices in the last couple of years, I can only imagine those numbers are higher now.

They are very old nodes; probably a consequence of the deletion batch. I will monitor the next compaction runs.