How do we collect this data?
I only see a large change in repair upload, i.e. from satellites.
When I reported poor performance on hashstore nodes, I wasn't looking at success rates. I was looking only at monthly average stored, which is the essential metric, since it gives us part of the payout.
So if you still want to compare performance, compare this data from month to month between nodes.
I also look at ingress, egress, and total stored over the last 30/90/120 days. I don't have enough data on hashstore yet, and I'm seeing massive deletes this month across hashstore and non-hashstore nodes - it's looking to be the worst month in 5 months at the moment.
RAM speed is great. However, my impression from the rollout announcement was that the speed gains come from performing sequential writes to disk instead of random writes:
What you are saying sounds like uploads get stored in memory and are eventually written to disk at some later time. So is this like an in-memory writeback cache that holds all successfully uploaded pieces until they get flushed to disk?
I cannot verify this, but what I have read was:
https://review.dev.storj.tools/c/storj/storj/+/14910
Data is stored in log files (called extents in ShardStore). Each log
file is an append only file where piece data is written into it
sequentially. When a new piece comes in, an open and available log file
handle is selected and claimed and the piece is written into that log
file. Then the log file is flushed to disk and returned.
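To make the quoted description a bit more concrete, here is a minimal Go sketch of that append-only pattern. The names and structure are my own illustration, not the actual hashstore code:

```go
package main

import (
	"bytes"
	"io"
	"log"
	"os"
)

// appendPiece writes one fully received piece to the end of an open log file
// and returns the offset at which it starts. This only illustrates the
// append-only pattern quoted above; it is not the real hashstore code.
func appendPiece(logFile *os.File, piece io.Reader) (offset int64, err error) {
	// The piece is recorded at the current end of the log file.
	offset, err = logFile.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	// The whole piece is appended sequentially. The write first lands in the
	// OS page cache; when it reaches the disk is up to the kernel unless an
	// explicit fsync is requested.
	if _, err := io.Copy(logFile, piece); err != nil {
		return 0, err
	}
	return offset, nil
}

func main() {
	logFile, err := os.OpenFile("pieces.log", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer logFile.Close()

	off, err := appendPiece(logFile, bytes.NewReader([]byte("piece payload")))
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("piece stored at offset %d", off)
}
```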
If all pieces are uploaded to RAM, what determines how much data will be kept in volatile memory and would be lost e.g. on a power loss? Because from what you are saying I assume that once a piece has been written to memory it is considered uploaded to the satellite, even if it is not yet flushed to disk.
Also, this sounds like there is no benefit from utilizing SSDs either directly or indirectly as a write cache (e.g. via LVM), and you'd always have to throw more RAM at a node for upload performance optimization?
Yes, deletions are very bad at the moment. I am wondering if this is a result of
and I am wondering what will happen due to the new pricing and its effects on existing customers. The fact that existing projects are exempt from the new pricing for a full year might be telling for what to expect otherwise, who knows.
Hopefully customers won't flee and new customers will appreciate the new pricing.
Yes, all pieces are uploaded to RAM initially, as with piecestore, then flushed to the disk asynchronously (so the OS may still keep them in memory), and the node reports to the uplink that the upload was successful.
In some cases you may still use it, if you see the disk as a bottleneck (e.g. if your internet speed is faster than your HDD). But you would likely use it as tiered storage, not as a special device for metadata.
This is what is still not clear to me, as there are different options, and it is basically what distinguishes a writeback cache from a writethrough cache: will the upload be reported as successful to the uplink when the piece has been stored in memory, or only after it has been flushed to the disk?
It will be reported as uploaded when the OS reports that the file is saved successfully. However, since we do not use a forced sync write, the file may actually still be in memory; it depends on the OS.
Uh, a happy side-effect, or a deliberate trade-off of the current sequential write implementation. Piecestore dropped piece data into a blob file as soon as it received it, because in case of a failure/cancellation it was always easy to just remove the file. Removing partially written pieces would be more complex with log storage, so instead the code waits until all data of a piece is received, and only then writes it into the file. This write is also not fsynced, so the OS is free to not append piece data immediately, but coalesce with other pieces.
You can force synchronous writes with STORJ_HASHSTORE_STORE_SYNC_WRITES. If you do not, it depends on your OS. For example Linux has parameters like dirty_expire_centisecs to force flushing buffered data after some period of time.
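For illustration, the difference between the two modes comes down to whether an fsync happens before success is reported. Here is a minimal sketch under that assumption; the env-var name comes from the post above, but everything else, including how the value is parsed, is made up for the example:

```go
package main

import (
	"log"
	"os"
)

// writePiece appends data to the log file and, only if sync writes are
// enabled, forces the kernel to flush it to stable storage before returning.
// Without the fsync the data may sit in the OS page cache (on Linux up to
// roughly vm.dirty_expire_centisecs) before it actually reaches the disk.
func writePiece(f *os.File, data []byte, syncWrites bool) error {
	if _, err := f.Write(data); err != nil {
		return err
	}
	if syncWrites {
		// Roughly what enabling STORJ_HASHSTORE_STORE_SYNC_WRITES implies:
		// do not report success until the data is on stable storage.
		return f.Sync()
	}
	return nil
}

func main() {
	// Illustration only: treat any non-empty value as "enabled".
	syncWrites := os.Getenv("STORJ_HASHSTORE_STORE_SYNC_WRITES") != ""

	f, err := os.OpenFile("example.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := writePiece(f, []byte("piece data"), syncWrites); err != nil {
		log.Fatal(err)
	}
}
```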
Indeed, hashstore should mostly make SSD caches irrelevant. At the same time hashstore should in general require less memory. And in the worst case, the equivalent of write caching on an SSD is now just setting up swap on that SSD.
The default is a non-fsynced write, so data is still likely only in RAM when the node reports completion.
That's interesting. So it completely depends on the OS? Then it may be fast when it reports a piece as successfully uploaded while it is still in memory, and slow if it waits until it has been flushed to disk?
But in the latter case, if it flushed to an SSD it would still be faster than an HDD, correct?
Basically that was what my question was about: why not write (optionally?) to an SSD first? It's fast, cheap, non-volatile, and can likely be expanded and made faster more easily than RAM, while writing to an HDD will always be the slowest path even if done sequentially.
From what you are saying I understand at least that directing writes to an SSD (e.g. via caching) could still be beneficial and is not automatically obsolete even with hashstore.
You are somewhat right - but you should factor in other things as well.
RAM will always be the fastest; no SSD or NVMe can compare.
Does your rig experience hard power outages on a regular basis?
I.e. no UPS and regularly unstable power distribution.
In that case, maybe you are better off buying a small UPS and setting up your server to shut down when an outage happens. With this setup you won't need to worry about losing data in such a scenario, and you also protect the filesystem from corruption (if you use one where that can happen).
In the end, it's a risk decision.
RAM is faster anyway. You may use tiered storage with an SSD if you are so worried about it, but I agree with @mike - it's better to have a UPS instead.
Yep. Your choice, just set the STORJ_HASHSTORE_STORE_SYNC_WRITES flag for the latter.
You could write code to do that, not disputing that. But it would end up more complex, and even now total usage of RAM with defaults should¹ be lower (comparing storagenode reserved memory + the necessary OS metadata cache to make piecestore work vs. storagenode memory with all those buffers and a hashtable of whichever kind). So if you had enough RAM for piecestore, you should have enough for hashstore as well.
Consider that the biggest blob now is just 2.3 MB. Even if you had 100 concurrent writes, that's just 230 MB of RAM to keep them in memory waiting for a write. This, compared to the memory saved on piece metadata on any kind of medium-sized node, is almost trivial.
¹ should, because looking at other threads, there seem to be some problems specifically on Windows.
As far as I understand, it is not really a choice, as it does not bind the OS. It rather says: let the OS decide. The OS can acknowledge while the file is still only in memory, but it can also do so only after the file has been written to disk.
Hence my saying the code is already there in a way. As it was not really intended for such a use case, it is probably not very efficient. But what I was saying is that, with the settings that exist and with a migration policy in place, it would be possible to have the piecestore location set on an SSD, direct uploads to piecestore, and have the pieces migrated at intervals to hashstore.
As piecestore is said to be less efficient, maybe something similar can sometimes be done with hashstore only. Because if it is up to the OS to decide whether to acknowledge an upload while the file is still in memory or only after it has been written to the HDD, then there is still potential for a performance increase if, in that case, uploads were directed to the SSD first and then moved to the HDD later in the background. That might bring down slower uploads even further.
You have described tiered storage for the third time. I do not understand what prevents you from configuring it. And why do you want to complicate the node with its own (likely not ideal) limited implementation of an already mature OS kernel function?
Maybe instead figure out when your OS is forcing a write to the HDD while holding the storage node back from replying to clients, find the root cause, and configure your OS, as opposed to adding what is effectively just a workaround for this behavior? I mean, it looks to me like you're describing a real observation, but instead of solving it at the source, you prefer to just deal with the symptoms.