Avg disk space used dropped by 60-70%

If you already use Grafana, you can calculate your own estimate, even if the report from a particular satellite is missing. Also, do not count missing days in your average calculation.

Maybe I'll have to do something like that.
Currently I am using the already-averaged data from the API, not the per-day values.

Normally, I think that if there is no data for a day, it should not be counted as 0. That seems wrong: no data does not mean zero, it means the data is missing, so perhaps the last available value should be used instead.
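
As an illustration (a minimal sketch with made-up numbers, not the dashboard's actual logic), carrying the last known value over a gap gives a very different average than counting missing days as zero:

```python
# Sketch: average daily used space while skipping gaps instead of counting them as 0.
# The sample values are made up; a real script would pull per-day data from the node API.
daily_used_tb = {1: 5.2, 2: 5.3, 3: None, 4: None, 5: 5.5}  # None = report missing

# Wrong: missing days counted as 0 drag the average down.
naive_avg = sum(v or 0 for v in daily_used_tb.values()) / len(daily_used_tb)

# Better: carry the last available value forward over the gap.
filled, last = [], None
for day in sorted(daily_used_tb):
    value = daily_used_tb[day]
    if value is not None:
        last = value
    if last is not None:
        filled.append(last)

gap_aware_avg = sum(filled) / len(filled)
print(f"naive: {naive_avg:.2f} TB, gap-aware: {gap_aware_avg:.2f} TB")
```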

But again, what use is data that I calculate however I like? That is not what I want to see. I need accurate satellite data I can rely on; if I make my own calculations, I could make up anything.

Perhaps you need to use the latest available report instead. It seems the average is way off with these gaps (and I do not expect that to change, unfortunately, at least for now).

Yes, I agree; however, we have what we have now, and it's unlikely to be fixed soon…

Right now it's not guaranteed, I'm sorry. It will be precise for the past month, but not for the current period, at least for now. So the estimation would be… an estimation?

Using the Prometheus endpoint to calculate the payouts, for example, is not reliable at all; I looked, and it seems there is no TBm value nor payout figure exported there.
Thus you can only use the node- or satellite-reported instantaneous used space and multiply that by the payout rate, but as I realized during these load tests, this is not representative of the paid data at all: the node might have gained 10TB in the last week of the month, but of course you won't get paid for 10TB if the node only held it for 7 days. And the satellite data have gaps in them as well, so this gets even more unreliable if there was no BF for the last two weeks while some data was actually purged from the network.
So, after abandoning the Prometheus node_exporter written in Python, which grabbed the data via the HTTP API, and after migrating to the node's Prometheus endpoint, I'm no longer wasting time looking at those metrics, as they are either missing or simply wrong.
And it helps with the anxiety not to look at that every day and try to solve all the unsolvable problems by throwing more and more hardware at the thing, or by making purchasing decisions based on predicted payouts that are simply not accurate.
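
To make that concrete, here is rough back-of-the-envelope arithmetic (my own sketch of TB-month accounting, not the satellite's actual billing code): payout is based on byte-hours over the month, so 10 TB held for only the last 7 days of a 30-day month is worth roughly a quarter of a full 10 TBm.

```python
# Rough TB-month estimate from daily stored-space samples (assumed, not the satellite's code).
HOURS_PER_DAY = 24

def tb_months(daily_stored_tb, hours_in_month=720):
    """Sum TB*hours over the month and normalize by the month length."""
    tb_hours = sum(tb * HOURS_PER_DAY for tb in daily_stored_tb)
    return tb_hours / hours_in_month

# Node empty for 23 days, then gains 10 TB for the last 7 days of a 30-day month:
samples = [0.0] * 23 + [10.0] * 7
print(f"{tb_months(samples):.2f} TBm")  # ~2.33 TBm, not 10 TBm
```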

1 Like

You are very correct, unfortunately. Right now, we (SNOs) do not have a reliable source of truth. The node is paid for the submitted orders, but our dashboard doesn't use this info! It uses the (probably) correct data about used space from the databases, and some scripts try to compare the (unreliable) used space reported by the satellites with the (also unreliable) information from the local databases (which are updated only in the happy case: no failed filewalkers, no database errors)… I would assume that using the dashboard for anything is a waste of time, unless you have no database- or filewalker-related errors (do you?)…

2 Likes

Exactly.
Maybe the solution would be to make the Storagenode headless (ideally with an option to dedicate the whole mountpoint, so we would no longer have to babysit the available space), and to make the dashboard a sort of cloud dashboard on the satellite side.
To get what the satellite thinks we have, we would log in or call some API, and to get the actual used space, we could use the tools each OS provides. Then we could compare these values on our own to see if there might be some potential problems.
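
That kind of comparison can already be scripted by hand today; here is a rough sketch (the API URL and JSON field names are assumptions about the local node dashboard API, and the mountpoint is made up):

```python
# Sketch: compare what the node API reports with what the OS reports for the same mount.
# The URL and JSON field names are assumptions; adjust to your node's actual API.
import json
import shutil
import urllib.request

NODE_API = "http://localhost:14002/api/sno"   # assumed local dashboard API
MOUNTPOINT = "/mnt/storagenode"               # the disk dedicated to the node

with urllib.request.urlopen(NODE_API) as resp:
    node_info = json.load(resp)
node_used = node_info["diskSpace"]["used"]     # bytes, as the node believes

os_usage = shutil.disk_usage(MOUNTPOINT)       # what the OS actually sees

print(f"node-reported used: {node_used / 1e12:.2f} TB")
print(f"OS-reported used:   {os_usage.used / 1e12:.2f} TB")
```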

I also don't quite get why there is a satellite job running each day to recalculate the used space on all the nodes.
Why not make this realtime in some sort of in-memory database, such as Redis, where after submitting the orders the database would simply be updated with used_space += uploaded_piece_size and egress += downloaded_piece_size for each node: two update statements per node per order, I assume, plus some more when data is deleted from the network.
Then all these recalculations would only be needed as a source of truth for payout purposes, and there would be no late data, no gaps, etc.
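
A minimal sketch of the idea with redis-py (the key and field names are made up; this is not anything the satellite currently does):

```python
# Sketch: per-node realtime counters kept in Redis, updated as orders are settled.
# Key layout and field names are invented for illustration.
import redis

r = redis.Redis(host="localhost", port=6379)

def settle_order(node_id: str, uploaded_piece_size: int, downloaded_piece_size: int):
    """Apply the two per-order updates: used space and egress."""
    key = f"node:{node_id}:usage"
    r.hincrby(key, "used_space", uploaded_piece_size)
    r.hincrby(key, "egress", downloaded_piece_size)

def delete_piece(node_id: str, piece_size: int):
    """When a piece is removed from the network, give the space back."""
    r.hincrby(f"node:{node_id}:usage", "used_space", -piece_size)
```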

1 Like

Something like a "storage.use-os-free-space: true" config.yaml flag would be really nice (even if you maybe also had to specify the mount/drive to go with it).

I don't care how much space a node uses: just that it tries to leave 500GB free.
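
As an illustration of that proposed behaviour (just a sketch; neither the flag nor this check exists in the storagenode today, and the mountpoint is made up):

```python
# Sketch of the proposed behaviour: instead of a fixed allocation, keep accepting
# uploads as long as the OS reports more than a configured amount of free space.
import shutil

MOUNTPOINT = "/mnt/storagenode"        # drive/mount dedicated to the node
KEEP_FREE_BYTES = 500 * 1000**3        # "just leave 500GB free"

def can_accept_upload(piece_size: int) -> bool:
    free = shutil.disk_usage(MOUNTPOINT).free
    return free - piece_size > KEEP_FREE_BYTES

print(can_accept_upload(64 * 1024**2))
```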

5 Likes

It would be a little bit more complicated. The satellite calculates the used space based on signed orders sent from the nodes. That means the statistical information would have to be added to the same request from the customer, and it would slow down the response. We do not want to slow down, we need to speed up!
So, I do not think that collecting the statistics on the fly is a good move; thus it's a separate process. Summarizing the data is never fast, as far as I know.

Or do you suggest implementing it on the node's side? Because right now the node does exactly the suggested steps: when the upload is finished, it increases the used space in the database; if the piece is trashed, it reduces the used space and increases the trash used space; if the piece is deleted due to expiration (TTL), it reduces the used space; if the piece is deleted from the trash due to expiration (trash filewalker), it reduces the trash used space.
Databases are cached in memory and flushed to disk every hour by default. So, a kind of temporary Redis already.
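
As I understand that description, the pattern is roughly the following (a simplified sketch, not the actual storagenode code):

```python
# Sketch: in-memory usage counters flushed to a SQLite database periodically,
# mirroring the "cached in memory, flushed every hour" behaviour described above.
import sqlite3
import time

class UsageCache:
    def __init__(self, db_path: str, flush_interval_s: int = 3600):
        self.used_space = 0
        self.trash_space = 0
        self.flush_interval_s = flush_interval_s
        self.last_flush = time.monotonic()
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS usage (used INTEGER, trash INTEGER)")

    def piece_uploaded(self, size: int):
        self.used_space += size
        self.maybe_flush()

    def piece_trashed(self, size: int):
        self.used_space -= size
        self.trash_space += size
        self.maybe_flush()

    def maybe_flush(self):
        # Persist the counters at most once per flush interval.
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            with self.db:
                self.db.execute("DELETE FROM usage")
                self.db.execute("INSERT INTO usage VALUES (?, ?)",
                                (self.used_space, self.trash_space))
            self.last_flush = time.monotonic()
```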

1 Like

That's a good idea! And specify it for the nodes which use the whole disk/dataset/partition, or set up quotas.
Could you please add it to the Storage Node feature requests - voting - Storj Community Forum (official)?

I cannot imagine this would bring any noticeable performance hit, especially as satellites communicate with nodes and customers over long distances with significant latencies.
You can also make this call async and delegate it to another process, so it won't slow down the usual stuff.

On the satellite side.
On the node side it was fine until the satellites stopped informing the nodes about deleted pieces in realtime. Now, to estimate how much data is unpaid, we have to wait for the satellite report, which lately is very unreliable because it takes so long.
How long will the counting take if the network becomes 10 or even 100 times as big as it is now?
And speeding it up probably won't be as easy as throwing more cores and memory at it, as you would have done that already if it were.
So fundamentally this approach isn't a good one, I think.

3 Likes

You are right, we must find a way to speed up the process. If the network grows to the point that the calculation can't finish within a few days at the end of a month, then how long must SNOs wait to get paid?

SNOs have been getting paid at the start of every month, like clockwork, for years. Haven't they?

Numbers in the node UI have nothing to do with if nodes get paid or not.

1 Like

Yeah, the clockwork varies from the 4th to the 14th of the month, lol. And I'm not talking about today or tomorrow.

This usually depends on when all nodes submit their orders (up to 48h), on working days (because if the payout gets stuck on a weekend, we need someone who can restart the payout or fix any issues), on some retries (because of fee fluctuations) to send as many payouts as possible, and on issues like the last one, when zkSync Lite suddenly stopped supporting paying the fee with STORJ tokens. We waited a little longer to get a response from Matter Labs, and then sent the stuck zkSync Lite payouts via L1 as a fallback (with its Minimum Payout Threshold unfortunately, but with the 3% bonus as well).

1 Like

I do not think that the missing report from the long-running tally affects the payout.

I have more questions. So, can these numbers be trusted or not? Because the 2.92 TBm is totally wrong due to the wrong chart. Are those numbers for local display only?


There's a discrepancy for me: the satellite's average usage across all my nodes is less than 50% of the nodes' own used-space reports. Unfortunately, neither source appears to be trustworthy.

3 Likes

Do you have a plan for the next month?

zkSync Lite support has ended, so it's not a problem next month.

"Future payouts will simply choose your preference between zkSync Era and Layer 1"

You should be using an L1 exchange address and immediately selling to fiat anyway :wink:

3 Likes

I endorse this advice unless/until it's possible to move zkSync Era funds to a fiat cashout without triggering nasty Ethereum gas fees.

1 Like