Follow up on When will "Uncollected Garbage" be deleted? - #89 by BrightSilence
“Incorrect Uncollected Garbage” is a major pain point for SNOs. The gist of it is the slow/incomplete calculation of a storage node’s usage for a given period, and I don’t think any generic database has what it takes to do this task efficiently.
I thought of an algorithm to calculate the sum of storage for a node over a range of time; it can answer correctly how much data is on a node, at light speed. Here is how it works:
Assume we want to query: how much storage did node A have between Aug 4 2019 (2019-08-04 09:44:04) and Aug 4 2024 (2024-08-04 09:10:05)?
We need 2 engines:
1. A time-splitting engine, which splits that time range into pieces like the following (a sketch of this step follows after the list):
- from 09:44:04 Aug 4 2019 until 00:00 Aug 5 2019 # granular edge; granularity discussed below
- from 00:00 Aug 5 2019 until 00:00 Aug 6 2019 # daily
- from 00:00 Aug 6 2019 until 00:00 Aug 7 2019 # daily
- …
- from 00:00 Aug 31 2019 until 00:00 Sep 1 2019 # daily
- from 00:00 Sep 1 2019 until 00:00 Oct 1 2019 # monthly
- from 00:00 Oct 1 2019 until 00:00 Nov 1 2019 # monthly
- from 00:00 Nov 1 2019 until 00:00 Dec 1 2019 # monthly
- from 00:00 Dec 1 2019 until 00:00 Jan 1 2020 # monthly
- from 00:00 Jan 1 2020 until 00:00 Jan 1 2021 # yearly
- from 00:00 Jan 1 2021 until 00:00 Jan 1 2022 # yearly
- from 00:00 Jan 1 2022 until 00:00 Jan 1 2023 # yearly
- from 00:00 Jan 1 2023 until 00:00 Jan 1 2024 # yearly
- from 00:00 Jan 1 2024 until 00:00 Feb 1 2024 # monthly
- from 00:00 Feb 1 2024 until 00:00 Mar 1 2024 # monthly
- from 00:00 Mar 1 2024 until 00:00 Apr 1 2024 # monthly
- …
- from 00:00 Jul 1 2024 until 00:00 Aug 1 2024 # monthly
- from 00:00 Aug 1 2024 until 00:00 Aug 2 2024 # daily
- from 00:00 Aug 2 2024 until 00:00 Aug 3 2024 # daily
- from 00:00 Aug 3 2024 until 00:00 Aug 4 2024 # daily
- from 00:00 Aug 4 2024 until 09:10:05 Aug 4 2024 # granular edge; granularity discussed below
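Here is a minimal sketch in Go of what this time-splitting step could look like, assuming UTC timestamps and a greedy "largest aligned bucket that fits" strategy. The `Bucket` type and `splitRange` function are names I made up for illustration; this is not existing Storj code.

```go
package main

import (
	"fmt"
	"time"
)

// Bucket is one calendar-aligned piece of the query interval, at the
// coarsest granularity that fits inside [Start, End).
type Bucket struct {
	Start, End time.Time
	Level      string // "year", "month", "day", "hour", or "raw"
}

// splitRange greedily covers [start, end) with the largest aligned buckets
// possible: whole years first, then months, days, hours, and finally partial
// ("raw") remainders at the edges. Times are assumed to be UTC.
func splitRange(start, end time.Time) []Bucket {
	var out []Bucket
	cur := start
	for cur.Before(end) {
		switch {
		case alignedToYear(cur) && !cur.AddDate(1, 0, 0).After(end):
			next := cur.AddDate(1, 0, 0)
			out = append(out, Bucket{cur, next, "year"})
			cur = next
		case alignedToMonth(cur) && !cur.AddDate(0, 1, 0).After(end):
			next := cur.AddDate(0, 1, 0)
			out = append(out, Bucket{cur, next, "month"})
			cur = next
		case alignedToDay(cur) && !cur.AddDate(0, 0, 1).After(end):
			next := cur.AddDate(0, 0, 1)
			out = append(out, Bucket{cur, next, "day"})
			cur = next
		case alignedToHour(cur) && !cur.Add(time.Hour).After(end):
			next := cur.Add(time.Hour)
			out = append(out, Bucket{cur, next, "hour"})
			cur = next
		default:
			// Sub-hour edge: advance to the next hour boundary or to the end,
			// whichever comes first. This piece must be read from raw data.
			next := cur.Truncate(time.Hour).Add(time.Hour)
			if next.After(end) {
				next = end
			}
			out = append(out, Bucket{cur, next, "raw"})
			cur = next
		}
	}
	return out
}

func alignedToHour(t time.Time) bool  { return t.Equal(t.Truncate(time.Hour)) }
func alignedToDay(t time.Time) bool   { return alignedToHour(t) && t.Hour() == 0 }
func alignedToMonth(t time.Time) bool { return alignedToDay(t) && t.Day() == 1 }
func alignedToYear(t time.Time) bool  { return alignedToMonth(t) && t.Month() == time.January }

func main() {
	// The example range from above.
	start := time.Date(2019, 8, 4, 9, 44, 4, 0, time.UTC)
	end := time.Date(2024, 8, 4, 9, 10, 5, 0, time.UTC)
	for _, b := range splitRange(start, end) {
		fmt.Printf("%-5s %s -> %s\n", b.Level, b.Start.Format(time.RFC3339), b.End.Format(time.RFC3339))
	}
}
```

For the example range this produces the same kind of decomposition as the list above, just with the hourly pieces at the two edges spelled out.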
2. A summation engine for the time ranges above:
No further comment, as it is pretty self-explanatory.
The trick here is the space-time tradeoff when we precompute those values.
Another example: we want to report how much storage a node used in a given month; that is pretty much instant because we already precomputed that value, an O(1) read operation.
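To make that precomputed read concrete, here is a sketch that continues the `splitRange` example above (same package). The `RollupStore` name, the map-per-granularity layout, and byte-hours as the unit are my own assumptions; in a real deployment these would be database tables keyed by node ID and bucket start.

```go
// RollupStore holds one precomputed sum per bucket start, per granularity,
// for a single node (in practice: indexed database tables, not maps).
type RollupStore struct {
	Yearly, Monthly, Daily, Hourly map[time.Time]float64 // byte-hours
}

// Usage answers "how much storage did this node hold over [start, end)"
// with one precomputed read per bucket produced by splitRange.
func (s *RollupStore) Usage(start, end time.Time) float64 {
	total := 0.0
	for _, b := range splitRange(start, end) {
		switch b.Level {
		case "year":
			total += s.Yearly[b.Start]
		case "month":
			total += s.Monthly[b.Start]
		case "day":
			total += s.Daily[b.Start]
		case "hour":
			total += s.Hourly[b.Start]
		default:
			// "raw" sub-hour edges would fall back to scanning the underlying
			// samples; omitted in this sketch.
		}
	}
	return total
}
```

For the whole-month report in particular, splitRange yields exactly one monthly bucket, so the answer is a single read.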
We have to decide how granular we want to get versus how much extra storage we can tolerate (I assume hourly is good enough here?). It looks like an hour → day → month → year hierarchy of precomputed sums.
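As a back-of-the-envelope sketch of the storage side of that tradeoff, with my own illustrative numbers (hourly as the finest precomputed level, roughly five years of history):

```go
package main

import "fmt"

// Rough count of precomputed rows needed per node over five years,
// assuming hourly is the finest level kept. Illustrative numbers only.
func main() {
	years := 5
	hourly := years * 365 * 24 // 43,800
	daily := years * 365       // 1,825
	monthly := years * 12      // 60
	yearly := years            // 5
	fmt.Println("precomputed rows per node:", hourly+daily+monthly+yearly) // 45,690
}
```

The finest level dominates, so the choice of the smallest bucket is what really decides the extra storage cost.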
If some raw input data turns out to be wrong and we want to update it, simply open a transaction that also updates the upper hour, day, month and year buckets as needed.
P/s: edited to make this section clearer with an example: an original data point was 13 and it becomes 42, so the delta is +29; just add +29 to the hour/day/month/year buckets above it, in one transaction.
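A sketch of what that correction transaction could look like; the table names (`rollup_hourly` and friends), the `byte_hours` column, and the Postgres-style placeholders are assumptions for illustration, not an actual Storj schema:

```go
package rollup

import (
	"context"
	"database/sql"
	"time"
)

// applyCorrection propagates the delta from one corrected hourly value up
// through the daily, monthly and yearly rollups in a single transaction,
// so every level stays consistent. Schema and SQL dialect are hypothetical.
func applyCorrection(ctx context.Context, db *sql.DB, nodeID string, hour time.Time, oldVal, newVal float64) error {
	delta := newVal - oldVal // e.g. the data point was 13 and becomes 42: delta = +29

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	// Bucket starts for each level above the corrected hour (UTC assumed).
	targets := []struct {
		table string
		start time.Time
	}{
		{"rollup_hourly", hour.Truncate(time.Hour)},
		{"rollup_daily", time.Date(hour.Year(), hour.Month(), hour.Day(), 0, 0, 0, 0, time.UTC)},
		{"rollup_monthly", time.Date(hour.Year(), hour.Month(), 1, 0, 0, 0, 0, time.UTC)},
		{"rollup_yearly", time.Date(hour.Year(), 1, 1, 0, 0, 0, 0, time.UTC)},
	}
	for _, t := range targets {
		if _, err := tx.ExecContext(ctx,
			"UPDATE "+t.table+" SET byte_hours = byte_hours + $1 WHERE node_id = $2 AND bucket_start = $3",
			delta, nodeID, t.start); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```

Because all four levels change in the same transaction, a report read at any granularity stays consistent with the corrected raw value.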
Another issue I currently see is the shift toward using Google Spanner (as the database). StorJ is an open-source project, but by using Spanner it pretty much ties itself to GCP with no escape. If either Google or StorJ no longer exists, the community cannot inherit much from it. Would StorJ reconsider it…
P/s: never mind about CockroachDB - CockroachDB License Change | Hacker News