Saltlake used space calculated incorrectly on the dashboard, which can affect payout calculation... again

via @zeebo:

I’m sorry about all the confusion about payments. Getting this right is very important to building a healthy relationship with node operators, so I’m going to try my best to explain what’s going on and how we’re working to fix it. I’ll try to explain the problem we’re trying to solve, the current architecture we’re using to solve it, the problems with the implementation, our current workarounds for those problems, and finally the things we’re doing in the future to fix the architecture.

To start, the problem we’re trying to solve is compensating storage node operators for the amount of space and bandwidth their nodes are providing, in a way that doesn’t let them cheat and get paid extra, while still incentivizing them to stay online to maximize their earnings. Because a single bad actor could sign up and exploit any hole in this accounting, we can’t do the obvious and simple thing of just having the nodes report the amounts they earned and trusting them.

Next, I’ll describe how we’re solving the problem. Storage accounting and bandwidth accounting are two separate systems.

I’ll start with storage accounting. Each satellite contains a large database with an entry for each file stored in the network. That entry records which storage nodes are storing the data and how much data is on each node. Since we pay for byte*hours of data stored on the network, we periodically walk the entire database to get an estimate of how much storage is in use on each node at that instant, and then multiply that by the number of hours since the last full walk.

Bandwidth accounting happens during an upload or download operation initiated by an uplink. The satellite returns a list of nodes for the operation along with signed order limits. Each order limit specifies which node it is for, the type of operation, and so on. The uplink then signs the amount of bandwidth it uses against the order limit and provides that to the nodes. Eventually, the nodes submit these order limits back to the satellite. A process then batches up these order limits, removes any that have already been processed before, and increments counters for how much bandwidth each node has used.
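To make the storage-accounting half above concrete, here is a minimal sketch of how a tally walk turns into byte*hours. This is illustrative only; every type, name, and number here is made up and this is not the satellite’s actual code.

```go
package main

import (
	"fmt"
	"time"
)

// NodeID and the tally map are hypothetical types for illustration only.
type NodeID string

// tallyWalk stands in for walking every file entry in the satellite's
// database and summing how many bytes each node is currently storing.
func tallyWalk() map[NodeID]int64 {
	// In reality this is a long, paginated scan over the whole database.
	return map[NodeID]int64{
		"node-a": 2_000_000_000, // 2 GB currently stored
		"node-b": 500_000_000,   // 0.5 GB currently stored
	}
}

// recordByteHours converts an instantaneous tally into byte*hours by
// multiplying by the time elapsed since the previous completed walk.
func recordByteHours(tally map[NodeID]int64, lastWalk, now time.Time) map[NodeID]float64 {
	hours := now.Sub(lastWalk).Hours()
	out := make(map[NodeID]float64, len(tally))
	for id, bytes := range tally {
		out[id] = float64(bytes) * hours
	}
	return out
}

func main() {
	lastWalk := time.Now().Add(-30 * time.Hour) // the previous walk finished 30 hours ago
	byteHours := recordByteHours(tallyWalk(), lastWalk, time.Now())
	fmt.Println(byteHours) // node-a ≈ 6.0e10 byte*hours (60 GB*hours), node-b ≈ 1.5e10
}
```

The important property is that a slow walk doesn’t lose any byte*hours; it only changes which data point they land on.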

Next, the problems with our implementation. Fundamentally, they both come down to scaling the database to be fast enough. For example, if it takes over 24 hours to look at every file in the database, there won’t be any entries for how many byte*hours were used that day. This explains the spikiness in those graphs: how long the walk takes, and when it happens, determines how many byte*hours land in each data point, but it is correct on average because we multiply by the time since the last walk rather than expecting the walk to happen on a tight schedule. As for bandwidth, if we cannot consume the queue faster than entries are being added to it, we’ll have an ever-growing backlog of entries. That’s what has been happening.
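To put made-up numbers on the spikiness: if a node holds a steady 1 TB and one walk completes 20 hours after the previous one, that data point records 20 TB*hours; if the next walk takes 30 hours, the next point records 30 TB*hours. The daily graph jumps around, but over a 30-day month the node is still credited the same 720 TB*hours either way.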

Our current workarounds involve tuning the database and the queries used in these systems to be as fast as possible, tuning the sizes of the batches we operate on, and so on. Some of that tuning takes longer to perform, such as changing where and how the database is hosted. Because changing the hosting is a large, risky thing to do, and because the submitted orders will be paid out eventually as long as we don’t lose them, we waited to do it until less risky measures had been explored. We’ve recently switched how the database is hosted for the Saltlake satellite, and with some final tuning to the query and batch sizes, I expect we’ll be able to get through the (substantial) backlog quickly.

As for why you see no bandwidth instead of a slow trickle of it: that is due to an unforeseen detail of the way the orders are consumed. Because we may eventually use CockroachDB for this database, the schema was designed with CockroachDB in mind to aid in that transition. Specifically, because CockroachDB is built on RocksDB, it has a feature called “prefix compression”: if the rows’ primary keys often share a prefix, you get space savings. And so the schema places the node id at the start of the primary key so that keys are compressed “for free”. Unfortunately, this has the side effect that nodes with lexicographically earlier ids are processed first, and if the queue is growing faster than we can consume it, they end up being the only ones processed.
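Here is a toy illustration of that starvation effect. The names and structure are invented for illustration and are not the real schema or query; the point is only that a consumer reading fixed-size batches in primary-key order keeps returning to the smallest node ids whenever the queue grows faster than it drains.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingOrder models a queue entry keyed by (nodeID, serial); the real
// table's primary key starts with the node id, so iteration groups by node.
type pendingOrder struct {
	nodeID string
	serial int
}

// nextBatch models a consumer that always reads the first batchSize rows
// in primary-key order.
func nextBatch(queue []pendingOrder, batchSize int) []pendingOrder {
	sort.Slice(queue, func(i, j int) bool {
		if queue[i].nodeID != queue[j].nodeID {
			return queue[i].nodeID < queue[j].nodeID
		}
		return queue[i].serial < queue[j].serial
	})
	if len(queue) < batchSize {
		batchSize = len(queue)
	}
	return queue[:batchSize]
}

func main() {
	// If nodes with small ids keep submitting orders faster than the batch
	// size can drain them, nodes with larger ids never make it into a batch.
	queue := []pendingOrder{
		{"zzzz-node", 1}, {"aaaa-node", 1}, {"aaaa-node", 2}, {"aaaa-node", 3},
	}
	fmt.Println(nextBatch(queue, 2)) // only aaaa-node entries get processed
}
```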

Finally, as for what we’re doing to fix this in the future: we’re designing and building a way of avoiding double spends by keeping track of which orders have been spent inside a Sparse Merkle Tree. The blueprint is at https://review.dev.storj.io/c/storj/storj/+/1732. This will let us avoid doing more than a single read and write to the database when processing a batch of orders, and it will let us avoid storing more than a single hash to prevent double spends. Currently, handling orders accounts for about 99% of our database’s size (and probably its IOPS), so this is expected to be a massive win and will prevent issues like this from ever happening again. Additionally, I’m going to be adding metrics into the system so that we can easily make sure we’re making speedy progress in these critical systems.
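At a very high level, and purely as a conceptual sketch (the real design is the one in the linked blueprint; SparseMerkleTree, rootStore, and settleBatch below are invented names, not actual Storj APIs), the point is that settling a whole batch of orders needs only one read and one write against the database:

```go
package settle

import "errors"

type SerialNumber [16]byte

// SparseMerkleTree stands in for a set of spent serial numbers whose entire
// contents are committed to by a single root hash.
type SparseMerkleTree interface {
	Root() [32]byte
	Contains(SerialNumber) bool
	Insert(SerialNumber)
}

// rootStore is the only database state needed: one root hash.
type rootStore interface {
	LoadRoot() ([32]byte, error)
	SaveRoot(oldRoot, newRoot [32]byte) error
}

// settleBatch checks every order in the batch against the tree, skips
// duplicates, and persists only the updated root hash.
func settleBatch(db rootStore, tree SparseMerkleTree, batch []SerialNumber) error {
	oldRoot, err := db.LoadRoot() // the single read
	if err != nil {
		return err
	}
	if oldRoot != tree.Root() {
		return errors.New("in-memory tree is stale; reload it before settling")
	}
	for _, serial := range batch {
		if tree.Contains(serial) {
			continue // this order was already settled once; drop the duplicate
		}
		tree.Insert(serial)
		// ...credit the node for this order's bandwidth in memory here...
	}
	// The single write: the new root hash commits to every serial ever accepted.
	return db.SaveRoot(oldRoot, tree.Root())
}
```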

I hope the explanation above provides more context for the issue and helps regain some of the trust that has been lost. I’m working over this weekend to get all of the backlogged orders processed for this next billing cycle, and I will do whatever it takes to make sure that happens.

Thanks.
