Reasoning for orders expiring after 48hrs?

jammerdan · November 12, 2020, 7:58am

The issues around corrupted orders brought it to the surface that orders expire after 48 hrs.
Meaning that a failure to detect issues with sending orders results in a loss for the SNO.

Why is it this way? From what I understand, the orders are just a kind of receipt showing that the node has performed the requested work and can get paid. As I understand it, this means that a customer who has downloaded a file has received the required parts from that node and will get billed for it. Thus, Storj gets paid but refuses to pay the node if the order is older than 48 hrs?

Or maybe I am wrong with that assumption? Why the need to expire the order even though the node has completed its task?

littleskunk · November 12, 2020, 8:32am

Negative. If the storage node is not submitting an order the customer will also not get charged.

Double spend protection. The storage node and the satellite have to make sure that an uplink / storage node is not submitting the same order more than once. This check comes with some tradeoffs. One of them is that orders need to expire after a few hours.
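
A minimal sketch of how expiration and double spend protection fit together, assuming the satellite remembers the serial number of every accepted order. This is not the actual satellite code; `OrderLedger`, `submit`, and `prune` are hypothetical names. The point is that the list of seen serials only has to cover the expiration window, because anything older is rejected anyway.

```python
from datetime import datetime, timedelta, timezone

EXPIRATION = timedelta(hours=48)  # window discussed in this thread


class OrderLedger:
    """Hypothetical satellite-side ledger; not the real Storj implementation."""

    def __init__(self):
        self.seen = {}  # serial number -> order creation time

    def submit(self, serial, created_at, now):
        """Return True if the order is accepted for payment."""
        if now - created_at > EXPIRATION:
            return False  # expired: rejected without consulting the ledger
        if serial in self.seen:
            return False  # same serial submitted twice: double spend attempt
        self.seen[serial] = created_at
        return True

    def prune(self, now):
        """Forget serials that could no longer be submitted anyway."""
        self.seen = {s: t for s, t in self.seen.items() if now - t <= EXPIRATION}
```

A longer expiration time means the ledger has to keep more serials around before `prune` can drop them, which is one way to picture the tradeoff.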

Under normal conditions this shouldn’t matter. The storage node has a backlog of 2 hours of unsubmitted orders. If it goes offline for 3 days it will just lose these 2 hours. The same would happen if we increased the expiration time to, let’s say, a week. The storage node could still lose these 2 hours of unsubmitted orders.
The only issue we get is when bugs occur. Let’s say the storage node is not submitting any orders at all. It took some operators more than a month to notice that. Even if we had an expiration time of, let’s say, 7 days, most of the orders would still expire. For that reason we prefer to fix the bug instead of increasing the expiration time.

Something interesting to read: https://review.dev.storj.io/c/storj/storj/+/1732/10/docs/blueprints/sparse-order-storage.md

jammerdan · November 12, 2020, 9:13am

That’s exactly the problem. If a SNO doesn’t check the logs (or the unsent folders) on a regular basis, the SNO has no way of knowing whether the node really gets paid correctly. In fact, 1 single corrupted order was enough to block sending orders, and thus payment, for months. I see you have changed that with version 1.16.1 but other bugs or corruption can occur. And if that happens, 48 hrs is a really short period before all payment for completed work is lost.

As losing orders seems to be really relevant, maybe it should be tracked and displayed on the dashboard.

So we as SNOs could run a free Tardigrade inside Tardigrade, if we decide to delete all orders?

SGC · November 12, 2020, 1:31pm

might be a good idea to have a failsafe tho, in case orders get blocked for whatever reason…

ofc devising a failsafe can sometimes be quite the challenge lol, still working on some of my own node safety features and monitoring.

i think i have to read your comments again because all i took away from it was fixing double spend issue. xD

SGC · November 12, 2020, 1:36pm

failsafe could be something as benign as having a green light turn red or yellow in the dashboard, because it registers that there is a problem, because the node isn’t getting new orders or whatever…
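
a minimal sketch of such a status light, assuming the node can report the age of its oldest unsent order. `order_health` and the 4 hour warning threshold are hypothetical; only the 48 hour limit comes from this thread.

```python
def order_health(oldest_unsent_age_hours, warn_after=4.0, expire_after=48.0):
    """Map the age of the oldest unsent order to a dashboard colour.

    warn_after is an illustrative value, not a Storj setting.
    """
    if oldest_unsent_age_hours >= expire_after:
        return "red"     # orders are already expiring unpaid
    if oldest_unsent_age_hours >= warn_after:
        return "yellow"  # backlog is older than the usual couple of hours
    return "green"
```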

ofc monitoring comes at performance costs… but still there would be a few critical spots in the storagenodes that will register issues because it will grind to a halt, and tho it might not be able to self repair… it might be able to more easily show when something is wrong in the dashboard.

good idea? (Y/N)

littleskunk · November 12, 2020, 1:37pm

Any time frame would be too short. I don’t think increasing it would help here.

I would recommend using something like grafana and enable email notification for anything that could go wrong.

Kind of yes. If all storage nodes suppress order submission the customer would still get billed for used space. So it wouldn’t be free.

BrightSilence · November 12, 2020, 2:03pm

Why is the node even bothering to do anything with orders older than 48h? It seems that a lot of this issue can be reduced if the node only tries to process orders that haven’t expired yet. If it just ignored files older than 48 hours it would at least process all subsequent orders and just miss one or two hours if corruption occurs.
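
A rough sketch of that idea, assuming each unsent order is available as a `(serial, created_at)` pair. `split_unsent` is a hypothetical helper, not part of the storage node software.

```python
from datetime import datetime, timedelta, timezone

EXPIRATION = timedelta(hours=48)  # window discussed in this thread


def split_unsent(orders, now):
    """Split (serial, created_at) pairs into submittable and expired serials."""
    submittable, expired = [], []
    for serial, created_at in orders:
        if now - created_at > EXPIRATION:
            expired.append(serial)      # too old to be paid, skip it
        else:
            submittable.append(serial)  # still worth sending
    return submittable, expired
```

Only the `submittable` list would be sent, so an old problematic order could no longer hold up newer ones.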

jammerdan · November 12, 2020, 2:05pm

I agree. It sounds silly if a node processes thousands and thousands of orders that are already expired.

littleskunk · November 12, 2020, 2:06pm

Because these orders are unsubmitted. In the first place, the storage node is trying to submit all unsubmitted orders.

BrightSilence · November 12, 2020, 2:07pm

Is the node not aware of the 48h expiration? Because if it is, what is the use of submitting the expired ones?

SGC · November 12, 2020, 2:32pm

so maybe it should submit the new orders first and then work its way towards the bottom + have a sort of timeout so it doesn’t get stuck on orders older than 48hr or whatever… and then simply start over again, so that even if it stacks up a ton of unprocessed orders they will not end up stopping the node’s “critical” operations, which is kinda what i would define payment as…
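
a minimal sketch of the newest-first idea with a time budget, assuming orders are `(created_at, payload)` pairs and `send` is whatever function actually submits one order. all names here are hypothetical and this is not how the node currently works.

```python
import time


def submit_newest_first(orders, send, budget_seconds, clock=time.monotonic):
    """Submit orders newest first until the time budget runs out.

    A failing order is recorded and skipped so it cannot block the rest.
    """
    deadline = clock() + budget_seconds
    sent, failed = [], []
    for created_at, payload in sorted(orders, key=lambda o: o[0], reverse=True):
        if clock() >= deadline:
            break  # out of time: leave the older orders for the next round
        try:
            send(payload)
            sent.append(payload)
        except Exception:
            failed.append(payload)  # e.g. a corrupted order
    return sent, failed
```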

ofc i do also understand that sorting it this way around would have a bit of a skipping CD kinda effect… because if it got overloaded it would just basically abandon what was skipped and then if it had time for it the orders would get submitted, else it would just continue like it didn’t happen… which might not be the best solution either…

ofc it’s easy for us to give suggestions when we don’t fully understand the code and parameters affecting this particular function, i certainly don’t know how it works or should work…

but i know it wouldn’t crash or get stuck, and that there might be simple ways to limit, if not completely avoid similar failures in the future.

think i’m getting one of those feature suggestion vote tingles. xD

Toyoo · November 12, 2020, 7:27pm

Two questions:

Given that the storage node collects statistics on bandwidth, would it also make sense to have a parallel set of statistics that only acknowledge bandwidth after successfully submitting orders? Then it would be easy to, let’s say, access this data in @brightsilence’s script.
Given the impact of unsent orders, maybe the node should shut down/not start if it was unable to send orders for some time?

littleskunk · November 12, 2020, 9:53pm

The storage node should have access to the expiration timestamp.

I don’t think that is the expected behavior. Even if the storage node keeps submitting the expired orders I would not question why the developer implemented it that way. The obvious answer would be that he simply missed one validation. If the storage node is not checking the expiration time then this order is just an unsubmitted order like all the other orders.

So if the storage node is failing to process orders we make sure that his script will also show 0 paid download traffic instead of pointing out the issue? I don’t think we should change the script. For my storage node I love the fact that the payout based on orders is almost a perfect match with storage node bandwidth accounting. I would like to keep that external verification.

If I am not mistaken, that was the fix in v1.16. I didn’t follow the entire issue. I might be wrong on this.

BrightSilence · November 12, 2020, 10:11pm

That makes sense. I guess it also depends on how expiration would be verified. If it uses data inside the orders file, then this requires the file to be not corrupted to begin with. If it instead only bothered processing files based on their creation date, it could skip trying to read the file altogether. Though all of this would of course still be a workaround of the actual issue and might also make it harder to detect when an actual issue occurs.
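
A small sketch of the file-date variant, assuming unsent orders sit as separate files in one folder. It uses the modification time as a stand-in for the creation date, since that is what is portably available. `fresh_order_files` is a hypothetical helper and the folder layout is an assumption.

```python
import os
import time

MAX_AGE_SECONDS = 48 * 3600  # expiration window discussed in this thread


def fresh_order_files(directory, now=None):
    """Return paths of files young enough to be worth opening at all."""
    now = time.time() if now is None else now
    fresh = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            # Only metadata is inspected; the file content is never read,
            # so a corrupted old file cannot break this step.
            if now - entry.stat().st_mtime <= MAX_AGE_SECONDS:
                fresh.append(entry.path)
    return sorted(fresh)
```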

I completely agree with this. That is the entire reason for the existence of the script. Though @Toyoo may have been suggesting adding this as additional info to the script. That would have been easier back when the orders were still inside a db. Having to parse a proprietary file format of a lot of separate files doesn’t really seem worth the effort. Especially since this would just point out one possible issue while there could be many more. And we already have fairly easy ways to check for this specific issue should a difference between the earnings script and payouts pop up.

jammerdan · November 13, 2020, 7:07am

Maybe not shut down, but easy access to the information if there are issues with sending orders is really essential before orders start to expire.

Alexey · November 13, 2020, 7:09am

From the other side - if your node keeps crashing you should notice this at a very early stage

jammerdan · November 13, 2020, 7:12am

You mean if it gets shut down on unsent orders?

Alexey · November 13, 2020, 7:14am

Yes. This is one of the critical jobs of the storagenode, so if it’s cyclically crashing - you definitely should notice that before the orders expire.
This is not a solution, but it could help to fix it before it is too late.

jammerdan · November 13, 2020, 7:19am

But it would hurt the uptime score and potentially could lead to disqualification which is absurd for a node that provides its service flawlessly for free.

I don’t want the node to shut down. In case the SNO is not available shutting down the node makes things worse.
And if it is a bug in the node software, the SNO is left with a node that keeps shutting down over and over again.

This does not sound too good either.

Alexey · November 13, 2020, 7:23am

I have to agree, but the other alternative is to lose bandwidth payments and give clients free bandwidth from your node until the bug is fixed.