Used bandwidth from /api/sno unexpectedly sometimes shows lower values in subsequent calls

jos · December 27, 2024, 5:57am

I’m using the storage node API /api/sno/ to collect some stats across multiple storage nodes (running on the same host in different docker containers). I do this by calling the API every 5 mins, and for a metric of “transferred bytes” I use the bandwidth.used field. I’ve noticed something weird: sometimes (like a few times monthly), the bandwidth.used field can be lower in a subsequent call for the same storage node.

Example from my app logs:

{
  "last_reported_value": 319997,
  "level": "WARN",
  "loop": "storj",
  "message": "Reported value is less than last reported value.",
  "reported_value": 319731,
  "time": "2024-12-10T01:09:53.401202503Z"
}

The value is the bandwidth.used field but in megabytes instead of bytes. Just wondering if this is expected and I should design around it, or if it is to be considered a bug?

Alexey · December 27, 2024, 6:34am

The value can fluctuate because we submit information to the databases in batches, so it should be settled after at least a hour. Also I think some usage could be lost between restarts.
By the way, why do you expect it to be equal or growing? I would assume that’s a date based, so you may have days with lower usage, why this should be a problem?

P.S. this doesn’t affect payouts, because it’s based not on the node’s stat, but on submitted orders to the satellites. Unfortunately the actual usage you can see only after the satellites will send receipts, which will happen after the payout is done.

jos · December 27, 2024, 5:40pm

Thanks for the reply!

By the way, why do you expect it to be equal or growing? I would assume that’s a date based, so you may have days with lower usage, why this should be a problem?

I expect it to sometimes reset, but not go from 319997 to 319731 and then back up again to say 320000. Basically, I call the API every 5 minutes, and use the data like this (made up example):

08:53:00: 0 -> 2000 = 2000 MBs in hour 8
08:58:00: 2000 -> 2400 = 2400 MBs in hour 8
09:03:00: 2400 -> 2600 = 200 MBs in hour 9
09:08:00: 2600 -> 2700 = 300 MBs in hour 9
09:13:00: 2700 -> 150 = 450 MBs in hour 9 (counter reset, due to date or whatever like you mentioned)
09:18:00: 150 -> 300 = 600 MBs in hour 9

Basically I only use the delta between two reported values each time, and by doing this I can get the transferred bytes with the resolution I prefer (by hour in my case). The problem is when the reset is not an actual reset (meaning it doesn’t continue counting from its previous value). This is what I observed:

09:03:00: 2400 -> 2600
09:08:00: 2600 -> 2450
09:13:00: 2450 -> 2650

So it is like it during one of those 5 min interval calls gets a temporarily lower value, but it did not reset the counter. Could this be explained by what you mentioned, updates in batches to the DB?

Alexey · December 28, 2024, 4:59am

Perhaps, but note that you calculated the delta between two usages in time, they are not cumulative. So I do not know what do you want to get from such weird delta? That the traffic is fluctuating? I think you already know it.

To have a proper delta you need to sum the bandwidth usage since the beginning of the month until the point in the past then do the same for the point in the present. The difference between them would be your correct delta - how much more bandwidth have been used.
It may be possible that your calculated delta between moment usages could be close to the difference between accumulated usages for two points in time, but it will fluctuate - you may not have a constant traffic.

jos · December 28, 2024, 6:25am

Hmm, maybe I completely misunderstood what that value is then. When I query /api/sno/ from one of my storage nodes and get a bandwidth.used value of say 500 MB, what does that value represent? Doesn’t it mean that it has used 500 MBs of bandwidth since some time (whether that’s beginning of the month or beginning of the day or whatever)?

If it does, that means that at time t I know the bandwidth used was 500 MB. If I then query it at t+1 and get a value of 525 MB, I know that between t and t+1, an additional 25 MB was used. That is all I’m doing.

Alexey · December 28, 2024, 7:17am

Let me check, maybe it’s me who is wrong

Yes, I was wrong. This value representing the current bandwidth usage. However, seems it’s not rock solid if you requests between update periods.
Do you have fluctuations on 10m interval too?

MarviBiene · December 28, 2024, 8:59am

Hello,

i checked my Database for such errors and its not common in my Database. I save this value every 5 minuts. here are some results i found (Including month resets):
First node:

ID: 5326, Timestamp: 2024-11-01 01:00:02, Current: 0, Previous: 794582967584, Difference: -794582967584
ID: 8770, Timestamp: 2024-11-07 00:30:01, Current: 164659393600, Previous: 165471279404, Difference: -811885804
ID: 20846, Timestamp: 2024-11-28 02:45:01, Current: 750768550960, Previous: 750932443440, Difference: -163892480
ID: 22532, Timestamp: 2024-12-01 01:00:01, Current: 0, Previous: 814300648104, Difference: -814300648104

second node:

ID: 5326, Timestamp: 2024-11-01 01:00:02, Current: 0, Previous: 715762232608, Difference: -715762232608
ID: 7732, Timestamp: 2024-11-05 05:15:02, Current: 121307958736, Previous: 121533573840, Difference: -225615104
ID: 20124, Timestamp: 2024-11-26 19:30:01, Current: 560727691224, Previous: 560842233560, Difference: -114542336
ID: 22532, Timestamp: 2024-12-01 01:00:01, Current: 0, Previous: 634334670524, Difference: -634334670524
ID: 30974, Timestamp: 2024-12-15 16:50:01, Current: 272652553600, Previous: 272744327552, Difference: -91773952

And a node on an unstable system:

ID: 1652, Timestamp: 2024-10-25 16:45:04, Current: 588100340128, Previous: 588349930656, Difference: -249590528
ID: 3520, Timestamp: 2024-10-28 22:30:02, Current: 652102976716, Previous: 652629632992, Difference: -526656276
ID: 5298, Timestamp: 2024-11-01 01:00:02, Current: 0, Previous: 719127361172, Difference: -719127361172
ID: 6448, Timestamp: 2024-11-03 01:05:02, Current: 44395092160, Previous: 44889044416, Difference: -493952256
ID: 11236, Timestamp: 2024-11-11 09:15:02, Current: 226311726944, Previous: 226322358112, Difference: -10631168
ID: 11270, Timestamp: 2024-11-11 10:40:02, Current: 226370197364, Previous: 226400032904, Difference: -29835540
ID: 12366, Timestamp: 2024-11-13 08:25:01, Current: 260255478756, Previous: 260710293496, Difference: -454814740
ID: 12470, Timestamp: 2024-11-13 17:25:07, Current: 262756354612, Previous: 262873144392, Difference: -116789780
ID: 22286, Timestamp: 2024-12-01 01:00:01, Current: 0, Previous: 583474533384, Difference: -583474533384
ID: 30702, Timestamp: 2024-12-15 16:55:02, Current: 275014223724, Previous: 275338056576, Difference: -323832852
ID: 32556, Timestamp: 2024-12-18 22:30:02, Current: 323800089712, Previous: 324040682864, Difference: -240593152

Hope this helps

Edit: As you can see this is all data from the last 2 1/2 months

Julio · December 28, 2024, 9:27am

The more statistical differentials are at the very end of every month, and start of the next month. It’s a reckoning between satellites and node data points, and even less reliable.
Nothing wrong with your delta approach, save for the fact you may need to interpolate the unreliable reporting data to correct for it’s inconsistency.

2 cents,

Julio

jos · December 28, 2024, 11:20pm

@Alexey: Thanks for looking into it and confirming! I’ll change the interval and see if I see the same issue. I think 15min or so should be enough to get somewhat interesting stats on an almost per hour basis. Is it worth reporting this somewhere else as a bug, or is this by design/expected you think?

@MarviBiene: Thanks for checking; this is in-line with what I’m seeing. It’s rare, but it happens and due to my simple logic now it skews the statistics when it does.

@Julio: Yeah, I’ll make the logic a bit more complex if changing to 15 mins does not get rid of it completely. Thanks for chiming in!

Alexey · December 29, 2024, 4:52am

Of course you may report it as a bug on our GitHub repository. Interesting that seems it happens for months now and nobody reported it.

I also have a suspicion that some data become not settled to a database.
Do you have “database is locked” errors in your logs?