Big Spikes in egress

It’s getting even weirder.

Same node from my post above, less than 2 months old, holds on average 0.5TB of data, yet egress is 1.1TB, so egress is double the data stored.
Considering that the node holds small pieces of data from a large number of customers, it means that all these customers “on average” decided to download their data twice.
From the SNO perspective, the flow of recent news looks like the following:

  • Announcement of a payout reduction
  • Drastic reduction of the egress payout
  • Announcement of test data removal and shutdown of the test satellites, so SNOs get used to data being removed from their nodes
  • Announcement of commercial nodes
  • Abnormal egress traffic (customers moving their data away?)

Assuming Storj is playing fair, how else can this be explained from a technical perspective:

  • a security breach, where someone found a way to download/decrypt the data?
  • a flaw in uplink or other client software that downloads the full bucket due to a bug?
  • Storj onboarded a customer with high egress needs (CDN-like data)?
  • a high-load test from Storj to see the egress capacity of the network?

Instead of saying “this is normal customer behaviour”, it would be nice to see a technical analysis from Storj explaining this enormous egress traffic. Storj has access to data from the satellites as well as billing data and can easily analyze and correlate what’s happening on the network.

Btw, is it possible to extract from a log file how many times each piece was downloaded?
It would be nice to see the distribution: is it 100 pieces being downloaded hundreds of times, or was each piece simply downloaded twice?
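
A minimal bash sketch of how you could get at this, assuming the usual docker log format where each successful customer download is logged as a “downloaded” line with a JSON payload containing the “Piece ID” and “Action” fields (the log path is a placeholder):

# count how many times each piece was downloaded, most-downloaded first
grep downloaded /path/to/node.log | grep '"Action": "GET"' \
  | sed -n 's/.*"Piece ID": "\([^"]*\)".*/\1/p' \
  | sort | uniq -c | sort -rn > piece_counts.txt

# distribution: for each download count N, how many pieces were fetched N times
awk '{print $1}' piece_counts.txt | sort -n | uniq -c

In the second command’s output, the first column is the number of pieces and the second is how often each of them was downloaded, which should show whether a handful of hot pieces dominate or whether everything really was fetched about twice.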

Tests are done only on test satellites, which nowadays means just Saltlake. If the high egress is on us1, then this is no test, just customers.
Storj devs can’t report on everything that is happening on the network; they have better things to do. Of course I believe they take a look at all out-of-the-ordinary movements, but they don’t have to make reports to the community about everything, maybe only when it is a thing of concern. I don’t believe this is it.

3 Likes

Yep, it was US1 again. The other satellites come nowhere close to providing such egress.


Be happy and accept the money it brings.

3 Likes

I wonder if there are enough Lambos for all of us? :thinking:

2 Likes

Audi e-tron GT for me, please :wink:

Probably it is just customer data. I’m fine with that. But this has never occurred before in the last two years, so in fact this is not normal customer behavior.

If this now happens always or regularly on US1 and the other satellites, it’s fine. We’ll have to see if it happens every weekend. Currently the customers seem to favor weekends.

2 Likes

If you ask me, 8.49TB out of 9.28TB.

Just a quick question to Storj experts.

Should my egress traffic from the dashboard match the sum of Size entries for “downloaded” pieces?

In detail (a rough bash sketch of these steps follows the list):

  • Cut the log file to keep only the several days of abnormal egress traffic
  • Parse the log file entries that have the “downloaded” status
  • Extract the Size value from all “downloaded” entries
  • Add them up
  • Convert from bytes to GB
  • Compare with the egress reported on the dashboard over the same period as the log file
  • Should these numbers match?
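
Something like this, assuming the usual docker log format (the log path is a placeholder, and only customer downloads with “Action”: “GET” are counted; add the GET_REPAIR/GET_AUDIT lines if the dashboard figure you compare against includes repair and audit egress):

# sum the Size field of all successful customer downloads and print GB
grep downloaded /path/to/trimmed.log | grep '"Action": "GET"' \
  | sed -n 's/.*"Size": \([0-9]*\).*/\1/p' \
  | awk '{sum += $1} END {printf "%.2f GB\n", sum / 1e9}'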

Thanks.

You may use the calculation on the Payout information page of your dashboard for the current month. This estimation is based on the local stats stored in the database. The information on the graph is what is accounted by the satellites based on sent orders.
They usually should match, but if you see a big discrepancy here you need to check your databases: https://support.storj.io/hc/en-us/articles/360029309111-How-to-fix-a-database-disk-image-is-malformed- and the orders folder - it should not have unsent orders older than 48h, otherwise part of the used egress will not be paid (and will be absent from the graph).
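
A quick way to check for stale unsent orders (the path is a placeholder; adjust it to wherever your node keeps its orders folder):

# list unsent order files older than 48 hours (2880 minutes)
find /path/to/storagenode/orders/unsent -type f -mmin +2880 -ls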

You may also request the egress usage from the storagenode’s API:

((curl http://localhost:14002/api/sno/satellites/ -UseBasicParsing).Content | ConvertFrom-Json).bandwidthDaily.where{$_.intervalStart -eq "2023-08-28T00:00:00Z"}.egress.usage | Measure-Object -Sum

the same for bash:

curl -L http://localhost:14002/api/sno/satellites/ | jq '[.bandwidthDaily[] | select(.intervalStart == "2023-08-28T00:00:00Z") | .egress] | reduce .[]  as $item (0; . + $item.usage)'
2 Likes

1.8TB = $36.55?
That works out to roughly $20 per TB, so are you still calculating with 1 TB = $20?

It’s from the dashboard itself. Probably it is using the old values.

I’m seeing the same spikes on my nodes in Arizona. Very exciting to see this as customers test the scalability of our service.

A key observation from one of our fantastic engineers is that when there is extreme demand for a small set of “hot” files, the additional load on the nodes increases only slightly despite the huge egress spike. It looks like in these scenarios storage nodes serve the pieces from the file buffer in RAM instead of from the disk. With a little extra RAM, storage nodes are not IOPS-bound in the hot-file scenario.

This is huge for network throughput capability.
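
One way to sanity-check this on a Linux node (a rough sketch; fincore ships with util-linux, and the blobs path is only a placeholder for your node’s storage location):

# show how much of a given piece file is currently resident in the page
# cache (RES column); pick a few recently downloaded files from the
# node's blobs directory
fincore /path/to/storagenode/storage/blobs/<piece-file>

If most of a hot piece shows up as resident, it is being served from RAM rather than from the disk.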

Edit to credit @Ruskiem and @SGC for making the same cache observations.

4 Likes

I use an NVMe cache for almost all my nodes, for both write and read.
So this traffic is nothing special.

Just to add an observation from my side. It seems there is actually a small number of files that sees most of the egress. These are my SSD cache stats for an accelerated array that hosts about 45TB of node data.


This isn’t new and doesn’t have anything to do with the recent spikes. I do have a sizable 512GB of SSD cache, but it’s read/write and a lot of it will be used for the many write operations nodes do. It seems egressed data usually falls within a specific 1% of pieces, so the cache works really well to serve those faster. That said, I think this counts IOPS, so it might also be in large part DB requests and other small additional reads the nodes make. I also have some other stuff running on this array, but it pales in comparison to the read/write load from Storj.

1 Like

Interesting, as RAM usage on my nodes is fairly negligible.
Is there any way of increasing the size of said file buffer?

I guess it’s the OS cache that was serving the frequently accessed pieces.

Correct, this is normal for Linux; see the following: Buffer Cache.
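
For reference, a quick way to see how much RAM the kernel is currently devoting to this cache on Linux:

# overall memory picture, including the buff/cache column
free -h

# the same counters straight from the kernel
grep -E '^(Buffers|Cached)' /proc/meminfo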

Thank you, both
I’ll have a read and see if there is any way of increasing it :slight_smile:

EDIT: “Under Linux, you do not need to do anything to make use of the cache, it happens completely automatically.”

There we have it, then :slight_smile:

1 Like

If you are curious, also have a look at Adaptive replacement cache - Wikipedia, which is an improvement over LRU; ZFS uses this mechanism. Here is a more in-depth explanation of its inner workings: Activity of the ZFS ARC, and this is the second layer, L2ARC: ZFS L2ARC
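
If you want to see how much the ARC is actually helping on a running system, a rough sketch (assuming OpenZFS on Linux, which exposes its counters under /proc/spl/kstat/zfs/arcstats; most distributions also ship arc_summary with the ZFS utilities for a friendlier report):

# current ARC size plus lifetime hit/miss counters
awk '$1 == "size" || $1 == "hits" || $1 == "misses" {print $1, $3}' /proc/spl/kstat/zfs/arcstats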

But ultimately, the recommendation is always the same: add more RAM :slight_smile: