Big Spikes in egress

It’s getting even weirder.

Same node from my post above, less than 2 months old, holds on average 0.5TB of data, yet egress is 1.1TB, so egress is double the data stored.
Considering that the node holds small pieces of data from a large number of customers, it means that all these customers “on average” decided to download their data twice.
From the SNO perspective, the flow of recent news looks like the following:

  • Announcement of a payout reduction
  • Drastic reduction of the egress payout
  • Announcement of test data removal and shutdown of the test satellites, so SNOs get used to data being removed from their nodes
  • Announcement of commercial nodes
  • Abnormal egress traffic (customers moving their data away?)

Assuming Storj is playing fair, how else can this be explained from a technical perspective:

  • a security breach, where someone found a way to download/decrypt the data?
  • a flaw in uplink or other client software that downloads the full bucket due to a bug?
  • Storj onboarded a customer with high egress needs (CDN-like data)?
  • a high-load test from Storj to see the egress capacity of the network?

Instead of saying “this is normal customer behaviour”, it would be nice to see a technical analysis from Storj explaining this enormous egress traffic. Storj has access to data from the satellites as well as billing data and can easily analyze and correlate what’s happening on the network.

Btw, is it possible to extract from a log file how many times each piece was downloaded?
It would be nice to see the distribution: is it 100 pieces being downloaded hundreds of times, or was each piece simply downloaded twice?
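
A minimal bash sketch of how you could get at this, assuming the usual docker log format where each successful customer download is logged as a “downloaded” line with a JSON payload containing the “Piece ID” and “Action” fields (the log path is a placeholder):

# count how many times each piece was downloaded, most-downloaded first
grep downloaded /path/to/node.log | grep '"Action": "GET"' \
  | sed -n 's/.*"Piece ID": "\([^"]*\)".*/\1/p' \
  | sort | uniq -c | sort -rn > piece_counts.txt

# distribution: for each download count N, how many pieces were fetched N times
awk '{print $1}' piece_counts.txt | sort -n | uniq -c

In the second command’s output, the first column is the number of pieces and the second is how often each of them was downloaded, which should show whether a handful of hot pieces dominate or whether everything really was fetched about twice.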

Tests are done only on test satellites, which nowadays means just Saltlake. If the high egress is on us1, then this is no test, just customers.
Storj devs can’t report on everything that is happening on the network; they have better things to do. Of course I believe they take a look at all out-of-the-ordinary movements, but they don’t have to make reports to the community about everything, maybe only when it is a thing of concern. I don’t believe this is it.

3 Likes

Yep, it was US1 again. The other satellites come nowhere close to providing such egress.


Be happy and accept the money it brings.

3 Likes

I wonder if there are enough Lambos for all of us? :thinking:

2 Likes

Audi e-tron GT for me, please :wink:

Probably it is just customer data. I’m fine with that. But this has never occurred before in the last two years, so in fact this is not normal customer behavior.

If this now happens always or regularly on US1 and the other satellites, it’s fine. We’ll have to see if it happens every weekend. Currently the customers seem to favor weekends.

2 Likes

If you ask me, 8.49TB out of 9.28TB.

Just a quick question to Storj experts.

Should my egress traffic from the dashboard match the sum of Size entries for “downloaded” pieces?

In detail (a rough bash sketch of these steps follows the list):

  • Cut the log file to keep only the several days of abnormal egress traffic
  • Parse the log file entries that have the “downloaded” status
  • Extract the Size value from all “downloaded” entries
  • Add them up
  • Convert from bytes to GB
  • Compare with the egress reported on the dashboard over the same period as the log file
  • Should these numbers match?
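
Something like this, assuming the usual docker log format (the log path is a placeholder, and only customer downloads with “Action”: “GET” are counted; add the GET_REPAIR/GET_AUDIT lines if the dashboard figure you compare against includes repair and audit egress):

# sum the Size field of all successful customer downloads and print GB
grep downloaded /path/to/trimmed.log | grep '"Action": "GET"' \
  | sed -n 's/.*"Size": \([0-9]*\).*/\1/p' \
  | awk '{sum += $1} END {printf "%.2f GB\n", sum / 1e9}'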

Thanks.

You may use the calculation on the Payout information page of your dashboard for the current month. This estimation is based on the local stats stored in the database. The information on the graph is what is accounted by the satellites based on sent orders.
They usually should match, but if you see a big discrepancy here you need to check your databases: https://support.storj.io/hc/en-us/articles/360029309111-How-to-fix-a-database-disk-image-is-malformed- and the orders folder - it should not have unsent orders older than 48h, otherwise part of the used egress will not be paid (and will be absent from the graph).
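
A quick way to check for stale unsent orders (the path is a placeholder; adjust it to wherever your node keeps its orders folder):

# list unsent order files older than 48 hours (2880 minutes)
find /path/to/storagenode/orders/unsent -type f -mmin +2880 -ls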

You may also request the egress usage from the storagenode’s API:

((curl http://localhost:14002/api/sno/satellites/ -UseBasicParsing).Content | ConvertFrom-Json).bandwidthDaily.where{$_.intervalStart -eq "2023-08-28T00:00:00Z"}.egress.usage | Measure-Object -Sum

the same for bash:

curl -L http://localhost:14002/api/sno/satellites/ | jq '[.bandwidthDaily[] | select(.intervalStart == "2023-08-28T00:00:00Z") | .egress] | reduce .[]  as $item (0; . + $item.usage)'
2 Likes

1.8TB = $36.55?
That works out to roughly $20 per TB, so are you still calculating with 1 TB = $20?

It’s from the dashboard itself. Probably it is using the old values.

I’m seeing the same spikes on my nodes in Arizona. Very exciting to see this as customers test the scalability of our service.

A key observation from one of our fantastic engineers is that when there is extreme demand for a small set of “hot” files, the additional load on the nodes increases only slightly despite the huge egress spike. It looks like in these scenarios storage nodes serve the pieces from the file buffer in RAM instead of from the disk. With a little extra RAM, storage nodes are not IOPS-bound in the hot-file scenario.

This is huge for network throughput capability.
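
One way to sanity-check this on a Linux node (a rough sketch; fincore ships with util-linux, and the blobs path is only a placeholder for your node’s storage location):

# show how much of a given piece file is currently resident in the page
# cache (RES column); pick a few recently downloaded files from the
# node's blobs directory
fincore /path/to/storagenode/storage/blobs/<piece-file>

If most of a hot piece shows up as resident, it is being served from RAM rather than from the disk.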

Edit to credit @Ruskiem and @SGC for making the same cache observations.

4 Likes

I use an NVMe cache for almost all my nodes, for both write and read.
So this traffic is nothing special.

Just to add an observation from my side. It seems there is actually a small number of files that sees most of the egress. These are my SSD cache stats for an accelerated array that hosts about 45TB of node data.


This isn’t new and doesn’t have anything to do with the recent spikes. I do have a sizable 512GB of SSD cache, but it’s read/write and a lot of it will be used for the many write operations nodes do. It seems egressed data usually falls within a specific 1% of pieces, so the cache works really well to serve those faster. That said, I think this counts IOPS, so it might also be in large part DB requests and other small additional reads the nodes make. I also have some other stuff running on this array, but it pales in comparison to the read/write load from Storj.

1 Like

Interesting, as RAM usage on my nodes is fairly negligible.
Is there any way of increasing the size of said file buffer?

I guess it’s the OS cache that was serving the frequently accessed pieces.

Correct, this is normal for Linux; see the following: Buffer Cache.
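
For reference, a quick way to see how much RAM the kernel is currently devoting to this cache on Linux:

# overall memory picture, including the buff/cache column
free -h

# the same counters straight from the kernel
grep -E '^(Buffers|Cached)' /proc/meminfo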

Thank you, both
I’ll have a read and see if there is any way of increasing it :slight_smile:

EDIT: “Under Linux, you do not need to do anything to make use of the cache, it happens completely automatically.”

There we have it, then :slight_smile:

1 Like

If you are curious, also have a look at Adaptive replacement cache - Wikipedia, which is an improvement over LRU; ZFS uses this mechanism. Here is a more in-depth explanation of its inner workings: Activity of the ZFS ARC, and this is the second layer, L2ARC: ZFS L2ARC
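
If you want to see how much the ARC is actually helping on a running system, a rough sketch (assuming OpenZFS on Linux, which exposes its counters under /proc/spl/kstat/zfs/arcstats; most distributions also ship arc_summary with the ZFS utilities for a friendlier report):

# current ARC size plus lifetime hit/miss counters
awk '$1 == "size" || $1 == "hits" || $1 == "misses" {print $1, $3}' /proc/spl/kstat/zfs/arcstats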

But ultimately, the recommendation is always the same: add more RAM :slight_smile: