Hrmm, that’s weird, I have none of those for either repair up or down.
You get 3 projects of 50GB each for both storage and egress. But if you still have your old account, you may have had the 1TB project; I still have one of those. Keep in mind though that the free limit for the account as a whole is still 150GB for both storage and bandwidth. But that should be plenty to test with.
There used to be a page where we could check stats from similar tests Storj did themselves, but the last time I saw it was before the production launch. I have no idea whether those stats still exist somewhere.
I’m not entirely sure what’s going on. I guess it’s possible there are two separate explanations for why repair and normal transfers were slow on your node at that time, but that seems unlikely.
I don’t think the main issue is that the uploads were slow; my node was overloaded, so it was slow. The main issue is that somebody appears to have waited 10 minutes or longer for the upload to complete, and my node was “one of the faster ones” and was not canceled.
Yeah, I get that, but if that were the case I would expect a flood of complaints. I think that is still more likely to be because the customer end was slow as well. Did you also see a lot of cancelled transfers during that time?
Over the two-hour period (excluding repairs and audits):
“uploaded”: 10025
“upload canceled”: 81
“downloaded”: 1088
“download canceled”: 65
So, not a lot of cancels. Average times:
successful upload: 514 seconds
canceled upload: 431 seconds
successful download: 154 seconds
canceled download: 315 seconds
At one point there were 3896 uploads in progress at the same time.
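For anyone who wants to pull similar numbers out of their own logs, something along these lines should work. It is only a rough sketch: it assumes each line starts with an ISO timestamp and ends in a JSON blob with “Piece ID” and “Action” fields, and that the message strings are the usual “upload started” / “uploaded” / “upload canceled” and their download counterparts, so adjust it to whatever your log format actually looks like.

```python
#!/usr/bin/env python3
# Rough sketch of this kind of tally. Assumptions: each log line starts with
# an ISO timestamp and ends with a JSON object containing "Piece ID" and
# "Action" fields; message strings and field names may differ between versions.
import json
import sys
from collections import defaultdict
from datetime import datetime

EVENTS = ("upload started", "uploaded", "upload canceled",
          "download started", "downloaded", "download canceled")

started = {}                    # (kind, piece id) -> start timestamp
durations = defaultdict(list)   # finish event -> elapsed seconds
counts = defaultdict(int)       # finish event -> number of lines

for line in sys.stdin:
    if "{" not in line:
        continue
    for event in EVENTS:
        if event in line:
            break
    else:
        continue
    try:
        ts = datetime.fromisoformat(line.split()[0].replace("Z", "+00:00"))
        fields = json.loads(line[line.index("{"):])
    except ValueError:
        continue
    # Skip repair and audit traffic, to match the numbers above.
    if fields.get("Action", "").endswith(("_REPAIR", "_AUDIT")):
        continue
    kind = "upload" if event.startswith("upload") else "download"
    key = (kind, fields.get("Piece ID"))
    if event.endswith("started"):
        started[key] = ts
    else:
        counts[event] += 1
        t0 = started.pop(key, None)
        if t0 is not None:
            durations[event].append((ts - t0).total_seconds())

for event in EVENTS:
    if event.endswith("started"):
        continue
    avg = sum(durations[event]) / len(durations[event]) if durations[event] else 0.0
    print(f'"{event}": {counts[event]} (avg {avg:.0f} s)')
```

Feed it the log with something like `docker logs storagenode 2>&1 | python3 tally.py` (the `tally.py` name is just for illustration), and filter the time range first if you only care about a specific window.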
This looks very strange to me. The way uploads are distributed to nodes should mean that lots of nodes got a very similar number of uploads per second (unless there is a way to choose “preferred nodes” and hammer only them, leaving everyone else idle, but I do not think there is).
The fact that uploads were taking minutes for tiny files and still completing is also strange. To me this indicates that lots of other nodes were slow (overloaded) as well. It also means that these uploads were actual customer traffic and not some kind of DoS attack (even if the customer was trying to DoS the network, he appears to have been doing so using the normal upload procedure).
During that time, uploads from multiple satellites were just as slow. All of the slow uploads could have been coming from the same customer, but that is not very likely: why would a customer use multiple accounts on different satellites at the same time, and how did he get an account on “saltlake”, which, to my knowledge, is a testing satellite?
i’ve seen something similar when i stall my pool dead with io: there will be a point where uploads and downloads just get started, but almost none complete, or none at all…
then when the system wakes back up, it will rapidly go through all of them.
haven’t really figured out why it happened. it seems to maybe have been related to my arc min setting, which was at the default of 16MB or something, and then it would flush the entire arc at one time, which would lead to a near-stall condition under already heavy load.
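if it is an arc tunable, it’s easy enough to check what the module is actually running with. a minimal sketch, assuming ZFS on Linux, where the live values are exposed under /sys/module/zfs/parameters (parameter names can vary between ZFS versions, so treat the list below as an example):

```python
#!/usr/bin/env python3
# minimal sketch: print a few ZFS module parameters that affect caching and
# write flushing. assumes ZFS on Linux, where current values are exposed
# under /sys/module/zfs/parameters; adjust the names to what your version has.
from pathlib import Path

PARAMS = Path("/sys/module/zfs/parameters")
INTERESTING = ["zfs_arc_min", "zfs_arc_max", "zfs_dirty_data_max", "zfs_txg_timeout"]

for name in INTERESTING:
    path = PARAMS / name
    if path.exists():
        # most of these are byte counts; zfs_txg_timeout is in seconds
        print(f"{name} = {path.read_text().strip()}")
    else:
        print(f"{name}: not present on this ZFS version")
```

to change them persistently, the usual route on ZFS on Linux is an `options zfs zfs_arc_min=<bytes>` line in /etc/modprobe.d/zfs.conf (or your platform’s equivalent) and a reboot.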
can’t say i track the up and downloads this way, so i have no idea how many concurrent ones i get. you can give it a max… i’ve been thinking of doing that, but it seems like i finally got everything back to an acceptable state.
i do know i have seen hundreds started at one time over just 2 minutes in the logs… so it’s certainly possible, while my pool was stalled out… that’s a rough hole to dig out of…
So I looked back at my log for the same time period and all pieces finished quickly as normal, so I can’t really explain what happened. I doubt my setup is that much faster than yours. (Though I guess I do have an SSD write cache.)
Will they complete successfully though?
If so, I’d say it’s possible that the transfer actually finished a long time ago, but something is delaying the log write.
yeah it can be stalled for much longer than i would have thought… alexey mentioned an upload can wait up to like 30 minutes but maybe i misunderstood something there…
it will also ofc cancel or fail some of them, but it can have hiccups where it basically doesn’t fail anything… it just sort of stalls dead and then a few minutes later catches up by succeeding on most of them… i can’t say all, because i haven’t tracked them individually…
it also creates… obviously super high bandwidth spikes, because all of them are basically processed at one time.
But normally long tail cancellation would take care of cancelling the slowest transfers. If something like that happens you would expect all of them to be cancelled or error out.
I’ve had a suspicion for a while that a lot of cancelled transfers actually end up being logged as successful. Quite a while ago now, when they changed how things were logged, we all saw success rates jump to near perfect. Almost all SNOs who’ve reported these numbers show around 99% success. But we know that 10/39 downloads are cancelled and 30/110 uploads are. We should all be seeing much lower percentages than we are. Maybe these are incorrectly logged as successful?
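Just to make the arithmetic explicit, here is what those figures would imply if long tail cancellations were logged faithfully (a back-of-the-envelope sketch using only the 10/39 and 30/110 numbers above, and assuming cancellations are spread evenly across nodes):

```python
# Back-of-the-envelope expectation if long tail cancellation were logged faithfully.
# Figures from above: roughly 10 of the 39 requested download pieces and
# 30 of the 110 uploaded pieces get cut off per transfer.
downloads_requested, downloads_cancelled = 39, 10
uploads_started, uploads_cancelled = 110, 30

expected_download_success = (downloads_requested - downloads_cancelled) / downloads_requested
expected_upload_success = (uploads_started - uploads_cancelled) / uploads_started

print(f"expected download success rate: {expected_download_success:.0%}")  # 74%
print(f"expected upload success rate:   {expected_upload_success:.0%}")    # 73%
```

So if the long tail were actually showing up in the logs as cancellations, success rates should sit in the low-to-mid 70s, not around 99%.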
@Pentium100: Can you look up some of the pieces mentioned in the logs? Are they actually on your disks? Are they in the trash? (You may need to wait until garbage collection is triggered for the corresponding satellite, though.)
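In case it helps, here is a rough sketch of how that lookup could be scripted. It assumes the common blob layout where the piece ID from the log, lowercased, becomes a two-character subdirectory plus a filename with an .sj1 style extension under blobs/ (and similarly under trash/); if your layout differs, treat it only as a starting point. It walks the whole tree, so it will be slow on a large node.

```python
#!/usr/bin/env python3
# Rough sketch: check whether a piece ID from the log is still in blobs/ or
# already in trash/. Assumption: the blob store lowercases the piece ID and
# uses the first two characters as a subdirectory and the rest as the file
# name (plus an extension such as .sj1), i.e. something like
#   <storage dir>/blobs/<satellite dir>/<id[:2]>/<id[2:]>.sj1
import sys
from pathlib import Path

def matches(path: Path, pid: str) -> bool:
    # Reassemble "<parent dir name><file name without extension>" and compare
    # it against the lowercased piece ID from the log.
    return (path.parent.name + path.stem).lower() == pid

def find_piece(storage_dir: Path, piece_id: str):
    pid = piece_id.lower()
    for subdir in ("blobs", "trash"):
        root = storage_dir / subdir
        if not root.is_dir():
            continue
        for path in root.rglob("*"):
            if path.is_file() and matches(path, pid):
                yield subdir, path

if __name__ == "__main__":
    storage, piece = Path(sys.argv[1]), sys.argv[2]
    found = False
    for where, path in find_piece(storage, piece):
        found = True
        print(f"found in {where}: {path}")
    if not found:
        print("not found in blobs/ or trash/")
```

Run it as, for example, `python3 find_piece.py /mnt/storagenode/storage <piece id from the log>` (the script name and mount path are just placeholders).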
that’s interesting, i’m still being troubled by semi-random shutdowns; out of memory seems to occur approximately every 4 days in some cases.
but i have been running with less than 1GB of RAM for those nodes… so maybe garbage collection simply uses more RAM. i can see on my main node that it seems to spike at about 1.5GB for a short period before dropping back down.
going to try upping the allowed memory usage.
i checked the system requirements in the docs, but they don’t seem to mention RAM.
i do seem to remember it being 1GB, but i might be confusing that with cores.