Hrmm, that’s weird, I have none of those for either repair up or down.
You get 3 projects of 50GB each for both storage and egress. But if you still have your old account, you may have had the 1TB project; I still have one of those. Keep in mind though that the free limit for the account as a whole is still 150GB for both storage and bandwidth. But that should be plenty to test with.
There used to be a page where we could check stats from similar tests Storj did themselves, but the last time I saw it was before the production launch. I have no idea whether those stats still exist somewhere.
I’m not entirely sure what’s going on. I guess it’s possible there are two separate explanations for why repair and normal transfers were slow on your node at that time, but that seems unlikely.
I don’t think the main issue is that the uploads were slow; my node was overloaded, so it was slow. The main issue is that somebody appears to have waited 10 minutes or longer for the upload to complete, and my node was “one of the faster ones” and was not canceled.
Yeah, I get that, but if that were the case I would expect a flood of complaints. I think that is still more likely to be because the customer end was slow as well. Did you also see a lot of cancelled transfers during that time?
Over the two-hour period (excluding repairs and audits):
“uploaded”: 10025
“upload canceled”: 81
“downloaded”: 1088
“download canceled”: 65
So, not a lot of cancels. Average times:
successful upload: 514 seconds
canceled upload: 431 seconds
successful download: 154 seconds
canceled download: 315 seconds
At one point there were 3896 uploads in progress at the same time.
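For anyone who wants to pull similar numbers out of their own logs, something along these lines should work. It is only a rough sketch: it assumes each line starts with an ISO timestamp and ends in a JSON blob with “Piece ID” and “Action” fields, and that the message strings are the usual “upload started” / “uploaded” / “upload canceled” and their download counterparts, so adjust it to whatever your log format actually looks like.

```python
#!/usr/bin/env python3
# Rough sketch of this kind of tally. Assumptions: each log line starts with
# an ISO timestamp and ends with a JSON object containing "Piece ID" and
# "Action" fields; message strings and field names may differ between versions.
import json
import sys
from collections import defaultdict
from datetime import datetime

EVENTS = ("upload started", "uploaded", "upload canceled",
          "download started", "downloaded", "download canceled")

started = {}                    # (kind, piece id) -> start timestamp
durations = defaultdict(list)   # finish event -> elapsed seconds
counts = defaultdict(int)       # finish event -> number of lines

for line in sys.stdin:
    if "{" not in line:
        continue
    for event in EVENTS:
        if event in line:
            break
    else:
        continue
    try:
        ts = datetime.fromisoformat(line.split()[0].replace("Z", "+00:00"))
        fields = json.loads(line[line.index("{"):])
    except ValueError:
        continue
    # Skip repair and audit traffic, to match the numbers above.
    if fields.get("Action", "").endswith(("_REPAIR", "_AUDIT")):
        continue
    kind = "upload" if event.startswith("upload") else "download"
    key = (kind, fields.get("Piece ID"))
    if event.endswith("started"):
        started[key] = ts
    else:
        counts[event] += 1
        t0 = started.pop(key, None)
        if t0 is not None:
            durations[event].append((ts - t0).total_seconds())

for event in EVENTS:
    if event.endswith("started"):
        continue
    avg = sum(durations[event]) / len(durations[event]) if durations[event] else 0.0
    print(f'"{event}": {counts[event]} (avg {avg:.0f} s)')
```

Feed it the log with something like `docker logs storagenode 2>&1 | python3 tally.py` (the `tally.py` name is just for illustration), and filter the time range first if you only care about a specific window.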
This looks very strange to me. The way uploads are distributed to nodes should mean that lots of nodes got a very similar number of uploads per second (unless there is a way to choose “preferred nodes” and hammer only them, leaving everyone else idle, but I do not think there is).
The fact that uploads were taking minutes for tiny files and still completing is also strange. To me this indicates that lots of other nodes were slow (overloaded) as well. It also means that these uploads were actual customer traffic and not some kind of DoS attack (even if the customer was trying to DoS the network, he appears to have been doing so using the normal upload procedure).
During that time, uploads from multiple satellites were just as slow. All of the slow uploads could have been coming from the same customer, but that is not very likely: why would a customer use multiple accounts on different satellites at the same time, and how did he get an account on “saltlake”, which, to my knowledge, is a testing satellite?
i’ve seen something similar when i stall my pool dead with io: there will be a point where uploads and downloads just get started, but almost none complete, or none at all…
then when the system wakes back up, it will rapidly go through all of them.
haven’t really figured out why it happened. it seems to maybe have been related to my arc min setting, which was at the default of 16MB or something, and then it would flush the entire arc at one time, which would lead to a near-stall condition under already heavy load.
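if it is an arc tunable, it’s easy enough to check what the module is actually running with. a minimal sketch, assuming ZFS on Linux, where the live values are exposed under /sys/module/zfs/parameters (parameter names can vary between ZFS versions, so treat the list below as an example):

```python
#!/usr/bin/env python3
# minimal sketch: print a few ZFS module parameters that affect caching and
# write flushing. assumes ZFS on Linux, where current values are exposed
# under /sys/module/zfs/parameters; adjust the names to what your version has.
from pathlib import Path

PARAMS = Path("/sys/module/zfs/parameters")
INTERESTING = ["zfs_arc_min", "zfs_arc_max", "zfs_dirty_data_max", "zfs_txg_timeout"]

for name in INTERESTING:
    path = PARAMS / name
    if path.exists():
        # most of these are byte counts; zfs_txg_timeout is in seconds
        print(f"{name} = {path.read_text().strip()}")
    else:
        print(f"{name}: not present on this ZFS version")
```

to change them persistently, the usual route on ZFS on Linux is an `options zfs zfs_arc_min=<bytes>` line in /etc/modprobe.d/zfs.conf (or your platform’s equivalent) and a reboot.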
can’t say i track the up and downloads this way, so i have no idea how many concurrent ones i get. you can give it a max… i’ve been thinking of doing that, but it seems like i finally got everything back to an acceptable state.
i do know i have seen hundreds started at one time over just 2 minutes in the logs… so it’s certainly possible, while my pool was stalled out… that’s a rough hole to dig out of…
So I looked back at my log for the same time period and all pieces finished quickly as normal, so I can’t really explain what happened. I doubt my setup is that much faster than yours. (Though I guess I do have an SSD write cache.)
Will they complete successfully though?
If so, I’d say it’s possible that the transfer actually finished a long time ago, but something is delaying the log write.
yeah it can be stalled for much longer than i would have thought… alexey mentioned an upload can wait up to like 30 minutes but maybe i misunderstood something there…
it will also ofc cancel or fail some of them, but it can have hiccups where it basically doesn’t fail anything… it just sort of stalls dead and then a few minutes later catches up by succeeding on most of them… i can’t say all, because i haven’t tracked them individually…
it also creates… obviously super high bandwidth spikes, because all of them are basically processed at one time.
But normally long tail cancellation would take care of cancelling the slowest transfers. If something like that happens you would expect all of them to be cancelled or error out.
I’ve had a suspicion for a while that a lot of cancelled transfers actually end up being logged as successful. Quite a while ago now, when they changed how things were logged, we all saw success rates jump to near perfect. Almost all SNOs who’ve reported these numbers show around 99% success. But we know that 10/39 downloads are cancelled and 30/110 uploads are. We should all be seeing much lower percentages than we are. Maybe these are incorrectly logged as successful?
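Just to make the arithmetic explicit, here is what those figures would imply if long tail cancellations were logged faithfully (a back-of-the-envelope sketch using only the 10/39 and 30/110 numbers above, and assuming cancellations are spread evenly across nodes):

```python
# Back-of-the-envelope expectation if long tail cancellation were logged faithfully.
# Figures from above: roughly 10 of the 39 requested download pieces and
# 30 of the 110 uploaded pieces get cut off per transfer.
downloads_requested, downloads_cancelled = 39, 10
uploads_started, uploads_cancelled = 110, 30

expected_download_success = (downloads_requested - downloads_cancelled) / downloads_requested
expected_upload_success = (uploads_started - uploads_cancelled) / uploads_started

print(f"expected download success rate: {expected_download_success:.0%}")  # 74%
print(f"expected upload success rate:   {expected_upload_success:.0%}")    # 73%
```

So if the long tail were actually showing up in the logs as cancellations, success rates should sit in the low-to-mid 70s, not around 99%.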
@Pentium100: Can you look up some of the pieces mentioned in the logs? Are they actually on your disks? Are they in the trash? (You may need to wait until garbage collection is triggered for the corresponding satellite, though.)
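In case it helps, here is a rough sketch of how that lookup could be scripted. It assumes the common blob layout where the piece ID from the log, lowercased, becomes a two-character subdirectory plus a filename with an .sj1 style extension under blobs/ (and similarly under trash/); if your layout differs, treat it only as a starting point. It walks the whole tree, so it will be slow on a large node.

```python
#!/usr/bin/env python3
# Rough sketch: check whether a piece ID from the log is still in blobs/ or
# already in trash/. Assumption: the blob store lowercases the piece ID and
# uses the first two characters as a subdirectory and the rest as the file
# name (plus an extension such as .sj1), i.e. something like
#   <storage dir>/blobs/<satellite dir>/<id[:2]>/<id[2:]>.sj1
import sys
from pathlib import Path

def matches(path: Path, pid: str) -> bool:
    # Reassemble "<parent dir name><file name without extension>" and compare
    # it against the lowercased piece ID from the log.
    return (path.parent.name + path.stem).lower() == pid

def find_piece(storage_dir: Path, piece_id: str):
    pid = piece_id.lower()
    for subdir in ("blobs", "trash"):
        root = storage_dir / subdir
        if not root.is_dir():
            continue
        for path in root.rglob("*"):
            if path.is_file() and matches(path, pid):
                yield subdir, path

if __name__ == "__main__":
    storage, piece = Path(sys.argv[1]), sys.argv[2]
    found = False
    for where, path in find_piece(storage, piece):
        found = True
        print(f"found in {where}: {path}")
    if not found:
        print("not found in blobs/ or trash/")
```

Run it as, for example, `python3 find_piece.py /mnt/storagenode/storage <piece id from the log>` (the script name and mount path are just placeholders).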
that’s interesting, i’m still being troubled by semi-random shutdowns; out of memory seems to occur approximately every 4 days in some cases.
but i have been running with less than 1GB of RAM for those nodes… so maybe garbage collection simply uses more RAM. i can see on my main node that it seems to spike at about 1.5GB for a short period before dropping back down.
going to try upping the allowed memory usage.
i checked the system requirements in the docs, but they don’t seem to mention RAM.
i do seem to remember it being 1GB, but i might be confusing that with cores.