Strange load (like DDoS attack)

Load average is a historical metric and can be misleading at times. The actual value is the sum of processes/threads that are executing AND processes/threads that are waiting for CPU or IO (averaged over 1/5/15 minutes).

The number executing cannot be higher than the number of CPU cores on the system, and since you probably don’t have 1000+ cores, it’s an indication that a large number of processes/threads are waiting for CPU or IO.

It looks like the CPU is actually mostly idle, so they are waiting for IO, which is likely disk or network IO, or both. Any disk can handle only a limited number of operations per second; once this limit is exhausted, processes start to queue up for disk IO, and while they wait for a response they push the load average up.
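
For a quick sanity check, here is a minimal sketch (standard-library Python; the threshold and wording are just illustrative) that compares the load average against the core count:

```python
import os

# On Linux, the load average counts runnable tasks plus tasks blocked in
# uninterruptible sleep (usually disk IO), averaged over 1, 5, and 15 minutes.
cores = os.cpu_count() or 1
load1, load5, load15 = os.getloadavg()

print(f"cores={cores}  load: 1m={load1:.2f} 5m={load5:.2f} 15m={load15:.2f}")

if load1 > cores:
    # More tasks than the cores can run at once: the excess is queued,
    # waiting either for CPU time or (more often) for disk/network IO.
    print("Load exceeds core count -> tasks are queuing; check iowait with iostat or vmstat.")
```

If the load is high while CPU usage is low, the queued tasks are almost certainly blocked on IO, since the load average also counts tasks in uninterruptible disk sleep.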

That said, 1000+ threads for the garbage collector probably indicates a bug in the software, excessive forking or parallelisation, perhaps spawning new threads while the existing ones haven’t completed? I believe the garbage collector is not time-sensitive and shouldn’t be spawning threads so aggressively. Something the Storj devs can confirm.

Please hold on, I think the issue is on my side; I am testing and verifying at the moment.
I will update this post when I finish my tests.

After a long investigation, I found a few issues in my infrastructure (ESXi):

  1. The datastore where the Storj data is located uses VMFS 6 and had the “Space reclamation” feature enabled (rate limited). Solution: disable this feature (it makes no sense for Storj).

  2. The Storj datastore inside the VM was mounted with the “discard” option in fstab; “discard” is used by VMware’s space reclamation feature to reclaim unused or deleted blocks on the fly. Solution: remove the “discard” option from fstab and remount the storage (a quick way to check which mounts still use it is sketched after this list).

  3. Issues 1+2 also impacted ZFS performance and significantly slowed down the storage during massive delete operations, while read/write operations stayed normal (I always use fio to test everything, but I never had a test for delete operations; a simple delete benchmark is sketched below as well).
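
For item 2, a minimal sketch (plain Python reading /proc/mounts inside the VM; adjust it for your own layout) to list which filesystems are still mounted with “discard”:

```python
# Minimal sketch: list mounts that still carry the "discard" option.
# Assumes a Linux guest where /proc/mounts is available.
MOUNTS_FILE = "/proc/mounts"

def mounts_with_discard(path=MOUNTS_FILE):
    """Return (device, mountpoint) pairs mounted with the discard option."""
    hits = []
    with open(path) as f:
        for line in f:
            device, mountpoint, fstype, options = line.split()[:4]
            if "discard" in options.split(","):
                hits.append((device, mountpoint))
    return hits

if __name__ == "__main__":
    for device, mountpoint in mounts_with_discard():
        print(f"{device} mounted at {mountpoint} still uses 'discard'")
```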
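
And for item 3, a rough sketch of the kind of delete test that fio-style read/write benchmarks miss (Python; the file count and size are arbitrary example values, and base_dir should point at the dataset you want to exercise):

```python
import os
import tempfile
import time

FILE_COUNT = 10_000  # arbitrary example value
FILE_SIZE = 4096     # 4 KiB per file

def benchmark_deletes(base_dir=None):
    """Create FILE_COUNT small files, then time how long mass deletion takes."""
    with tempfile.TemporaryDirectory(dir=base_dir) as d:
        paths = []
        start = time.monotonic()
        for i in range(FILE_COUNT):
            p = os.path.join(d, f"blob-{i}.bin")
            with open(p, "wb") as f:
                f.write(b"\0" * FILE_SIZE)
            paths.append(p)
        create_s = time.monotonic() - start

        start = time.monotonic()
        for p in paths:
            os.unlink(p)
        delete_s = time.monotonic() - start

    print(f"created {FILE_COUNT} files in {create_s:.1f}s, deleted them in {delete_s:.1f}s")

if __name__ == "__main__":
    benchmark_deletes()  # pass base_dir="/path/to/dataset" to test a specific pool
```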

So, the issue is now solved. I apologize for reporting the wrong issue; it was in my infrastructure. I hope my investigation can help someone with a similar issue.


Thanks for the tips regarding the ESXi environment!

This all causes lag, which is reasonable, but if a drive got dropped out of your ZFS array, that couldn’t have been caused by software. Either you have a bad drive, or there was a read error and the drives didn’t have TLER, or something of the sort.

You are welcome! :slight_smile:

About the dropped disk: it is a WD Red, and of course it has TLER and other RAID features. After the failure, I checked everything on the failed drive (SMART, surface scan, etc.) and found nothing. But after analyzing the logs, I found that the drive was dropped due to a response timeout (no response). So it really was a DDoS of my storage system… but lesson learned, and I implemented another safeguard for my array that prevents the situation I had before. Also, I successfully returned the “failed” drive to the array without any issues.
