Strange load (like a DDoS attack)

A few minutes ago one of my nodes got a strange load:

I collected information from the debug port:
curl -s localhost:7777/mon/ps > /tmp/result.txt
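
To capture more than one snapshot during a spike, a rough sketch like this works (assuming the same debug endpoint on localhost:7777; the 30-second interval is arbitrary):

while true; do
  # save a timestamped copy of the /mon/ps output so spikes can be compared later
  curl -s localhost:7777/mon/ps > "/tmp/mon-ps-$(date +%Y%m%d-%H%M%S).txt"
  sleep 30
done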


But my node has not crashed, and the load is dropping now:
image

and the node's /mon/ps shows normal output again:
curl -s localhost:7777/mon/ps

[4855391130671764778] storj.io/storj/pkg/process.root() (elapsed: 192h47m2.345983103s)
 [3791477542259197051] storj.io/storj/storagenode.(*Peer).Run() (elapsed: 192h47m0.903888308s)
  [4502346184408064397] storj.io/storj/pkg/server.(*Server).Run() (elapsed: 192h47m0.291775956s)
  [8582672752331851760] storj.io/storj/private/version/checker.(*Service).Run() (elapsed: 192h47m0.290965621s)
  [8936912386000918787] storj.io/storj/storagenode/bandwidth.(*Service).Run() (elapsed: 192h47m0.291707171s)
  [7163324842844850360] storj.io/storj/storagenode/collector.(*Service).Run() (elapsed: 192h47m0.291668141s)
  [5567454460225998770] storj.io/storj/storagenode/console/consoleserver.(*Server).Run() (elapsed: 192h47m0.290566934s)
  [6453650888101349660] storj.io/storj/storagenode/contact.(*Chore).Run() (elapsed: 192h47m0.29174229s)
  [6984412994902266878] storj.io/storj/storagenode/gracefulexit.(*Chore).Run() (elapsed: 192h47m0.292354227s)
  [7694086949645767578] storj.io/storj/storagenode/monitor.(*Service).Run() (elapsed: 192h47m0.29232423s)
  [5389737299688781933] storj.io/storj/storagenode/orders.(*Service).Run() (elapsed: 192h47m0.291652855s)
  [244302455947060197] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 192h47m0.292358616s)

[826013095517286765] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 4.749977719s)
 [2067643844467071328] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 4.74995633s)

[6149165099796225338] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 1.951438616s)
 [7390795848746009901] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 1.951387085s)
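
A quick way to count how many of these requests are in flight at any moment (same endpoint as above, just an illustration):

curl -s localhost:7777/mon/ps | grep -c "live-request"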

I found the root cause of this load: it was the cleaning of trash out of the bin after 7 days:

used space before:
image

used space after:
image

~740GB was cleaned!


The same load situation happens with every portion of trash cleaning:

image

It is really an extremely high load; I am afraid not all nodes can survive it.

After 10 minutes:

image

curl -s localhost:7777/mon/ps shows a lot of sections like this:

[2123516809455615763] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 18m14.504998729s)
 [6009436491625855895] storj.io/storj/storagenode/piecestore.(*Endpoint).doUpload() (elapsed: 18m14.505009388s)
  [717481803337676253] storj.io/storj/storagenode/pieces.(*Writer).Commit() (elapsed: 18m14.403496472s)
   [4603401485507916384] storj.io/storj/storage/filestore.(*blobWriter).Commit() (elapsed: 18m14.403463193s)
    [8489321167678156516] storj.io/storj/storage/filestore.(*Dir).Commit() (elapsed: 18m14.403458266s)

curl -s localhost:7777/mon/ps |grep -c "Writer"
1178
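
To follow how that count changes without re-running the command by hand, something like watch works (the 60-second interval is only an example):

watch -n 60 'curl -s localhost:7777/mon/ps | grep -c "Writer"'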

It is really crazy.
Please pay attention!

It can make a node almost unresponsive when the garbage collector is at work on low-power nodes.
A desktop PC will win at this. On the other hand, today the garbage collector deletes trash after 7 days; later it will delete files as the satellite tells it to, so there will not be so many at one time as there are now.

After 25 minutes:
image

I'm still alive! :slight_smile:
I must survive! :man_dancing:

curl -s localhost:7777/mon/ps |grep -c "Writer"
1660

After 35 minutes:
image

curl -s localhost:7777/mon/ps |grep -c "Writer"
1890

35 minutes of deleting files?

After 40 minutes:

New alerts:

  • Pool tank state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Current alerts:

  • Pool tank state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

image
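
For anyone who gets the same alert, the usual way to see which device was faulted is something like this (the pool name "tank" is taken from the alert above):

# list every vdev and show which device is FAULTED/DEGRADED, with error details
zpool status -v tank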

It is really a disaster!

Yep, it is still running.

Storj wiping the whole node? Looks like it happens with Docker as well, and the drives self-destruct.

After 1 hour the cleaning finished and the load is dropping (the peak was 1200). The node survived, but I lost one disk!

What configuration do you have? RAID1?

Here is the configuration of that node: RAIDZ1 (8 HDD) + L2ARC + ZIL
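
For reference, a layout like that is typically put together roughly as follows (purely illustrative; the device names are placeholders, not my actual ones):

# 8-disk RAIDZ1 data vdev, plus an L2ARC (cache) and a ZIL/SLOG (log) device
zpool create tank \
    raidz1 da0 da1 da2 da3 da4 da5 da6 da7 \
    cache nvd0p1 \
    log nvd0p2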

I will try to reproduce it in my test environment; it is a really interesting problem…

Oh nice, so you found a faulty disk!

Did you have a regular scrub set up so you would catch this kind of situation early?

I had a funny situation once where my drives would fault out whenever they got under high load. It turned out I hadn't considered that my homemade SATA power adapter had high inductance, and voltages were dropping by 20%+ during disk seeks.

I had the same problem; I changed to SATA cables with locks and a new power cable. It's working fine now.

Of course, I have scrub configured on this storage, and alerting is also configured :slight_smile:
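
For example, a weekly scrub can be scheduled with a cron entry like this (illustrative; pool name "tank" again):

# /etc/cron.d/zfs-scrub: run a scrub every Sunday at 03:00
0 3 * * 0 root /sbin/zpool scrub tank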

This disk was kicked from the RAID because its response time was too high; the disk itself does not have any issues, and SMART does not show any issues either. It had been working fine for more than half a year until this extreme load.

But I suspect the root cause is neither on the Storj side nor on the disk side.
Give me some time to finish my investigation; I will update this topic when I confirm my suspicions.