Node overloaded - traffic flooding

I have set it to 75.
Do you see some drop in traffic?

My problem is that my nodes keep crashing in a loop and restarting. The disk is then at 100% because of the filewalker.
Can I disable the filewalker at startup somehow? It is causing big trouble for me right now.

All my disks are basically dead:

I see a lot of "upload rejected" and my hard drive is at 100%.
My logic is: if the HDD is rejecting uploads, it means it can't keep up with demand, so why accept more?
If it can deal with 10, it shouldn't show "upload rejected".

2 Likes

Hello,
I just saw that only one node is performing the tests (120 Mbit peak), and it's the one on 1.104.5.
So the question is why two of the remaining four, all running 1.102.3, crash.
Moreover, no ingress traffic is accepted; it is coming in, but the races are lost.

Regarding fine-tuning of the config, do you have any suggestions to limit parallel ingress traffic?
Has anyone tested it?

Thanks

In the config file, you can change the following line:

# storage2.piece-scan-on-startup: true

to:

storage2.piece-scan-on-startup: false

This should prevent the startup filewalker from running. Keep in mind that if the used-space filewalker is disabled, various bugs can accumulate wrong values for used space, trash, etc.
As long as you have plenty of free space, it's not really a problem. It only really matters when the storage node believes it is full while in reality it has not used all of its assigned disk space.
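
If you run the node with docker-compose, the same setting can also be passed as an environment variable instead of editing config.yaml. A minimal sketch, assuming the STORJ_-prefixed environment-variable mapping works for this key on your image:

environment:
  - STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=false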

3 Likes

Well you have asked:

And that's why I answered: make the nodes smarter and automatically adapt upload concurrency to overall node load, disk load, success rate, or whatever. SNOs have reported that reducing that number helped them.
With the new load monitoring that was recently implemented, you might be able to make the node react to high-load situations dynamically. That could help.
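
Until something dynamic like that exists, a static cap is already available in config.yaml: storage2.max-concurrent-requests makes the node reject new uploads above a threshold (which is where the "upload rejected" messages come from). A minimal sketch; the value 25 is purely illustrative and should be tuned to your hardware:

# reject new uploads once this many transfers are already in flight (0 = unlimited)
storage2.max-concurrent-requests: 25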

1 Like

Guys,
let me update you and share my complaints here.
After the automatic upgrade of all nodes to 1.104.2, the issue disappeared and all nodes worked fine with low load, handling tests of up to 600 Mbit.
This went on for days.
Yesterday I found a few nodes crashed again, and I noticed that you migrated to 1.104.5. I restarted, and this morning three nodes were down again.
I restarted again, and I can see the nodes are overloaded.
What are you doing? Do you test the software before releasing?
Is there a way to force my nodes back to 1.104.2 with docker-compose?

Thanks

1 Like

This passive-aggressive question is not helpful. They test the software, yes.

There don't seem to be widespread complaints about stability with 104.5, so have you considered that the problem might lie with your setup?
@jammerdan has been trying to troubleshoot issues with garbage collection, but as far as I understand it, those issues don't seem to impact node stability.

Is there anything non-standard about your nodes (i.e. deviating from the "simple" one-node-per-HDD, 1-core-per-node, no-virtualisation setup)?

Are there any error messages in logs that might point to the cause of the crash?

5 Likes

Yes,
you are right, I'm passive-aggressive, but I'm stressed from taking care of the environment daily, every hour. It should just be up and running without my continuous love.
There is nothing special in my environment; it's certainly not powerful, but it can do 600 Mbit during the tests! It worked fine with 1.104.2!
Do you know the Docker tag to go back to 1.104.2?

The issue is that the nodes are overloaded with requests… I can see it from the logs, which scroll by so fast you can't read anything…

Thanks

Nobody in the history of the world has ever become less stressed by someone telling them not to stress, but… don't stress! Your nodes are lucky to have your love :slight_smile:

Are your HDDs CMR or SMR?

I believe the first run on 104.5 triggers a garbage collection and filewalker to fix some inconsistencies with garbage-size reporting, so that may be why your machine is more stressed than usual.

Are these nodes all running on the same machine?

Just seeing how fast the logs scroll is not a great indication of what's happening. Is there anything in the last log entries before the node crashes?

I have five different nodes.
The disks are NAS-capable: IronWolf, WD Red, X18, X20, etc. The filesystems have LVM caching enabled using a 500 GB SSD. The nodes are a Pi 4 and others with a 1-core CPU.
For sure the nodes can't sustain this kind of load, maybe because things run in parallel… tests + garbage collection. I don't know.

Are they Red or Red Plus/Pro? Plain Red is SMR.

No, WD Red are not all SMR: List of WD CMR and SMR hard drives (HDD) – NAS Compares

The entire current WD Red lineup is SMR. The whole reason WD got flak was that they quietly slipped SMR into the Red line. What's your exact model number?

Last week we uploaded artificially large files. Now the test is much more realistic, with smaller file sizes. It is not the version that makes the difference; the files that are getting uploaded are different.

2 Likes

Don't we still just get a 4 MB piece anyway? And surely larger files mean sequential writes, which should be easier to handle?

I'm sorry if I am completely misunderstanding the fundamentals of how this works…

1 Like

Hello,
just an update.

Now everything is stable and I can see the tests running with 1.2 Gbit of incoming traffic (the sum of two nodes in parallel). This morning, for 2 hours, I ran as a download-only node by reducing the available space; now I'm fully working again.
So let me remark that the problem is not the disks or my environment, but something that runs in parallel and causes an overload. I'm not saying that having more RAM, more cores, or better disks wouldn't help! I'm just saying that maybe the software could be kinder and try not to overload the nodes.
Why can I see 1.2 Gbit per second now, when this morning and last night my nodes crashed?

Thanks

Have you enabled the lazy filewalker and reduced the concurrency of bloom filter processing?
That might help if your nodes get hammered at startup.

No,
I have only the lazy filewalker enabled.
How can I reduce the concurrency of the bloom filter?
PS: I'm using Linux with docker-compose.

Thanks

version: "3.3"
services:
  storagenode:
    image: storjlabs/storagenode:latest
    container_name: storagenode
    volumes:
      - type: bind
        source: /STORJ/STORJ/identity/storagenode
        target: /app/identity
      - type: bind
        source: /STORJ/STORJ
        target: /app/config
      - type: bind
        source: /var/STORJ_LOCAL
        target: /app/dbs
      - type: bind
        source: /var/STORJ_LOCAL/LOG
        target: /app/config/LOG
    ports:
      - 28967:28967/tcp
      - 28967:28967/udp
      - 14002:14002
    restart: unless-stopped
    stop_grace_period: 300s
    logging:
      options:
        max-size: "50m"
        max-file: "5"
    sysctls:
      net.ipv4.tcp_fastopen: 3
    environment:
      - WALLET=0xXXXX
      - EMAIL=XXX
      - ADDRESS=XXXXt:28967
      - STORAGE=14800GB
      - STORJ_PIECES_ENABLE_LAZY_FILEWALKER=true
      - STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=false
      - STORJ_OPERATOR_WALLET_FEATURES=zksync

I am not entirely sure about docker-compose, but you can change the following in config.yaml:

retain.concurrency: 1

The current default is 5.
Some SNOs have seen a significant improvement by doing this.
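
With docker-compose, the same key can usually be supplied as a STORJ_-prefixed environment variable, following the same pattern as the other STORJ_ variables in your compose file; the exact name below is my assumption based on that convention, so check the logs to confirm it takes effect:

environment:
  - STORJ_RETAIN_CONCURRENCY=1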