Node overloaded - traffic flooding

I have set it to 75.
Do you see some drop in traffic?

My problem is that my nodes keep crashing in a loop and restarting. The disk is then at 100% because of the filewalker.
Can I disable the filewalker at startup somehow? It is causing big trouble for me right now.

All my disks are basically dead:

I see a lot of "upload rejected" and my hard drive is at 100%.
My logic is: if the HDD is rejecting uploads, it means it can't keep up with demand, so why accept more?
If it can deal with 10, it shouldn't show "upload rejected".

2 Likes

Hello,
I just saw that only one node is performing the tests (120 Mbit peak), and it's the one on 1.104.5.
So the question is why two of the remaining four, all running 1.102.3, crash.
Moreover, no ingress traffic is accepted; it is coming in, but the races are lost.

Regarding fine-tuning of the config, do you have any suggestions to limit parallel ingress traffic?
Has anyone tested it?

Thanks

In the config file, you can change the following line:

# storage2.piece-scan-on-startup: true

to:

storage2.piece-scan-on-startup: false

This should prevent the startup filewalker from running. Keep in mind that if the used-space filewalker is disabled, various bugs can accumulate wrong values for used space, trash, etc.
As long as you have plenty of free space, it's not really a problem. It only really matters when the storage node believes it is full while in reality it has not used all of its assigned disk space.
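
If you run the node with docker-compose, the same setting can also be passed as an environment variable instead of editing config.yaml. A minimal sketch, assuming the STORJ_-prefixed environment-variable mapping works for this key on your image:

environment:
  - STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=false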

3 Likes

Well you have asked:

And that's why I answered: make the nodes smarter and automatically adapt upload concurrency to overall node load, disk load, success rate, or whatever. SNOs have reported that reducing that number helped them.
With the new load monitoring that was recently implemented, you might be able to make the node react to high-load situations dynamically. That could help.
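
Until something dynamic like that exists, a static cap is already available in config.yaml: storage2.max-concurrent-requests makes the node reject new uploads above a threshold (which is where the "upload rejected" messages come from). A minimal sketch; the value 25 is purely illustrative and should be tuned to your hardware:

# reject new uploads once this many transfers are already in flight (0 = unlimited)
storage2.max-concurrent-requests: 25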

1 Like

Guys,
let me update you and share my complaints here.
After the automatic upgrade of all nodes to 1.104.2, the issue disappeared and all nodes worked fine with low load, handling tests of up to 600 Mbit.
This went on for days.
Yesterday I found a few nodes crashed again, and I noticed that you migrated to 1.104.5. I restarted, and this morning three nodes were down again.
I restarted again, and I can see the nodes are overloaded.
What are you doing? Do you test the software before releasing?
Is there a way to force my nodes back to 1.104.2 with docker-compose?

Thanks

1 Like

This passive-aggressive question is not helpful. They test the software, yes.

There don't seem to be widespread complaints about stability with 104.5, so have you considered that the problem might lie with your setup?
@jammerdan has been trying to troubleshoot issues with garbage collection, but as far as I understand it, those issues don't seem to impact node stability.

Is there anything non-standard about your nodes (i.e. deviating from the "simple" one-node-per-HDD, 1-core-per-node, no-virtualisation setup)?

Are there any error messages in logs that might point to the cause of the crash?

5 Likes

Yes,
you are right, I'm passive-aggressive, but I'm stressed from taking care of the environment daily, every hour. It should just be up and running without my continuous love.
There is nothing special in my environment; it's certainly not powerful, but it can do 600 Mbit during the tests! It worked fine with 1.104.2!
Do you know the Docker tag to go back to 1.104.2?

The issue is that the nodes are overloaded with requests… I can see it from the logs, which scroll by so fast you can't read anything…

Thanks

Nobody in the history of the world has ever become less stressed by someone telling them not to stress, but… don't stress! Your nodes are lucky to have your love :slight_smile:

Are your HDDs CMR or SMR?

I believe the first run on 104.5 triggers a garbage collection and filewalker to fix some inconsistencies with garbage-size reporting, so that may be why your machine is more stressed than usual.

Are these nodes all running on the same machine?

Just seeing how fast the logs scroll is not a great indication of what's happening. Is there anything in the last log entries before the node crashes?

I have five different nodes.
The disks are NAS-capable: IronWolf, WD Red, X18, X20, etc. The filesystems have LVM caching enabled using a 500 GB SSD. The nodes are a Pi 4 and others with a 1-core CPU.
For sure the nodes can't sustain this kind of load, maybe because things run in parallel… tests + garbage collection. I don't know.

Are they Red or Red Plus/Pro? Plain Red is SMR.

No, WD Red are not all SMR: List of WD CMR and SMR hard drives (HDD) – NAS Compares

The entire current WD Red lineup is SMR. The whole reason WD got flak was that they quietly slipped SMR into the Red line. What's your exact model number?

Last week we uploaded artificially large files. Now the test is much more realistic, with smaller file sizes. It is not the version that makes the difference; the files that are getting uploaded are different.

2 Likes

Don't we still just get a 4 MB piece anyway? And surely larger files mean sequential writes, which should be easier to handle?

I'm sorry if I am completely misunderstanding the fundamentals of how this works…

1 Like

Hello,
just an update.

Now everything is stable and I can see the tests running with 1.2 Gbit of incoming traffic (the sum of two nodes in parallel). This morning, for 2 hours, I ran as a download-only node by reducing the available space; now I'm fully working again.
So let me remark that the problem is not the disks or my environment, but something that runs in parallel and causes an overload. I'm not saying that having more RAM, more cores, or better disks wouldn't help! I'm just saying that maybe the software could be kinder and try not to overload the nodes.
Why can I see 1.2 Gbit per second now, when this morning and last night my nodes crashed?

Thanks

Have you enabled the lazy filewalker and reduced the concurrency of bloom filter processing?
That might help if your nodes get hammered at startup.

No,
I have only the lazy filewalker enabled.
How can I reduce the concurrency of the bloom filter?
PS: I'm using Linux with docker-compose.

Thanks

version: "3.3"
services:
  storagenode:
    image: storjlabs/storagenode:latest
    container_name: storagenode
    volumes:
      - type: bind
        source: /STORJ/STORJ/identity/storagenode
        target: /app/identity
      - type: bind
        source: /STORJ/STORJ
        target: /app/config
      - type: bind
        source: /var/STORJ_LOCAL
        target: /app/dbs
      - type: bind
        source: /var/STORJ_LOCAL/LOG
        target: /app/config/LOG
    ports:
      - 28967:28967/tcp
      - 28967:28967/udp
      - 14002:14002
    restart: unless-stopped
    stop_grace_period: 300s
    logging:
      options:
        max-size: "50m"
        max-file: "5"
    sysctls:
      net.ipv4.tcp_fastopen: 3
    environment:
      - WALLET=0xXXXX
      - EMAIL=XXX
      - ADDRESS=XXXXt:28967
      - STORAGE=14800GB
      - STORJ_PIECES_ENABLE_LAZY_FILEWALKER=true
      - STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=false
      - STORJ_OPERATOR_WALLET_FEATURES=zksync

I am not entirely sure about docker-compose, but you can change the following in config.yaml:

retain.concurrency: 1

The current default is 5.
Some SNOs have seen a significant improvement by doing this.
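
With docker-compose, the same key can usually be supplied as a STORJ_-prefixed environment variable, following the same pattern as the other STORJ_ variables in your compose file; the exact name below is my assumption based on that convention, so check the logs to confirm it takes effect:

environment:
  - STORJ_RETAIN_CONCURRENCY=1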