I have set it to 75.
Do you see some drop in traffic?
My problem is that my nodes keep crashing in a loop and restarting. The disk is then at 100% because of the filewalker.
Can I disable the filewalker at startup somehow? It causes big trouble for me now.
All my disks are basically dead:
I see a lot of "upload rejected" and my hard drive at 100%.
My logic is: if the HDD is rejecting uploads, that means it can't keep up with demand, so why accept more?
If it can deal with 10, it shouldn't have "upload rejected".
Hello,
I just saw that only one node is performing the tests (120 Mbit peak) and it's the one on 1.104.5.
So the question is why 2 of the remaining 4, all running 1.102.3, crash.
Moreover, no ingress traffic is accepted; it's coming in, but the race is lost.
Regarding fine-tuning the config, do you have any suggestions to limit parallel ingress traffic?
Has anyone tested it?
Thanks
In the config file, you can change the following line:
# storage2.piece-scan-on-startup: true
to:
storage2.piece-scan-on-startup: false
This should prevent the startup filewalker from running. Keep in mind that if the used-space filewalker is disabled, various bugs can accumulate wrong values for used space, trash, etc.
As long as you have plenty of free space, it's not really a problem. It only really matters when the storagenode believes it is full when in reality it has not used all of its assigned disk space.
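If you run the node with docker-compose, the same option can also be set as an environment variable instead of editing config.yaml; this is the same STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP variable that appears in the compose file posted later in this thread:
environment:
  - STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=false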
Well, you have asked:
And that's why I have answered: make the nodes smarter and automatically adapt upload concurrency to overall node load, disk load, success rate, or whatever. SNOs have reported that reducing that number has helped them.
With the new load monitoring that was recently implemented, you might be able to make the node react to high-load situations dynamically. That could help.
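For context, the static knob being discussed is the concurrent-request limit in config.yaml (the "I have set it to 75" at the top of the thread most likely refers to this); a minimal sketch:
storage2.max-concurrent-requests: 75
As far as I know, 0 means unlimited; a lower value makes the node return "upload rejected" earlier instead of letting the disk fall behind.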
Guys,
let me update you and share my complaints here.
After the automatic upgrade of all nodes to 1.104.2 the issue disappeared and all nodes worked fine, with low load, handling tests up to 600 Mbit.
This went on for days.
Yesterday I found a few nodes crashed again and noticed that you migrated to 1.104.5. I restarted them, and this morning three nodes were down again.
I restarted again and I can see the nodes are overloaded.
What are you doing? Do you test the software before releasing?
Is there a way to force my nodes back to 1.104.2 with docker-compose?
Thanks
This passive-aggressive question is not helpful. They test the software, yes.
There don't seem to be widespread complaints about stability with 1.104.5, so have you considered that the problem might lie with your setup?
@jammerdan has been trying to troubleshoot issues with garbage collection but, as far as I understand it, those issues don't seem to impact node stability.
Is there anything non-standard about your nodes (i.e. deviating from the "simple" one-node-per-HDD, 1-core-per-node, no-virtualisation setup)?
Are there any error messages in logs that might point to the cause of the crash?
Yes,
you are right, I'm passive-aggressive, but I'm stressed from having to look after the environment daily, every hour. It should just be up and running without my continuous love.
There is nothing special in my environment; it's for sure not powerful, but it can do 600 Mbit during the tests! It worked fine with 1.104.2!
Do you know the Docker tag to go back to 1.104.2?
The issue is that the nodes are overloaded with requests… I can see it from the logs, which scroll so fast you can't read anything…
Thanks
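On the tag question, a sketch of pinning the image in docker-compose; whether a version-specific tag like 1.104.2 is actually published is an assumption here, so check Docker Hub for the exact tag name first:
services:
  storagenode:
    image: storjlabs/storagenode:1.104.2   # hypothetical tag, verify it exists on Docker Hub
Also, since the thread mentions automatic upgrades, the updater inside the container may still move the binary forward regardless of the image tag, so pinning may not stick.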
Nobody in the history of the world has ever become less stressed by someone telling them not to stress but… don't stress! Your nodes are lucky to have your love.
Are your HDDs CMR or SMR?
I believe the first run on 1.104.5 triggers a garbage collection and file walker to fix some inconsistencies with garbage size reporting, so that may be why your machine is more stressed than usual.
Are these nodes all running on the same machine?
Just seeing how fast the logs scroll is not a great indication of what's happening. Is there anything in the last log entries before the node crashes?
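A quick way to check the tail of the logs, assuming the container name from the compose file posted later in the thread and that logging has not been redirected elsewhere:
docker logs --tail 50 storagenode
If your node writes its log to a file instead, check the end of that file (e.g. with tail -n 50).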
I have five different nodes.
The disks are NAS-capable: IronWolf, WD Red, X18, X20, etc. The filesystems have LVM caching enabled using a 500 GB SSD. The nodes are a Pi 4 and others with single-core CPUs.
For sure the nodes can't sustain this kind of load, maybe because things run in parallel… tests + garbage collection. I don't know.
Are they Red or Red Plus/Pro? Red is SMR.
No, WD Red are not all SMR: List of WD CMR and SMR hard drives (HDD) - NAS Compares
The entire current WD Red lineup is SMR. The whole reason WD got flak was because they quietly slipped SMR into the Red line. What's your exact model number?
Last week we uploaded artificially large files. Now the test is much more realistic, with smaller file sizes. It's not the version that makes the difference; the files that are getting uploaded are different.
Don't we still just get a 4 MB piece anyway? And surely larger files mean sequential writes, which should be easier to handle?
I'm sorry if I am completely misunderstanding the fundamentals of how this works…
Hello,
just an update.
Now everything is stable and I can see the tests running with 1.2 Gbit of incoming traffic (the sum of two nodes in parallel). This morning I ran for 2 hours as a download-only node by reducing the available space; now I'm fully working again.
So let me remark that the problem is not the disks or my environment but something that runs in parallel and causes an overload. I'm not denying that more RAM, more cores, or better disks could help! I'm just saying that maybe the software could be kinder and try not to overload nodes.
Why can I see 1.2 Gbit per second now, when this morning and last night my nodes crashed?
Thanks
Have you enabled the lazy file walker and reduced concurrency on bloom filter processing?
That might help if your nodes get hammered at startup.
No,
I have only the lazy file walker enabled.
How can I reduce the concurrency on bloom filter processing?
PS: I'm using Linux with docker-compose.
Thanks
version: "3.3"
services:
  storagenode:
    image: storjlabs/storagenode:latest
    container_name: storagenode
    volumes:
      - type: bind
        source: /STORJ/STORJ/identity/storagenode
        target: /app/identity
      - type: bind
        source: /STORJ/STORJ
        target: /app/config
      - type: bind
        source: /var/STORJ_LOCAL
        target: /app/dbs
      - type: bind
        source: /var/STORJ_LOCAL/LOG
        target: /app/config/LOG
    ports:
      - 28967:28967/tcp
      - 28967:28967/udp
      - 14002:14002
    restart: unless-stopped
    stop_grace_period: 300s
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"
    sysctls:
      net.ipv4.tcp_fastopen: 3
    environment:
      - WALLET=0xXXXX
      - EMAIL=XXX
      - ADDRESS=XXXXt:28967
      - STORAGE=14800GB
      - STORJ_PIECES_ENABLE_LAZY_FILEWALKER=true
      - STORJ_STORAGE2_PIECE_SCAN_ON_STARTUP=false
      - STORJ_OPERATOR_WALLET_FEATURES=zksync
I am not entirely sure about docker compose, but you can change the following in config.yaml:
retain.concurrency: 1
The current default is 5.
Some SNOs have seen a significant improvement by doing this.
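For docker-compose, the same setting should map to an environment variable following the naming pattern already used in the compose file above (dots and dashes become underscores, with a STORJ_ prefix); treat the exact variable name as an assumption and verify it against your setup:
environment:
  - STORJ_RETAIN_CONCURRENCY=1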