Weird behavior around nodes

Hi.
I have a little problem. I started 8 nodes (1 TB each) a few days ago to check what this is all about. The first few days were clean. Today I noticed that 3 of my nodes have an online status around 80-93% and audit 95-97%, while the rest are at 100%. The nodes are on the same Ubuntu docker server, and I'm sure the disks are fine. The next weird thing is that the nodes are not balanced: one has 30 GB of stored data, the second 27, the third 60, and so on, while the last one has 160 GB. All nodes have been running for exactly the same time, except the 7th node, which was offline for at least 12 hours (this node has around 40 GB), and after a restart it shows 100% everywhere (online, audit), which is weird. The third problem is that the 7th node just hung, and container restarts didn't help; I had to clone/reconfigure the node to make it work again. The nodes are on separate configs. What am I missing? What can I check to figure out what is going on?

Check the docker logs, where you will likely find "readability" or "writability" time-out errors causing a fatal error and crashing some nodes (just an educated guess).
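For example, something like this (a sketch; adjust the container name per node) will surface those timeouts and any fatal errors:

docker logs stor1 2>&1 | grep -Ei "readab|writab|fatal" | tail -n 20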

Ouch, does that mean the node is already disqualified (threshold = 96%)?

If you got a 95% audit score for a sat, then your node has been disqualified on that sat.

Nothing is weird with this. It’s completely normal.

You’re on Windows or Linux?

Hello @Various69,
Welcome to the forum!

This sounds like you used a cloned identity for them. A cloned identity, even with a new authorization token, is still the same identity, but with lost data. If so, all 8 will be DQ pretty soon.

“Ubuntu docker server”

It’s not that. I generated the identities for all of them one by one, each with a separate token. It took only 2-3h.

So something else is going on: the audit score can only drop if the data is unavailable, corrupted, or lost. To get disqualified for downtime, the node must be offline for more than 30 days. You can use this guide to try to figure out why the audit score dropped:

However, please make sure that you did use each identity for the correct node without duplication - their dashboards should have different NodeIDs.
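A quick way to compare them without opening each dashboard, as a sketch (it assumes the dashboards are published on host ports 14001-14008 and that the web dashboard API at /api/sno is reachable from the host):

for port in $(seq 14001 14008); do
    id=$(curl -s "http://localhost:$port/api/sno" | grep -o '"nodeID":"[^"]*"')
    echo "port $port: ${id:-no response}"
done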

They are different, just checked to be sure.


How about the port assignments? The external ports must be different if all the nodes share the same IP.

I would scrap everything, literally rm -rf everything.

Then set up one node. Just one. Make sure you can keep it running for a few weeks.

Then, if you want to, add another node. Monitor both for a week.

It makes no sense to detangle this, too many variables and potential mishaps.

Best approach: write a script to set up a node. Then you can post it here and people can point out any pitfalls.


They are different for each node.

There are no mishaps. The whole setup is really simple (for me, I'm the IT "specialist" :slight_smile: in my company):
1 PC with Ubuntu Server, 9 disks (1 system SSD, 8 HDDs for nodes), 1 docker container per disk (8 in total).

Docker run scripts:

docker run --rm -e SETUP="true" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/home/patryk/Desktop/id/stor1",destination=/app/identity \
    --mount type=bind,source="/mnt/stor1",destination=/app/config \
    --name stor1 storjlabs/storagenode:latest

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp \
    -p 28967:28967/udp \
    -p 14001:14002 \
    -e WALLET=" \
    -e EMAIL="" \
    -e ADDRESS="my.public.ip:28967" \
    -e STORAGE="1TB" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/home/patryk/Desktop/id/stor1",destination=/app/identity \
    --mount type=bind,source="/mnt/stor1",destination=/app/config \
    --name stor1 storjlabs/storagenode:latest

For every next node, "stor1" changes to the next number, so the names look like stor2, stor3, …
-p 28967:28967 – the host port goes up by 1 for every node, so the second node is 28968:28967
-p 14001:14002 – the host port goes up by 1 the same way, so the second node is 14002:14002, and so on
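Roughly the same scheme as a parameterized sketch (wallet/email are placeholders, everything else mirrors the command above):

#!/bin/bash
# run_node.sh <n> - start storage node number <n> using the +1 port scheme
n=$1
docker run -d --restart unless-stopped --stop-timeout 300 \
    -p $((28966 + n)):28967/tcp \
    -p $((28966 + n)):28967/udp \
    -p $((14000 + n)):14002 \
    -e WALLET="0x_placeholder" \
    -e EMAIL="me@example.com" \
    -e ADDRESS="my.public.ip:$((28966 + n))" \
    -e STORAGE="1TB" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/home/patryk/Desktop/id/stor$n",destination=/app/identity \
    --mount type=bind,source="/mnt/stor$n",destination=/app/config \
    --name "stor$n" storjlabs/storagenode:latest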

Ports are forwarded on a MikroTik router with a public IP from my ISP. Bandwidth is 600/60 Mbps.

Now I ran the script from @Bivvo (Howto: storage node health check > discord + email alerting script) to see what has been going on since the nodes started, and I'm confused. My dashboards under Suspension & Audit look like this:

Node 1: Everything 100%
Node 2: Everything 100%
Node 3: Everything 100% except ap1.storj.io:7777 Online 98%
Node 4: Everything 100%
Node 5: ap1.storj.io:7777 Online 93.75%, us1.storj.io:7777 Online 95.83%, eu1.storj.io:7777 Online 97.5%
Node 6: Everything 100%
Node 7: Like node 5, ±1%
Node 8: Everything 100%

But from Bivvo's script, which I used to analyze the whole logs, I got:

[stor1] 
.. downloads (canceled: 2.85%, failed: 0.11%, audit: 97%) 

[stor2] 
.. downloads (canceled: 3.87%, failed: 0.10%, audit: 96%) 
.. uploads (canceled: 0.71%, failed: 1.91%, audit: 97%) 

[stor3] 
.. downloads (canceled: 3.04%, failed: 0.14%, audit: 97%) 

[stor4]
no errors

[stor5] 
.. downloads (canceled: 3.26%, failed: 0.17%, audit: 97%) 
.. uploads (canceled: 0.74%, failed: 2.54%, audit: 97%) 

[stor6] 
.. uploads (canceled: 0.75%, failed: 1.84%, audit: 97%) 

[stor7] 
.. downloads (canceled: 3.56%, failed: 0.08%, audit: 96%) 
.. uploads (canceled: 0.72%, failed: 2.02%, audit: 97%)

[stor8] 
.. downloads (canceled: 2.96%, failed: 0.09%, audit: 97%) 
.. uploads (canceled: 0.59%, failed: 2.36%, audit: 97%)

docker logs stor(1, 2, 3...) 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed

^ shows nothing

docker logs stor* 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep started -c & docker logs stor* 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep downloaded -c

^ shows the same amount of audits started and finished
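For completeness, the same started/finished comparison across all eight containers in one loop (a sketch based on the commands above):

for i in $(seq 1 8); do
    s=$(docker logs "stor$i" 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -c started)
    d=$(docker logs "stor$i" 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -c downloaded)
    echo "stor$i: started=$s downloaded=$d"
done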

What am I missing?

Judging by the online score, I think your nodes sometimes don't respond to requests. I don't see anything wrong with your configs, so it may be a performance problem. You should check server load and network connection consistency.
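For example, something like this (a sketch; it assumes the mtr package is installed) shows per-hop packet loss and latency to one of the satellites over a few minutes:

mtr -rwbc 300 us1.storj.io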

Server load is fine: 4 cores and the load is around 5-10% all the time. Disk throughput is around 10-20 MB/s, so I have plenty of room. I don't know how to check the network; from my perspective it's fine, I have a ping running to Google and not a single packet is missing.

Those are not IOPS; what are the actual IOPS?
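Something like this (from the sysstat package) will show them; r/s and w/s are the read/write IOPS and %util shows how busy each disk is:

iostat -dx 5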

But it does not matter much: if the disks were overwhelmed, you would see the audit score dropping when the node drops or times out repair and/or audit requests.

A dropping online score means the satellite was unable to contact your node. You won't find evidence in the logs of something that did not happen.

The possible reasons to explore, in this order:

  • your DDNS solution. Some are flaky to varying degrees and occasionally respond with garbage or don't respond at all. I recommend Cloudflare.
  • Security software on the gateway (a variety of anti-DDoS or other Suricata-type products) may be blocking connections to your node. Turn it off. It's 100% useless unless you have a dedicated security team to maintain the rules.
  • Router software – maybe running out of open connections, or otherwise unhappy. (A quick external reachability check is sketched below.)
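For the reachability angle, a sketch you can run from a machine outside your network (a VPS, a phone hotspot, etc.); it assumes the OpenBSD variant of netcat, which accepts port ranges, and it only checks TCP, not QUIC/UDP:

nc -zv my.public.ip 28967-28974

Running it periodically (e.g. from cron) is more telling than a single pass, since the online score drops on intermittent failures.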

My original recommendation still stands. Start with one node and stabilize it. Then add more one by one. There is no reason to add 10 nodes at once – they all share traffic.


That 60 Mbps upstream is dropping your requests to upload pieces (downloads from the clients' side) and audits.
That's your problem.
When the test traffic was high, I had a limit of 150 Mbps that dropped my scores as well.
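To see whether that upstream is actually saturated during busy periods, something like this (a sketch; it assumes the vnstat package and that eth0 is the interface carrying node traffic) shows live throughput; watch the tx rate against the 60 Mbps cap:

vnstat -l -i eth0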


Good catch. I see two more problems here:


This (the SETUP step) must be executed only once for each identity and never again; otherwise you may destroy the node by making a mistake in the path.

I would suggest moving each identity to its disk with the data; they are useless without each other anyway, and there are far fewer chances of mistakes like pairing an identity with the wrong disk in a multinode setup.
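A sketch for one node (repeat per node, and verify the copy before removing anything from the old location):

docker stop -t 300 stor1
mkdir -p /mnt/stor1/identity
cp -a /home/patryk/Desktop/id/stor1/. /mnt/stor1/identity/
# then point the identity mount at the new location in the run command:
#   --mount type=bind,source="/mnt/stor1/identity",destination=/app/identity
docker rm stor1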

Perhaps the docker logs have already rotated: the local log driver keeps 5 files of 20 MiB each by default, and 100 MiB is not enough for historic research. I would recommend either redirecting the logs to a file (you would also want to configure a logrotate service) or increasing the number and size of the log files for docker.
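Two sketches of that, either keeping more docker log history per container or letting the node write its own log file onto the data disk via the log.output option in config.yaml:

# option 1: add to the docker run command
#   --log-driver local --log-opt max-size=100m --log-opt max-file=10

# option 2: set this in /mnt/stor1/config.yaml and restart the container,
# then rotate /mnt/stor1/node.log with logrotate:
#   log.output: "/app/config/node.log"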

I mean disk bandwidth, read/write.

OK, but why is saltlake.tardigrade.io:7777 at 100% online everywhere while the rest are random?

I don't use DDNS; I have a static IP from my network provider, not NAT'ed.
I don't use fancy firewall rules that could be blocking Storj, just standard DDoS and ICMP flood protection, but that doesn't affect Storj (I have a full view in MikroTik of what gets blocked, and since the nodes started nothing has landed on the block lists). And I am that "security team" :smile:. The router is totally fine, around 15-20% load during a normal work day.