Weird behavior around nodes

Hi.
I have a little problem. I started 8 nodes (1 TB each) a few days ago to check what this is all about. The first few days were clean. Today I noticed that 3 of my nodes have an online status around 80-93% and audit 95-97%, while the rest are at 100%. The nodes are on the same Ubuntu docker server, and I'm sure the disks are fine. The next weird thing is that the nodes are not balanced: one has 30 GB of stored data, the second 27, the third 60, and so on, while the last one has 160 GB. All nodes have been running for exactly the same time, except the 7th node, which was offline for at least 12 hours (this node has around 40 GB), and after a restart it shows 100% everywhere (online, audit), which is weird. The third problem is that the 7th node just hung, and container restarts didn't help; I had to clone/reconfigure the node to make it work again. The nodes are on separate configs. What am I missing? What can I check to figure out what is going on?

Check the docker logs, where you will likely find "readability" or "writability" time-out errors causing a fatal error and crashing some nodes (just an educated guess).
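For example, something like this (a sketch; adjust the container name per node) will surface those timeouts and any fatal errors:

docker logs stor1 2>&1 | grep -Ei "readab|writab|fatal" | tail -n 20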

Ouch, does that mean the node is already disqualified (threshold = 96%)?

If you got a 95% audit score for a sat, then your node has been disqualified on that sat.

Nothing is weird with this. It’s completely normal.

You’re on Windows or Linux?

Hello @Various69,
Welcome to the forum!

This sounds like you used a cloned identity for them. A cloned identity, even with a new authorization token, is still the same identity, but with lost data. If so, all 8 will be DQ pretty soon.

“Ubuntu docker server”

It’s not that. I generated the identities for all of them one by one, each with a separate token. It took only 2-3h.

So something else is going on: the audit score can only drop if the data is unavailable, corrupted, or lost. To get disqualified for downtime, the node must be offline for more than 30 days. You can use this guide to try to figure out why the audit score dropped:

However, please make sure that you did use each identity for the correct node without duplication - their dashboards should have different NodeIDs.
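A quick way to compare them without opening each dashboard, as a sketch (it assumes the dashboards are published on host ports 14001-14008 and that the web dashboard API at /api/sno is reachable from the host):

for port in $(seq 14001 14008); do
    id=$(curl -s "http://localhost:$port/api/sno" | grep -o '"nodeID":"[^"]*"')
    echo "port $port: ${id:-no response}"
done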

They are different, just checked to be sure.


How about the port assignments? The external ports must be different if all the nodes share the same IP.

I would scrap everything, literally rm -rf everything.

Then set up one node. Just one. Make sure you can keep it running for a few weeks.

Then, if you want to, add another node. Monitor both for a week.

It makes no sense to detangle this, too many variables and potential mishaps.

Best approach: write a script to set up a node. Then you can post it here and people can point out any pitfalls.


They are different for each node.

There are no mishaps. The whole setup is really simple (for me, I'm the IT "specialist" :slight_smile: in my company):
1 PC with Ubuntu Server, 9 disks (1 system SSD, 8 HDDs for nodes), 1 docker container per disk (8 in total).

Docker run scripts:

docker run --rm -e SETUP="true" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/home/patryk/Desktop/id/stor1",destination=/app/identity \
    --mount type=bind,source="/mnt/stor1",destination=/app/config \
    --name stor1 storjlabs/storagenode:latest

docker run -d --restart unless-stopped --stop-timeout 300 \
    -p 28967:28967/tcp \
    -p 28967:28967/udp \
    -p 14001:14002 \
    -e WALLET=" \
    -e EMAIL="" \
    -e ADDRESS="my.public.ip:28967" \
    -e STORAGE="1TB" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/home/patryk/Desktop/id/stor1",destination=/app/identity \
    --mount type=bind,source="/mnt/stor1",destination=/app/config \
    --name stor1 storjlabs/storagenode:latest

For every next node, "stor1" changes to the next number, so the names look like stor2, stor3, …
-p 28967:28967 – the host port goes up by 1 for every node, so the second node is 28968:28967
-p 14001:14002 – the host port goes up by 1 the same way, so the second node is 14002:14002, and so on
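Roughly the same scheme as a parameterized sketch (wallet/email are placeholders, everything else mirrors the command above):

#!/bin/bash
# run_node.sh <n> - start storage node number <n> using the +1 port scheme
n=$1
docker run -d --restart unless-stopped --stop-timeout 300 \
    -p $((28966 + n)):28967/tcp \
    -p $((28966 + n)):28967/udp \
    -p $((14000 + n)):14002 \
    -e WALLET="0x_placeholder" \
    -e EMAIL="me@example.com" \
    -e ADDRESS="my.public.ip:$((28966 + n))" \
    -e STORAGE="1TB" \
    --user $(id -u):$(id -g) \
    --mount type=bind,source="/home/patryk/Desktop/id/stor$n",destination=/app/identity \
    --mount type=bind,source="/mnt/stor$n",destination=/app/config \
    --name "stor$n" storjlabs/storagenode:latest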

Ports are forwarded on a MikroTik router with a public IP from my ISP. Bandwidth is 600/60 Mbps.

Now I ran the script from @Bivvo (Howto: storage node health check > discord + email alerting script) to see what has been going on since the nodes started, and I'm confused. My dashboards under Suspension & Audit look like this:

Node 1: Everything 100%
Node 2: Everything 100%
Node 3: Everything 100% except ap1.storj.io:7777 Online 98%
Node 4: Everything 100%
Node 5: ap1.storj.io:7777 Online 93.75%, us1.storj.io:7777 Online 95.83%, eu1.storj.io:7777 Online 97.5%
Node 6: Everything 100%
Node 7: Like node 5, ±1%
Node 8: Everything 100%

But from Bivvo's script, which I used to analyze the whole logs, I got:

[stor1] 
.. downloads (canceled: 2.85%, failed: 0.11%, audit: 97%) 

[stor2] 
.. downloads (canceled: 3.87%, failed: 0.10%, audit: 96%) 
.. uploads (canceled: 0.71%, failed: 1.91%, audit: 97%) 

[stor3] 
.. downloads (canceled: 3.04%, failed: 0.14%, audit: 97%) 

[stor4]
no errors

[stor5] 
.. downloads (canceled: 3.26%, failed: 0.17%, audit: 97%) 
.. uploads (canceled: 0.74%, failed: 2.54%, audit: 97%) 

[stor6] 
.. uploads (canceled: 0.75%, failed: 1.84%, audit: 97%) 

[stor7] 
.. downloads (canceled: 3.56%, failed: 0.08%, audit: 96%) 
.. uploads (canceled: 0.72%, failed: 2.02%, audit: 97%)

[stor8] 
.. downloads (canceled: 2.96%, failed: 0.09%, audit: 97%) 
.. uploads (canceled: 0.59%, failed: 2.36%, audit: 97%)

docker logs stor(1, 2, 3...) 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed

^ shows nothing

docker logs stor* 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep started -c & docker logs stor* 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep downloaded -c

^ shows the same amount of audits started and finished
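For completeness, the same started/finished comparison across all eight containers in one loop (a sketch based on the commands above):

for i in $(seq 1 8); do
    s=$(docker logs "stor$i" 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -c started)
    d=$(docker logs "stor$i" 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep -c downloaded)
    echo "stor$i: started=$s downloaded=$d"
done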

What am I missing?

Judging by the online score, I think your nodes sometimes don't respond to requests. I don't see anything wrong with your configs, so it may be a performance problem. You should check server load and network connection consistency.
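For example, something like this (a sketch; it assumes the mtr package is installed) shows per-hop packet loss and latency to one of the satellites over a few minutes:

mtr -rwbc 300 us1.storj.io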

Server load is fine: 4 cores and the load is around 5-10% all the time. Disk throughput is around 10-20 MB/s, so I have plenty of room. I don't know how to check the network; from my perspective it's fine, I have a ping running to Google and not a single packet is missing.

Those are not IOPS; what are the actual IOPS?
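Something like this (from the sysstat package) will show them; r/s and w/s are the read/write IOPS and %util shows how busy each disk is:

iostat -dx 5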

But it does not matter much: if the disks were overwhelmed, you would see the audit score dropping when the node drops or times out repair and/or audit requests.

A dropping online score means the satellite was unable to contact your node. You won't find evidence in the logs of something that did not happen.

The possible reasons to explore, in this order:

  • your DDNS solution. Some are flaky to varying degrees and occasionally respond with garbage or don't respond at all. I recommend Cloudflare.
  • Security software on the gateway (a variety of anti-DDoS or other Suricata-type products) may be blocking connections to your node. Turn it off. It's 100% useless unless you have a dedicated security team to maintain the rules.
  • Router software – maybe running out of open connections, or otherwise unhappy. (A quick external reachability check is sketched below.)
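For the reachability angle, a sketch you can run from a machine outside your network (a VPS, a phone hotspot, etc.); it assumes the OpenBSD variant of netcat, which accepts port ranges, and it only checks TCP, not QUIC/UDP:

nc -zv my.public.ip 28967-28974

Running it periodically (e.g. from cron) is more telling than a single pass, since the online score drops on intermittent failures.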

My original recommendation still stands. Start with one node and stabilize it. Then add more one by one. There is no reason to add 10 nodes at once – they all share traffic.


That 60 Mbps upstream is dropping your requests to upload pieces (downloads from the clients' side) and audits.
That's your problem.
When the test traffic was high, I had a limit of 150 Mbps that dropped my scores as well.
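To see whether that upstream is actually saturated during busy periods, something like this (a sketch; it assumes the vnstat package and that eth0 is the interface carrying node traffic) shows live throughput; watch the tx rate against the 60 Mbps cap:

vnstat -l -i eth0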


Good catch. I see two more problems here:


This (the SETUP step) must be executed only once for each identity and never again; otherwise you may destroy the node by making a mistake in the path.

I would suggest moving each identity to its disk with the data; they are useless without each other anyway, and there are far fewer chances of mistakes like pairing an identity with the wrong disk in a multinode setup.
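A sketch for one node (repeat per node, and verify the copy before removing anything from the old location):

docker stop -t 300 stor1
mkdir -p /mnt/stor1/identity
cp -a /home/patryk/Desktop/id/stor1/. /mnt/stor1/identity/
# then point the identity mount at the new location in the run command:
#   --mount type=bind,source="/mnt/stor1/identity",destination=/app/identity
docker rm stor1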

Perhaps the docker logs have already rotated: the local log driver keeps 5 files of 20 MiB each by default, and 100 MiB is not enough for historic research. I would recommend either redirecting the logs to a file (you would also want to configure a logrotate service) or increasing the number and size of the log files for docker.
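Two sketches of that, either keeping more docker log history per container or letting the node write its own log file onto the data disk via the log.output option in config.yaml:

# option 1: add to the docker run command
#   --log-driver local --log-opt max-size=100m --log-opt max-file=10

# option 2: set this in /mnt/stor1/config.yaml and restart the container,
# then rotate /mnt/stor1/node.log with logrotate:
#   log.output: "/app/config/node.log"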

I mean disk bandwidth, read/write.

OK, but why is saltlake.tardigrade.io:7777 at 100% online everywhere while the rest are random?

I don't use DDNS; I have a static IP from my network provider, not NAT'ed.
I don't use fancy firewall rules that could be blocking Storj, just standard DDoS and ICMP flood protection, but that doesn't affect Storj (I have a full view in MikroTik of what gets blocked, and since the nodes started nothing has landed on the block lists). And I am that "security team" :smile:. The router is totally fine, around 15-20% load during a normal work day.