Red Audit on dashboard -- Why?

I am a bit confused. My server has been running 24/7 for the last 2 months with no outages, but my one Suspension & Audit panel shows 87.5%. This is the first time this has happened. What does it mean, will it go back to 100%, how long will that take, and what adverse effect does it have on my node?

Can you check the process/Docker container to see how long it has been running? It could also be that your ISP had a temporary hiccup at some point. But getting down to 87.5% over the last 30 days isn't likely from that alone.
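If the node runs in Docker, a quick way to check the container's uptime is something like this (a sketch assuming the default container name storagenode):

```bash
# Show how long the storagenode container has been running
# (assumes the default container name "storagenode")
docker ps --filter "name=storagenode" --format "{{.Names}}: {{.Status}}"
```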

Do you monitor your CPU/RAM/disk usage? Perhaps the server is overloaded and sometimes cannot serve requests.
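A few generic Linux checks are enough to rule this out (nothing Storj-specific; replace the storage path with your own):

```bash
free -h                 # RAM and swap usage
df -h /mnt/storj        # free space on the storage volume (path is a placeholder)
iostat -x 5 3           # per-disk I/O load, if the sysstat package is installed
top -bn1 | head -n 20   # one-shot CPU load snapshot
```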

One example reason that was already reported here on the forums was when the storage node operator’s router had some “smart firewall” set up that suddenly decided to block connections for one specific satellite.

us1 and eu1 are the two satellites that provide most income to storage node operators, and so either of them having any hiccup like that is a worrisome observation.

Not that. With 256 GB of RAM and 24 processors on a Dell server, it's not likely a memory or processor issue.

No smart router is causing this. Everything is working 100% right now, and if it were a firewall issue it would still be blocking traffic rather than allowing it. Any other ideas? And how do you get back to 100%?

Check the logs. Stop the node, then check the drives, memory, and databases for errors. Disable DDoS protection in the router, and maybe on the server, if there is any.
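For the database part: the node keeps its metadata in SQLite files inside the storage directory, so with the node stopped you can run SQLite's built-in integrity check over them. A minimal sketch, assuming sqlite3 is installed and /mnt/storj/storage is your storage path:

```bash
# Run an integrity check on every node database (stop the node first).
# /mnt/storj/storage is an assumed path -- substitute your own storage location.
for db in /mnt/storj/storage/*.db; do
  echo "== $db =="
  sqlite3 "$db" "PRAGMA integrity_check;"
done
```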

I would like to point out that what's red is not "Audit" but your "Online" score (with an Audit score that low, you'd already be disqualified on this satellite).
It means this satellite could not contact your node for 12.5% of the time over the past 30 days.

If the issue is resolved, the score will likely stay like this for a while and then recover eventually, as it is the average of a 30-day moving window.

If not, it will keep getting worse until your node gets suspended (and disqualified later down the line if the issue isn’t solved).
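To put the current number in perspective: an 87.5% online score over a 30-day window corresponds to roughly 3.75 days' worth of failed contact attempts. A back-of-the-envelope check (not the satellite's exact formula):

```bash
# 1 - 0.875 = 0.125 missing fraction; 0.125 * 30 days ≈ 3.75 days of missed contacts
echo "scale=3; (1 - 0.875) * 30" | bc
```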

Some possible reasons were cited above. We’ve seen unreliable DynDNS services as well lately.

Have a look at your node's logs to begin with and search for warnings/errors. It never hurts to check… If nothing suspicious stands out (except for connectivity issues), you may have to check each network element between your node and the satellite: OS config & behavior, router options, DynDNS service, ISP, …
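For a Docker setup, a quick way to pull out just the warnings and errors is something like this (assuming the default container name storagenode and that logs haven't been redirected to a file):

```bash
# Show the 50 most recent warnings/errors from the node's log
# (storagenode writes to stderr, hence the 2>&1)
docker logs storagenode 2>&1 | grep -E "ERROR|WARN" | tail -n 50
```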


Although everything everyone has said already is good and it's definitely worth checking everything out, in my experience it sometimes just happens for seemingly no reason. I have had the same issue at times with no identifiable cause. I run enterprise equipment without any noticeable bottlenecks other than the typical hard-drive I/O, which there's not much you can do about, and I still see this sometimes.

Although it seems totally random across all my nodes from time to time, I have noticed it seems to be more common, and with a larger percentage drop, for younger/smaller nodes. When I notice a drop like this on one node, I often check the others and they're typically fine, sitting at 100%. If I do actually have legitimate downtime, the larger nodes will drop fairly accurately relative to the downtime, but the smaller ones either don't notice the downtime at all or the score drops much more than it should. I've seen as low as 30% for only an hour or two offline with a 1-month-old node. I just assume nodes with more data are checked more often. I've even run multiple nodes on the same drives, both single drives and ZFS pools, and again, one node can show downtime while the others on the same drive do not.

So… I've basically just come to the conclusion that it happens sometimes for reasons outside our control, and it tends to be much more noticeable on small/young nodes. Of course it's still good to check on things to make sure, but if all looks good and the score doesn't continue to drop, you're probably fine.


OK, so just when I thought it could not get worse: my power was out for 24 hours and just came back on. I had a battery backup on the server, but it lasted about 2 hours and then gave out. I hope I can recover from this. Can I ask what happens if your node gets disqualified? Can you start all over again, or is your IP address blacklisted?

No, but your node's identity is. So you have to start from scratch with a fresh node, without data and with a new identity.
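For reference, a new identity is generated with the identity CLI from the Storj documentation; a rough sketch (the email:token value is a placeholder for the authorization token you request from Storj, and the exact steps may differ by version):

```bash
# Create a brand-new node identity (this can take a long time due to proof-of-work)
identity create storagenode

# Authorize it with your single-use token (placeholder shown)
identity authorize storagenode your-email@example.com:your-token
```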


C'mon, you will not be disqualified just because of that or a random 24 hours of offline time; it takes quite a bit more than that. First, if you have DDNS from changeip.com, change it ASAP to a better one, like Dynu.com (4 free IPs), cloudns.net (1 free IP), or even noip.com (3 free IPs, but that one needs a click from an email every month to renew). I bet it's a DDNS problem, like mine; it went away as soon as I changed provider, and my online stats were recovering day by day. They're at 99.5% now.

I am now getting an error on my server that says it is misconfigured.

I checked my router because the node says port 28967 is not forwarded, but the port in fact IS forwarded. I am using No-IP for my DDNS service. This all happened after I restarted the server. Not sure what's happening here.

QUIC is checked when you start your node. If UDP isn't being forwarded properly, you will get that misconfigured message. Before panicking, I would try restarting your node and see if it clears up. It's possible that when you restarted, the dynamic DNS wasn't yet pointing to the correct IP, and so the check reported that QUIC was misconfigured.
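One quick way to spot a lagging DDNS record is to compare what the hostname resolves to against your actual public IP (yournode.ddns.example is a placeholder, and api.ipify.org is just one of many "what is my IP" services):

```bash
# What the DDNS hostname currently resolves to
dig +short yournode.ddns.example

# What your public IP actually is right now
curl -s https://api.ipify.org; echo
```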

You won't get disqualified unless your node is down or not responding for about 12 days. This is different from suspension. If your node goes offline and the online score drops below 60%, it will stop receiving data, but assuming everything is working properly again, the score will slowly rise above 60% and back to 100% over the 30-day moving window.

I did notice that my node's bandwidth usage has slowed to almost nothing lately. Is this a result of the red scores? The disk space used has not gone up very much either.

If they are below 60%, you will be suspended from receiving new data until they rise above that threshold. You would still be able to deliver egress, though. You can always check your logs to see if there are errors that would prevent you from getting data.

I am getting some, but not like I was before this happened.

I would recommend checking your logs to see if there are any errors or issues.

How can I check the logs? I am still quite new to this :slight_smile: Thank you.

That will give you different ways depending on your platform.
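If your node runs in Docker, a minimal sketch (default container name storagenode assumed) is:

```bash
# Last 20 log entries, then recent errors only
docker logs --tail 20 storagenode
docker logs storagenode 2>&1 | grep ERROR | tail -n 20
```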

Seems like there was only 1 error in the last 20 entries, the one below. I edited it to keep some info safe:

2023-03-10T18:52:24.198Z ERROR contact:service ping satellite failed {“Process”: “storagenode”, “Satellite ID”: “xxx”, “attempts”: 6, “error”: “ping satellite: failed to ping storage node, your node indicated error code: 0, manager closed: read tcp AN IP I DONT KNOW:37672->MY IP:28967: read: connection reset by peer”, “errorVerbose”: “ping satellite: failed to ping storage node, your node indicated error code: 0, manager closed: read tcp Unknown IPP:37672->MYIP:28967: read: connection reset by peer\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:147\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:101\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}

Hope this helps…