Sudden drop in Egress and Audit

My new node (45 days old) was running smoothly. Then, a couple of days ago, the egress started dropping, and now the audit score is dropping as well.

I tried updating the node - no improvement
I tried restarting the node - nothing either

Ingress is good but egress is just poor, and I don't know how to fix this. The audit score is what's worrying me; it is at 99.5 now.

PS: I have had zero downtime so far.

Need help and suggestions.

While the egress does appear to have slumped, it looks to me more likely to be a change in customer behaviour rather than an issue with your node. One of my own nodes of a similar age (135GB stored) has seen 12GB of egress this month so far, close enough to your egress. That earlier peak might simply have been a customer downloading a big file. Also, without knowing the actual size of the node, it's hard to get a ballpark figure of how much egress you should have seen. There are lots of other factors that affect egress: node size, node location, customer location, node performance, customer behaviour, etc.

For the audit score, just to make sure, are you seeing the 99.5% in the “Online” section or the “Audit” section of the satellites?

If it is a 99.5% online score, it is fine. While you say you have had zero downtime up to now, it is possible that an audit was missed during a reboot (I can see the node uptime is 43h). As the node is young, a single missed audit can have a moderate impact on the online score. A 99.5% online score is perfectly fine (100% would be better of course; it only has to stay above 60% to avoid suspension).

If the 99.5% is the “Audit” score, then that can be an actual issue: going below 98% can result in a disqualification (DQ). To check why these audits have failed, please search the logs. Any failed audit should appear there along with the reason it failed (e.g. the node did not find a file, did not respond in time, etc.). If that is the case and the node is failing data audits, it is very important to figure out what is going wrong and fix it before it's too late.
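If you would rather do the search programmatically than by eye, here is a minimal sketch (Python) of what that scan could look like. It assumes the node writes a plain-text log file and that failed audits show up as lines containing both GET_AUDIT and “failed”; the path is only a placeholder for wherever your log actually lives.

```python
# Minimal sketch: scan a storagenode log for failed audit lines and tally
# the reasons. The path and the exact log wording are assumptions; adjust
# them to match your own setup.
from collections import Counter
from pathlib import Path

LOG_PATH = Path("storagenode.log")  # placeholder: point this at your node's log

reasons = Counter()
with LOG_PATH.open(encoding="utf-8", errors="replace") as log:
    for line in log:
        # A failed audit is expected to mention GET_AUDIT and "failed"
        # on the same line (per the advice above).
        if "GET_AUDIT" in line and "failed" in line:
            # Keep the tail of the line, which usually carries the reason
            # (e.g. file does not exist, timeout, etc.).
            reasons[line.strip()[-120:]] += 1

for reason, count in reasons.most_common(10):
    print(count, reason)
```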

2 Likes

A 99.5 audit score isn't an issue on its own; it may be a blip with your ISP. If it continues to go down, let us know. Otherwise it should go back up after a while.

Thanks for the reply. My online score is 100 and the audit score is 99.5. Yes, I restarted the node to see if it would fix the audit issue. I'll go through the logs and see if anything looks off.

Sure. I will go through the logs and monitor whether it dips further. Thanks.

Found something in the log. The log file is about 1.8GB, which I don't know is normal. But I found a lot of “race lost or node shutdown” errors while uploading.

One more thing: the node now restarts on its own frequently.

While lost races are normal and fine, having that many can also indicate an underlying issue or bottleneck.

Could you please provide more information about your current setup?

In addition, search the logs for other errors that may relate to the audits. If you see an “ERROR” on upload/download and the log shows “unexpected EOF”, that is also (as far as I know) fine and common. We are looking for any other log entries that may indicate why your node is restarting, or why the audits are failing.
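For example, a rough triage script along these lines could surface the unexpected errors while skipping the benign “unexpected EOF” noise. This is only a sketch under the assumptions above: the path is a placeholder and the exact wording of your log lines may differ.

```python
# Rough triage sketch: list ERROR lines, skip the benign "unexpected EOF"
# upload/download noise, and tally what remains so unexpected problems
# (restarts, failed audits) stand out. The path is a placeholder.
from collections import Counter
from pathlib import Path

LOG_PATH = Path("storagenode.log")  # placeholder

BENIGN = ("unexpected EOF",)  # treated as harmless per the post above

summary = Counter()
with LOG_PATH.open(encoding="utf-8", errors="replace") as log:
    for line in log:
        if "ERROR" not in line:
            continue
        if any(marker in line for marker in BENIGN):
            continue
        # Group similar errors by the first ~100 characters after "ERROR"
        # so the tally stays readable.
        key = line.split("ERROR", 1)[1].strip()[:100]
        summary[key] += 1

for key, count in summary.most_common(20):
    print(f"{count:6d}  {key}")
```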

I'll check more of the logs. My setup is:

I have 3 nodes running on this PC. This node is the oldest (45 days), and the other 2 nodes are 30 days old. Those 2 nodes don't have any issues; they both have very low ingress and egress, but they are consistent and their audit score is 100.

PC spec

Ryzen 3900X
64GB RAM
OS installed on a 2TB NVMe
3x 16TB HDDs

The network filters nodes by subnet, so having 3 nodes won't gain you more data than having a single node. If you have multiple drives, it is typically best to wait until one is close to full before starting another node.

That said, if these three nodes are working against the same drive you may be having issues with the I/O keeping up with all the activity the nodes generate.

2 Likes

The three nodes have separate HDDs. But I did notice some I/O errors for that same HDD in Event Viewer. I have restarted the entire PC now and all the nodes are back to their best: ingress is healthy and so is egress. I expect the audit score to be back to 100 soon. Let's wait and see.

2 Likes

If you have not already, it might be a good idea to run a S.M.A.R.T. test to get an overview of the health of your drive. If the drive is failing, you can always migrate to a new drive before it dies, so you don't lose the progress you already have.
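For reference, if smartmontools is installed, that health check can be wrapped in a couple of lines of Python. The device name below is only a placeholder; pick the drive that actually hosts the node data.

```python
# Hedged example: ask smartmontools for the drive's overall health verdict.
# smartctl must be installed separately, and device naming differs between
# operating systems, so the device below is only a placeholder.
import subprocess

DEVICE = "/dev/sda"  # placeholder: use the drive that hosts the node data

result = subprocess.run(
    ["smartctl", "-H", DEVICE],  # -H prints the overall health assessment
    capture_output=True,
    text=True,
    check=False,
)
print(result.stdout or result.stderr)
```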

3 Likes

With all respect, a fallen audit score is an issue, no matter how small. It means that your node has failed an audit, either because a piece is missing or because it is corrupted. This will not heal itself; the satellite may audit that piece over and over again until it is deleted from your node by the customer.
So, this is an issue.
@Mjdaran
You must figure out what's going on by analyzing your logs:

You need to check your logs for failed GET_AUDIT and GET_REPAIR requests; failed ordinary uploads/downloads don't matter for your node's reputation.
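A small sketch of that distinction, assuming the log records the action in quotes (e.g. "GET_AUDIT") and marks failures with the word “failed”: it tallies failed operations per action so you can see at a glance whether the reputation-relevant ones are affected.

```python
# Sketch: count failed operations per action type. Only failed GET_AUDIT and
# GET_REPAIR matter for the audit reputation; failed ordinary uploads and
# downloads can be ignored here. The path and the quoted-action assumption
# are placeholders for your log format.
from collections import Counter
from pathlib import Path

LOG_PATH = Path("storagenode.log")  # placeholder

ACTIONS = ("GET_AUDIT", "GET_REPAIR", "GET", "PUT")
failed = Counter()

with LOG_PATH.open(encoding="utf-8", errors="replace") as log:
    for line in log:
        if "failed" not in line:
            continue
        for action in ACTIONS:
            # Quoting the action avoids counting GET_AUDIT lines as plain GET.
            if f'"{action}"' in line:
                failed[action] += 1
                break

for action in ACTIONS:
    print(f"{action}: {failed[action]} failed")
```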

@Mjdaran it's also a breach of the Supplier Terms & Conditions:
You must not run more than one node on a single drive.

Fair enough. In one thread we are told that, in some cases, the occasional failed audit due to a caching failure is fine; in another, that it is not fine to fail any audit. We need to be consistent in our messaging. If fsync is disabled and operators are occasionally going to fail an audit when their node resets unexpectedly, do we still think any failed audit is a problem that needs investigating?

It's an honest question, because with fsync disabled we are now going to get plenty of SNOs who are likely to see a 99.5% audit score.

1 Like

I'm sure that any failed audit is a problem. fsync doesn't change much if the hardware is unstable.

1 Like

I think I have messed up all the nodes now. At first I noticed low egress on only one node (K:/). I checked many logs and thought the problem was the 100% active time on that particular drive, which would explain the overload and the audit issue.

So I tried changing the USB port for that particular node and restarted the PC. Unfortunately, the other 2 nodes wouldn't start because of an older version, so I had to manually copy the updated .exe file from the working node and replace it. Now all the nodes are working, but all of them are running at 100% active time. I must have done something wrong.

I thought my PC could handle all those nodes; with a Ryzen 9, I once ran more than 30 HDDs for Chia.

The limitation is usually not the CPU, but the disk controller, the disks themselves, or the filesystem in use.
USB connections are known to be slow, so I would expect that your services could stop with readability or writability timeouts.

The 100% disk load is expected, since nodes run a used-space filewalker on start; there can also be a trash cleanup filewalker and, if your nodes are “lucky”, a garbage collection run at the same time. And there is the load from the customers on top of that.

1 Like

Then, I guess I have to wait it out and see. Hope the node lasts longer.

1 Like

If it fails with either of those timeouts, you may try to fix it; see:

1 Like