Losing audit score on EU1 and EU-North

So this node has been running fine until now. I have ruled out HDD failure, since the drives are in RAID. I noticed the problem this morning and have no idea what might be causing it. I’m including some ERROR lines from the log; some of you might be able to spot what is wrong. I’ve had no luck so far.

Any help will be much appreciated

2021-09-05T11:08:56.195+0200 WARN contact:service Your node is still considered to be online but encountered an error. {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Error": "contact: failed to dial storage node (ID: 1snKVGtVgNaKzR2mZfReMUr4bep9U6eH947rLwbFzsk3u2y8mV) at address 185.140.—.–:28967 using QUIC: rpc: quic: timeout: no recent network activity"}

2021-09-05T11:08:50.911+0200 ERROR collector unable to delete piece {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "K5WD2PCUDWMBIAJE6BBOTC52HDZDRKEPJYBW34ZBRANXJIMUQZ3Q", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storage/filestore.(*blobStore).Stat:103\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:239\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Delete:220\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:299\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:97\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

This is nothing to worry about. If you forward the UDP port to the node (the same port as TCP), it will no longer give this warning message.
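
On Windows, the local-firewall half of that can be done with the built-in netsh tool. Here is a minimal sketch via Python’s subprocess, assuming the node listens on 28967 as in the log above; forwarding the UDP port on the router/NAT still has to be configured separately.

```python
# Allow inbound UDP on the node port in Windows Firewall (run elevated).
# Port 28967 is an assumption, matching the TCP port in the log above.
import subprocess

subprocess.run(
    ["netsh", "advfirewall", "firewall", "add", "rule",
     "name=storagenode-udp", "dir=in", "action=allow",
     "protocol=UDP", "localport=28967"],
    check=True,
)
```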

This means (self-evidently) that the file flagged for deletion doesn’t exist, or at least can’t be read.

If you don’t have any error messages related to audits, it’s possible your node isn’t responding to audit requests in time. Do you see GET_AUDIT “download started” messages with corresponding completion messages? You could count both to see whether they match.
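
Something like this could do the counting. A minimal sketch, assuming the node logs to a text file at the path shown (adjust to your setup) and that the usual storagenode message wording applies:

```python
# Count GET_AUDIT "download started" vs "downloaded" lines in a storagenode log.
# The log path and the exact message strings are assumptions based on the
# usual storagenode log format -- adjust them to match your own log.
from collections import Counter

LOG_PATH = r"C:\Program Files\Storj\Storage Node\storagenode.log"  # assumed location

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "GET_AUDIT" not in line:
            continue
        if "download started" in line:
            counts["started"] += 1
        elif "downloaded" in line:
            counts["completed"] += 1
        elif "download failed" in line:
            counts["failed"] += 1

# On a healthy node, started and completed should roughly match.
print(counts)
```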

I would still recommend a file system check. Just because the disks are in RAID doesn’t necessarily mean the file system is free of errors.

Yeah, to be honest I’m starting to suspect that again.
So I’m going to have to repair it; hopefully the downtime will only be about an hour.
The audit score is improving massively, though.

Hopefully it isn’t something fatal

Thanks again

OK, so something else weird is happening.
The Storj node seems to just get stuck and stops communicating, and since it doesn’t actually stop, Windows is unable to restart the service. I’m currently searching for a solution.
As for the audit score, all seems to be good and well now.

A small update: the issue has been found. I run my node in a 1U Dell R210 server.
I purposely chose an L-class CPU to decrease the power draw and, more importantly, the heat output.
After some investigation, the server seems to be overheating, which is weird, as I’ve been running it the same way for a month and there were no issues before.

On the subject of overheating: I sleep in the same room as my node, so during the night I have the fans turned off. It seems that if I don’t cool the server down once a day, it will overheat. I’m currently setting up the scheduler to enable the fans while I’m out of the house, so that they are quiet when I come back.
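
For what it’s worth, here is a rough sketch of what such a scheduled task could run, assuming the R210’s BMC/iDRAC is reachable over the network with ipmitool and that the community-documented Dell raw fan commands (0x30 0x30 …) work on this model. The address, credentials, and raw commands are all assumptions, so test carefully:

```python
# Toggle the server fans on a schedule via ipmitool (must be installed).
# The iDRAC address/credentials and the Dell raw commands below are
# assumptions -- verify they apply to your BMC before relying on them.
import subprocess

IPMI = ["ipmitool", "-I", "lanplus", "-H", "192.168.1.120",  # assumed iDRAC address
        "-U", "root", "-P", "calvin"]                        # assumed credentials

def set_manual_fan_speed(percent: int) -> None:
    """Take manual fan control, then set a fixed duty cycle."""
    subprocess.run(IPMI + ["raw", "0x30", "0x30", "0x01", "0x00"], check=True)
    subprocess.run(IPMI + ["raw", "0x30", "0x30", "0x02", "0xff",
                           f"0x{percent:02x}"], check=True)

def restore_automatic_control() -> None:
    """Hand fan control back to the BMC."""
    subprocess.run(IPMI + ["raw", "0x30", "0x30", "0x01", "0x01"], check=True)

if __name__ == "__main__":
    set_manual_fan_speed(20)  # ~20% duty cycle; schedule the restore separately
```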

Can’t you just set them to an inaudible speed? Off seems extreme and unwise.

Well, it is a 1U server, so not the quietest, but I’m currently acquiring modded Noctua fans for it, which will be able to run quietly.

This is turning into some sort of nightmare.
I woke up and the storage node was offline.
I checked the logs, and the last entry was from around 5 hours ago with no error; I scanned the entire log file for fatal errors and found none.
So my storage node outright halted and didn’t restart, for no apparent reason. This time there was no overheating, the disk array had been repaired, and I had even reinstalled the storage node software itself.
I genuinely don’t know what to do next.
BTW, I’m currently waiting for a new HDD to migrate this node onto, but it will take a few weeks to get here.

Yeah, those are annoying issues. Maybe check Event Viewer to see if something weird is going on at the time the node stops.
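
To speed that check up, something like this could pull recent System-log errors with the built-in wevtutil tool. A minimal sketch; the 24-hour window and the System log are assumptions, so adjust them to the time the node actually stopped:

```python
# Query the Windows System event log for recent critical/error events.
# 86400000 ms = 24 hours; widen or narrow the window as needed.
import subprocess

query = ("*[System[(Level=1 or Level=2) and "
         "TimeCreated[timediff(@SystemTime) <= 86400000]]]")
result = subprocess.run(
    ["wevtutil", "qe", "System", f"/q:{query}", "/f:text", "/c:20"],
    capture_output=True,
    text=True,
)
print(result.stdout)
```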

Did you manage to bring it back online?

A node halting like that suggests hardware problems. It looks like the repeated overheating did not go unnoticed after all.

I need to check the Windows logs, but even after all the fixes it keeps halting. For the past few hours I have been running the fans. One thing I noticed is that the RAM usage grows over time, up to about 1.5 GB. I’m looking into an option to have the node restart every 3-4 hours, because after a restart it runs fine for about 5-6 hours and then I keep losing the connection until it just halts.
I’ll see how it goes today and will let you know.
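
To confirm whether the memory really does keep growing, a small logger could sample the process over time. A minimal sketch using psutil (pip install psutil); the process name storagenode.exe is an assumption, so check Task Manager for the real one:

```python
# Periodically log the storagenode process's resident memory use.
# Requires: pip install psutil. The process name is an assumption.
import time
import psutil

def storagenode_rss_mb():
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "storagenode.exe":
            return proc.info["memory_info"].rss / 1024**2
    return None  # process not running

while True:
    stamp = time.strftime("%H:%M:%S")
    rss = storagenode_rss_mb()
    if rss is None:
        print(f"{stamp} storagenode not running")
    else:
        print(f"{stamp} RSS: {rss:.0f} MiB")
    time.sleep(300)  # sample every 5 minutes
```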

If you keep turning off the fans, soon it won’t be only the storagenode that halts.
And I’m afraid the damage may already have been done by the previous overheating…

This is a bad idea. Every time you restart the node, it starts the file walker over again. Doing this would have it constantly thrashing the storage.

This is usually a sign of IO not being able to keep up. If you’re not using an SMR drive and you’ve had file system issues before, it might be that your HDD is on its last legs (a quick SMART check, sketched after this post, can tell you more). If the case didn’t have any cooling, this could easily have contributed to the HDD failing. While HDDs don’t need much cooling themselves, they are built to be used in a case that gets rid of excess heat from other components. If everything is cooking in each other’s heat for long periods of time… well, that’s not good at all. Even having a single exhaust fan on the lowest speed can usually prevent this, but having no fans spinning at all is a recipe for trouble.

Also keep in mind that heat sinks made to be used with fans have much tighter heat sink fin spacing, because they assume a fan will be pushing air through anyway. This spacing however blocks most convection cooling, because that needs more space between fins to have any significant effect. This is why passive coolers tend to have thicker fins and wider spacing.
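
For the SMART check mentioned above, here is a minimal sketch shelling out to smartmontools’ smartctl (installed separately); the device path is an assumption and differs per OS:

```python
# Quick SMART health + attribute readout for the suspect drive.
# Requires smartmontools. The device path below is an assumption --
# run `smartctl --scan` to list the devices on your system.
import subprocess

result = subprocess.run(
    ["smartctl", "-H", "-A", "/dev/sda"],  # assumed device path
    capture_output=True,
    text=True,
)
# Watch Reallocated_Sector_Ct and Current_Pending_Sector in particular.
print(result.stdout)
```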

Yeah, it’s interesting, because the HDDs I have are meant to work at temperatures between 5-50 °C, at least that’s what the specs say. So I have a warning set up in case an HDD overheats. As for the CPU: with the fans off and the storagenode running, it has never thermally throttled.

I’m currently looking into all possibilities. I put a rush order on a new HDD just in case; the node is full as it is, so I want it to be able to grow.
Worst case, I have a backup machine I can migrate over to.

Good news: the issue has been identified. One of the HDDs seems to be on its way out. After looking at the S/N, I found out it was one of the oldest drives I have, with a few years of uptime on it. The other one is fine. The good part is that they are in RAID 1, so the data is safe. I will pull the bad drive and remove it from the array. I remember this happening in the past with one of my other HDDs in a Windows software RAID: when one drive slows down, the entire array does too.
It took me a while to identify because most of my systems run hardware RAID on dedicated PCIe cards, so this was unusual behavior.
Thanks for the quick identification of the dying HDD @BrightSilence
