Node shows offline periodically

edstewbob · September 25, 2020, 5:07am

I’ve been running a node for about 3 days on a dedicated PC with a dedicated SATA hard drive, and on several occasions the node has shown offline for a few hours. The port is open, and I have been monitoring it with an external service that shows it stays open the whole time. The only errors in the log are many errors for satellite ID 118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW. Several other satellites access the node successfully, and nothing else indicates why the node periodically shows offline. Then, for no obvious reason, the node shows online again, and network bandwidth and disk space usage continue to grow. Is there any reason for this behavior, and should I be concerned about it? This is my first node, and I was hoping someone here with more experience would have some insight into what is going on, or whether there is something else I need to check.

SGC · September 25, 2020, 5:44am

The satellite contact errors/timeouts in the logs are irrelevant. That satellite is down for some maintenance, so they can be safely ignored.

I would start by running a few continuous pings toward the router, or toward the storage node and the router from a second computer, just to see whether you have some sort of network issue.

You could do the same for Google, but people don’t really like it if you fire off too many pings too fast. Then again, if you have been monitoring your uptime with an external service, that would also confirm the internet connection is working 24/7.
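As a lighter-weight alternative to endless pings, here is a minimal sketch that checks whether the node's TCP port accepts connections and prints a line only when the state flips, so intermittent drops get a timestamp. It assumes you run it from a second computer as suggested above; the host name `node.example.com` is hypothetical, and port 28967 is assumed to be the node's port (the usual Storj default). A successful TCP connect only shows the port is reachable, not that the satellites consider the node online.

```python
import socket
import time
from datetime import datetime
from typing import List, Optional, Tuple


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure, ...
        return False


def monitor(host: str, port: int, interval: float = 60.0,
            checks: Optional[int] = None) -> List[Tuple[str, bool]]:
    """Poll host:port every `interval` seconds and print a line only when the
    state flips between ONLINE and OFFLINE. Runs forever if `checks` is None."""
    transitions: List[Tuple[str, bool]] = []
    last_state: Optional[bool] = None
    done = 0
    while checks is None or done < checks:
        state = port_is_open(host, port)
        if state != last_state:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host}:{port} {'ONLINE' if state else 'OFFLINE'}")
            transitions.append((stamp, state))
            last_state = state
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)
    return transitions


# Example (hypothetical host; 28967 assumed to be the node's port):
# monitor("node.example.com", 28967, interval=60)
```

Logging only transitions keeps the output short enough to compare against the times the dashboard showed offline.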

You could have some sort of power management issue; maybe the disk spins down. In the past I don’t think that would take the node offline, but there are now fairly new safeguards that keep the node from destroying its reputation with the satellites if it loses contact with its storage.

It’s also quite possible that a brand-new node isn’t seeing enough traffic at the moment to keep power management from kicking in, at least some of the time, and intermittent errors like that are always hell to track down.

So yeah, my money is on power management: disk, computer (PSU) maybe, CPU (doubt it, though), network (not bloody likely), and of course USB is also very likely to go into power save, so if you are using external HDDs or similar, that can easily cause a lot of grief.

Also keep in mind that the HDD has its own internal power management/power save features, and the OS most likely has additional power saving features it may try to apply to the HDD, which can cause a wide range of issues.

But the HDD’s internal power management shouldn’t take the node offline.

OS-level HDD power saving, full system sleep, and the like, however, can take the node offline, and network (Wi-Fi) power management can shut down the network while it thinks the connection is idle.

So yeah: POWER MANAGEMENT OFF across the board, then go from there. I don’t really have any better ideas right now.

nerdatwork · September 25, 2020, 6:10am

Welcome to the forum @edstewbob!

Are you using Windows or Linux? When you check the dashboard, is it with the CLI or the web dashboard?

edstewbob · September 25, 2020, 6:57am

Windows with the web dashboard.

nerdatwork · September 25, 2020, 6:58am

Which browser are you using to view the web dashboard? Also, is it Docker or the GUI?

edstewbob · September 25, 2020, 7:00am

Thanks, I did see that HDD power management was turned on, so I disabled it. Sleep mode was already off, and I turned off every other power setting that looked like it could change system performance or reduce power.

edstewbob · September 25, 2020, 7:01am

I’m using the Chrome browser. It’s the GUI install, with no Docker installed.

nerdatwork · September 25, 2020, 7:02am

Then it’s most probably the HDD being turned off, as you said earlier. Keep an eye on it for the next few hours.

edstewbob · September 25, 2020, 7:03am

So in general it should never show offline if everything is functioning normally?

nerdatwork · September 25, 2020, 7:12am

Yes, it should be online 24/7, and the dashboard should show it as online, along with uptime stats and other information.

SGC · September 25, 2020, 7:16am

Yeah, uptime should count from the last time you had to reboot the host or the storagenode software was updated,
which is usually anywhere from 1 week to 1 month, though usually toward the lower end: about 2 weeks on average, maybe a little less.

Of course, when starting a new node, uptime is the challenge you face.
There will always be issues: power cables getting unplugged, computer stability and configuration problems, SNO configuration, testing, and improvements.

It’s easy to go offline.

baker · September 25, 2020, 12:53pm

Does it show offline if you leave the dashboard page open for a while? When you refresh the page, does it then go online? I have noticed that the web GUI will show the node offline if I leave the webpage open and come back to it hours later. A refresh fixes the status. Perhaps this is what you are noticing?

edstewbob · September 25, 2020, 1:25pm

Yes, I leave the browser page up all the time. It doesn’t auto-refresh, so it only changes if I click the refresh button on the top row next to STORJ. It has remained online since I made the change to prevent the disk from powering down due to inactivity. I actually just set it to power down only after 9999 minutes of inactivity, as there wasn’t a way to remove it entirely. I’ll keep an eye on it. Thank you for all the quick and helpful replies.

naxbc · September 25, 2020, 5:07pm

Exactly what happened to me. Check the NIC power management: Windows was shutting the NIC off to save power. When I disabled it, the problems stopped. It has now been running for over 80 hours straight, whereas before it wouldn’t stay up for more than 4 hours.

edstewbob · September 25, 2020, 5:54pm

Thanks @naxbc, I hadn’t checked that one, as it’s not in the Windows power settings area. I had to go into the NIC properties and turn off power management, plus a couple of other settings there that looked like they might affect the NIC’s power. My mini PC with a 2TB SATA hard drive only uses 7 watts total, so power consumption isn’t something I’m concerned about.

naxbc · September 26, 2020, 12:06am

Glad to help, man. Hope that solves it.

edstewbob · September 26, 2020, 1:33am

It shows offline again for some strange reason, even after changing everything recommended. I’m at a loss as to what else could be causing this.

edstewbob · September 26, 2020, 3:23am

The tail end of the log showed this when the Offline status appeared. I’m not sure why this satellite shows errors so frequently throughout the log, or whether there is a way to remove it so the errors stop occurring.

nerdatwork · September 26, 2020, 5:20am

That satellite has shut down so you can ignore those errors.

What does it show above those lines?
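Rather than deleting the shut-down satellite's lines by hand, a small filter can hide them and show whatever errors remain. This is a minimal sketch: it only changes what you look at and does not stop the node from logging those errors. It assumes the log level is the second whitespace-separated field, as in the lines pasted in this thread; the file name `storagenode.log` is hypothetical.

```python
from typing import Iterable, Iterator, List

# The satellite this thread says has been shut down (ID copied from the log).
RETIRED_SATELLITES = frozenset({"118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW"})


def without_retired(lines: Iterable[str],
                    retired: Iterable[str] = RETIRED_SATELLITES) -> Iterator[str]:
    """Yield only the log lines that do not mention a retired satellite ID."""
    ids = list(retired)
    for line in lines:
        if not any(sat in line for sat in ids):
            yield line


def remaining_errors(lines: Iterable[str],
                     retired: Iterable[str] = RETIRED_SATELLITES) -> List[str]:
    """Return the ERROR lines left after dropping the retired satellite's noise.
    Assumes the level is the second whitespace-separated field of each line."""
    errors = []
    for line in without_retired(lines, retired):
        fields = line.split()
        if len(fields) > 1 and fields[1] == "ERROR":
            errors.append(line)
    return errors


# Example (hypothetical log path):
# with open("storagenode.log", encoding="utf-8") as f:
#     for line in remaining_errors(f):
#         print(line, end="")
```

If this prints nothing around the time the dashboard showed offline, the log itself has no other errors to explain it.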

edstewbob · September 26, 2020, 5:52am

This is some of the log above those lines, after deleting the lines for the erroring satellite.

2020-09-25T18:59:19.612-0400 INFO piecestore downloaded {“Piece ID”: “5KAU2Q36L5F7WSSACXQ772JOOTPCEHICD24RDDYKPT44AICGWODQ”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “GET”}
2020-09-25T18:59:38.892-0400 INFO piecestore upload started {“Piece ID”: “OLFBCSINEC5IFHNZASFTK5SDCQQEYW6E3M62XEYGIYPQIHXP6MFQ”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “PUT”, “Available Space”: 1692405123072}
2020-09-25T18:59:39.322-0400 INFO piecestore uploaded {“Piece ID”: “OLFBCSINEC5IFHNZASFTK5SDCQQEYW6E3M62XEYGIYPQIHXP6MFQ”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “PUT”}
2020-09-25T18:59:41.313-0400 INFO piecestore download started {“Piece ID”: “OLFBCSINEC5IFHNZASFTK5SDCQQEYW6E3M62XEYGIYPQIHXP6MFQ”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “GET”}
2020-09-25T18:59:41.695-0400 INFO piecestore downloaded {“Piece ID”: “OLFBCSINEC5IFHNZASFTK5SDCQQEYW6E3M62XEYGIYPQIHXP6MFQ”, “Satellite ID”: “12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB”, “Action”: “GET”}
2020-09-25T18:59:57.000-0400 INFO bandwidth Performing bandwidth usage rollups
2020-09-25T19:00:09.132-0400 INFO orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs sending {“count”: 22}
2020-09-25T19:00:09.132-0400 INFO orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 sending {“count”: 25}
2020-09-25T19:00:09.132-0400 INFO orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB sending {“count”: 177}
2020-09-25T19:00:09.132-0400 INFO orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE sending {“count”: 128}
2020-09-25T19:00:09.132-0400 INFO orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S sending {“count”: 18}
2020-09-25T19:00:09.502-0400 INFO orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs finished
2020-09-25T19:00:09.606-0400 INFO orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE finished
2020-09-25T19:00:09.842-0400 INFO orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 finished
2020-09-25T19:00:10.753-0400 INFO piecestore upload started {“Piece ID”: “H77ATER2Z3KUNKD3VDAR5Z27IZETRLYRBL3WFQ5HDOGRLZVUHPGA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Available Space”: 1692405121792}
2020-09-25T19:00:10.822-0400 INFO piecestore uploaded {“Piece ID”: “H77ATER2Z3KUNKD3VDAR5Z27IZETRLYRBL3WFQ5HDOGRLZVUHPGA”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”}
2020-09-25T19:00:11.048-0400 INFO orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S finished
2020-09-25T19:00:15.362-0400 INFO piecestore upload started {“Piece ID”: “7BVKKVJ73IFHQCZ5H4PC6GX23Y7MKXT5FCDOWKQYW47FK4EBNQHQ”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Action”: “PUT”, “Available Space”: 1692405119744}
2020-09-25T19:00:15.812-0400 INFO orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB finished
2020-09-25T19:00:15.901-0400 INFO piecestore uploaded {“Piece ID”: “7BVKKVJ73IFHQCZ5H4PC6GX23Y7MKXT5FCDOWKQYW47FK4EBNQHQ”, “Satellite ID”: “121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6”, “Action”: “PUT”}
2020-09-25T19:01:03.179-0400 INFO piecestore upload started {“Piece ID”: “AQRIHN3JE2NOGD7O3COAL5FQMJDTAOFLLKBFAR6Y3VGBTPSOX55Q”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”, “Available Space”: 1692405092608}
2020-09-25T19:01:03.568-0400 INFO piecestore uploaded {“Piece ID”: “AQRIHN3JE2NOGD7O3COAL5FQMJDTAOFLLKBFAR6Y3VGBTPSOX55Q”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “PUT”}
2020-09-25T19:01:50.216-0400 INFO piecestore upload started {“Piece ID”: “RSIT23FHD4ZSWA2ORRJ64PA5T4UEU2NYMPFK6GLND3TGC3V2PREQ”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”, “Available Space”: 1692405091584}
2020-09-25T19:01:51.317-0400 INFO piecestore uploaded {“Piece ID”: “RSIT23FHD4ZSWA2ORRJ64PA5T4UEU2NYMPFK6GLND3TGC3V2PREQ”, “Satellite ID”: “12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs”, “Action”: “PUT”}
2020-09-25T19:12:40.496-0400 ERROR contact:service ping satellite failed {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”, “attempts”: 12, “error”: “ping satellite error: rpc: context deadline exceeded”, “errorVerbose”: “ping satellite error: rpc: context deadline exceeded\n\tstorj.io/common/rpc.Dialer.dialTransport:211\n\tstorj.io/common/rpc.Dialer.dial:188\n\tstorj.io/common/rpc.Dialer.DialNodeURL:148\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57”}
2020-09-25T19:46:48.512-0400 INFO contact:service retries timed out for this cycle {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”}
2020-09-25T19:59:57.001-0400 INFO bandwidth Performing bandwidth usage rollups
2020-09-25T20:00:08.993-0400 INFO orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S sending {“count”: 34}
2020-09-25T20:00:08.993-0400 INFO orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB sending {“count”: 152}
2020-09-25T20:00:08.993-0400 INFO orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE sending {“count”: 133}
2020-09-25T20:00:08.993-0400 INFO orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 sending {“count”: 17}
2020-09-25T20:00:08.993-0400 INFO orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs sending {“count”: 28}
2020-09-25T20:00:09.359-0400 INFO orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S finished
2020-09-25T20:00:09.394-0400 INFO orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs finished
2020-09-25T20:00:09.711-0400 INFO orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 finished
2020-09-25T20:00:09.758-0400 INFO orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE finished
2020-09-25T20:00:21.873-0400 INFO orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB finished
2020-09-25T20:59:03.704-0400 INFO contact:service retries timed out for this cycle {“Satellite ID”: “118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW”}
2020-09-25T20:59:56.987-0400 INFO bandwidth Performing bandwidth usage rollups
2020-09-25T21:05:10.991-0400 INFO orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs sending {“count”: 1}
2020-09-25T21:05:10.991-0400 INFO orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S sending {“count”: 2}
2020-09-25T21:05:10.991-0400 INFO orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 sending {“count”: 1}
2020-09-25T21:05:11.193-0400 INFO orders.12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S finished
2020-09-25T21:05:11.389-0400 INFO orders.12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs finished
2020-09-25T21:05:11.718-0400 INFO orders.121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 finished
2020-09-25T21:59:56.996-0400 INFO bandwidth Performing bandwidth usage rollups
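To line up the times the dashboard showed offline with the log, one option is to list the stretches where the node logged nothing for longer than some threshold. This is a minimal sketch: the timestamp format is assumed from the lines above, lines whose timestamp doesn't parse (stack-trace continuations, truncated lines) are skipped, and the 30-minute default is arbitrary. A quiet stretch does not by itself mean the node was offline; a new node with little traffic can log only the hourly bandwidth rollups for long periods, as in the excerpt above. Gaps that coincide with the offline periods are the ones worth a closer look.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

# Assumed from the log above, e.g. 2020-09-25T18:59:19.612-0400
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_ts(line: str) -> Optional[datetime]:
    """Parse the leading timestamp of a log line, or return None if it has none."""
    parts = line.split(maxsplit=1)
    if not parts:
        return None
    try:
        return datetime.strptime(parts[0], TS_FORMAT)
    except ValueError:
        return None  # continuation lines, truncated timestamps, etc.


def find_gaps(lines: Iterable[str],
              threshold: timedelta = timedelta(minutes=30)
              ) -> List[Tuple[datetime, datetime]]:
    """Return (last entry before, first entry after) for every silence > threshold."""
    gaps: List[Tuple[datetime, datetime]] = []
    prev: Optional[datetime] = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# Example (hypothetical log path):
# with open("storagenode.log", encoding="utf-8") as f:
#     for start, end in find_gaps(f):
#         print(f"quiet from {start} to {end} ({end - start})")
```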