Questions about readability and writability intervals and timeouts

I’m trying to understand what exactly these parameters do, how they influence the node’s score with the satellites, and how much we can change them:

      --storage2.monitor.verify-dir-readable-interval duration   how frequently to verify the location and readability of the storage directory (default 1m0s)
      --storage2.monitor.verify-dir-readable-timeout duration    how long to wait for a storage directory readability verification to complete (default 1m0s)
      --storage2.monitor.verify-dir-writable-interval duration   how frequently to verify writability of storage directory (default 5m0s)
      --storage2.monitor.verify-dir-writable-timeout duration    how long to wait for a storage directory writability verification to complete (default 1m0s)

I want to:

  • minimize the workload by increasing the intervals, and…
  • prevent fatal errors by increasing the timeouts.

Are these assumptions right?
How much can we increase the intervals? How much can we increase the timeouts?
Is it OK if I set them like this?
Do I get any penalties? Do I risk creating any system problems or bottlenecks?

storage2.monitor.verify-dir-readable-interval: 3m0s
storage2.monitor.verify-dir-readable-timeout: 3m0s
storage2.monitor.verify-dir-writable-interval: 5m0s
storage2.monitor.verify-dir-writable-timeout: 5m0s

What if I put them all at 60m0s?

Changing any of those is dangerous. They are there to protect the node from being suspended or disqualified when the data store is unresponsive or unavailable during GET_AUDIT and GET_REPAIR, both of which impact your scores with the satellites. For instance, you could have a failing hard disk that is really slow to read a file because it’s having seek issues. If you set the timeout too high, this will not become apparent, and you won’t be able to run a filesystem check on the disk before it fails badly.

The impact of changing these values is not linear: it grows with the size of the node, since a node holding more data is audited and asked for repair traffic more often.

On a brand new node you could easily get away with setting the intervals to 60m and keeping the timeouts below 5m.

On a node of 1TB or more I personally wouldn’t want to risk it, and would keep all intervals below 5m and all timeouts below 3m.

What I’m not sure about, without looking at the code, is what the current impact on the storagenode is if the verification fails. I have a feeling it used to cause the storagenode to restart and then get stuck in a restart loop. Maybe now it tells the satellite the node is offline until the interval has passed and it rechecks.

But it would be interesting to get an official line on this; I’m just pointing out that I don’t fiddle with those ones :slight_smile:

Edit: OK, so it looks like the default is to kill the node if these checks fail.
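
For anyone curious about the mechanics, conceptually it is a loop like the sketch below. This is my own simplified illustration in Go, not the actual storagenode code; the directory path and the constants are made up, and the point is just how the interval, the timeout and the fatal exit relate to each other.

    // Illustrative sketch only -- NOT the actual storagenode implementation.
    // On a fixed interval, try a small write against the storage directory
    // with a deadline, and terminate the process if the check cannot finish
    // in time.
    package main

    import (
        "context"
        "log"
        "os"
        "path/filepath"
        "time"
    )

    // verifyWritable writes and removes a small marker file in dir.
    func verifyWritable(dir string) error {
        marker := filepath.Join(dir, ".writability-check")
        if err := os.WriteFile(marker, []byte("ok"), 0o644); err != nil {
            return err
        }
        return os.Remove(marker)
    }

    // runWithTimeout runs check and gives up once the deadline passes.
    func runWithTimeout(timeout time.Duration, check func() error) error {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()

        done := make(chan error, 1)
        go func() { done <- check() }()

        select {
        case err := <-done:
            return err
        case <-ctx.Done():
            return ctx.Err() // deadline exceeded: the disk answered too slowly
        }
    }

    func main() {
        const (
            dir      = "/storage"      // hypothetical storage directory
            interval = 5 * time.Minute // cf. verify-dir-writable-interval
            timeout  = 1 * time.Minute // cf. verify-dir-writable-timeout
        )

        for range time.Tick(interval) {
            err := runWithTimeout(timeout, func() error { return verifyWritable(dir) })
            if err != nil {
                // A fatal log stops the whole process, which is what the
                // node appears to do by default when a check fails.
                log.Fatalf("storage directory not writable within %s: %v", timeout, err)
            }
        }
    }

So raising the interval only makes the check run less often, while raising the timeout changes how slow the disk is allowed to be before the node gives up.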


I will just add a little to what @CutieePie rightly said.

They do not directly affect the node scores, but are precautionary checks meant to keep your scores from dropping when a drive fails. They warn you (or kill the node) if the drive is unresponsive or too slow.

It’s fine if you want to change the check intervals (i.e. storage2.monitor.verify-dir-readable-interval and storage2.monitor.verify-dir-writable-interval) to 60m0s, but the timeouts should not be too high. If the timeouts are too high, the node will hang almost “forever”, start failing audits quickly, and that would get you disqualified.
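
For example, relaxing only the intervals while keeping the default 1m0s timeouts would look something like this:

storage2.monitor.verify-dir-readable-interval: 60m0s
storage2.monitor.verify-dir-readable-timeout: 1m0s
storage2.monitor.verify-dir-writable-interval: 60m0s
storage2.monitor.verify-dir-writable-timeout: 1m0s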

If it takes a node more than 60s to read or write a piece of data, it isn’t very helpful to the network.


Until now, with one 14TB node per machine, I haven’t had any fatal errors with stock settings, but I just added new 22TB drives and I don’t know how they will behave.
I will set the intervals to 5m and the timeouts to 2m just to be on the safe and comfortable side. I don’t check my nodes very often, and Synology sends me emails if there are HDD health issues, so I don’t expect to be caught by surprise when they start to fail.
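
In config.yaml that would look something like:

storage2.monitor.verify-dir-readable-interval: 5m0s
storage2.monitor.verify-dir-readable-timeout: 2m0s
storage2.monitor.verify-dir-writable-interval: 5m0s
storage2.monitor.verify-dir-writable-timeout: 2m0s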