Filewalker running twice simultaneously?

With all the talk of the filewalker service running slowly, I’ve decided to monitor it more closely, as I know it’s a very slow process on my node. Yesterday just a single WalkNamespace() process was running, but today I see that it’s running twice?

From the mon/ps page…

[641491950405662978] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 26h16m57.3032232s, orphaned)
 [4234253337856310446] storj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite() (elapsed: 26h16m57.3032085s)
  [6861140452394329368] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 26h16m57.3028626s)
   [6181857365076502411] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 26h16m57.302855s)
    [5502574277758675454] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 26h16m57.3028459s)

[4430599288717823130] storj.io/storj/storagenode/retain.(*Service).Run() (elapsed: 26h16m57.3031257s, orphaned)
 [5454704730586751640] storj.io/storj/storagenode/retain.(*Service).retainPieces(storj.NodeID{0xf4, 0x74, 0x53, 0x5a, 0x19, 0xdb, 0x0, 0xdb, 0x4f, 0x80, 0x71, 0xa1, 0xbe, 0x6c, 0x25, 0x51, 0xf4, 0xde, 0xd6, 0xa6, 0xe3, 0x8f, 0x8, 0x18, 0xc6, 0x8c, 0x68, 0xd0, 0x0, 0x0, 0x0, 0x0}, time.Time(2022-08-01T21:49:59.904267156Z)) (elapsed: 26h15m24.214296468s)
  [503377168500450258] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 26h15m24.214265568s)
   [4775421643268924683] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 26h15m24.214258368s)
    [9047466118037399109] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 26h15m24.214251968s)

Two different reasons. The calling code is different and IIRC there are no locks around them.

Thanks, I understand it’s two different reasons, but are they fighting each other, i.e. both being slowed down by walking different parts of the drive at the same time?

I don’t remember the code that well, but if my memory is right, they will actually fight each other. On the other hand, if you have enough RAM to cache the visited metadata, at some point they’ll reuse the cache filled by the other process. It wouldn’t be a perfect situation, but under some assumptions it should be good enough.

Well, it’s now running 4 times…

[641491950405662978] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 47h58m31.3499126s, orphaned)
 [4234253337856310446] storj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite() (elapsed: 47h58m31.3498979s)
  [7880692022555742283] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 48h9m36.603366844s)
   [2929364460469440901] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 48h9m36.603353444s)
    [7201408935237915326] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 48h9m36.603347244s)

[4430599288717823130] storj.io/storj/storagenode/retain.(*Service).Run() (elapsed: 47h58m31.3498151s, orphaned)
 [3296997160821971726] storj.io/storj/storagenode/retain.(*Service).retainPieces(storj.NodeID{0xa2, 0x8b, 0x4f, 0x4, 0xe1, 0xb, 0xae, 0x85, 0xd6, 0x7f, 0x4c, 0x6c, 0xb8, 0x2b, 0xf8, 0xd4, 0xc0, 0xf0, 0xf4, 0x7a, 0x8e, 0xa7, 0x26, 0x27, 0x52, 0x4d, 0xeb, 0x6e, 0xc0, 0x0, 0x0, 0x0}, time.Time(2022-08-01T21:20:15.653133291Z)) (elapsed: 48h1m20.484198405s)
  [7569041635590446151] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 48h1m20.484187605s)
   [2617714073504144769] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 48h1m20.484170305s)
    [6889758548272619194] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 48h1m20.484164605s)
 [5454704730586751640] storj.io/storj/storagenode/retain.(*Service).retainPieces(storj.NodeID{0xf4, 0x74, 0x53, 0x5a, 0x19, 0xdb, 0x0, 0xdb, 0x4f, 0x80, 0x71, 0xa1, 0xbe, 0x6c, 0x25, 0x51, 0xf4, 0xde, 0xd6, 0xa6, 0xe3, 0x8f, 0x8, 0x18, 0xc6, 0x8c, 0x68, 0xd0, 0x0, 0x0, 0x0, 0x0}, time.Time(2022-08-01T21:49:59.904267156Z)) (elapsed: 47h56m58.260985868s)
  [503377168500450258] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 47h56m58.260954968s)
   [4775421643268924683] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 47h56m58.260947768s)
    [9047466118037399109] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 47h56m58.260941368s)
 [5919885932743187971] storj.io/storj/storagenode/retain.(*Service).retainPieces(storj.NodeID{0x84, 0xa7, 0x4c, 0x2c, 0xd4, 0x3c, 0x5b, 0xa7, 0x65, 0x35, 0xe1, 0xf4, 0x2f, 0x5d, 0xf7, 0xc2, 0x87, 0xed, 0x68, 0xd3, 0x35, 0x22, 0x78, 0x2f, 0x4a, 0xfa, 0xbf, 0xdb, 0x40, 0x0, 0x0, 0x0}, time.Time(2022-08-01T21:15:34.490419388Z)) (elapsed: 48h10m26.601729569s)
  [968558370656886588] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePieces() (elapsed: 48h10m26.601715269s)
   [5240602845425361014] storj.io/storj/storage/filestore.(*Dir).WalkNamespace() (elapsed: 48h10m26.601663069s)
    [289275283339059631] storj.io/storj/storage/filestore.(*Dir).walkNamespaceInPath() (elapsed: 48h10m26.601649969s)

This must be counterproductive.

There is a setting for it:

--retain.concurrency int
      how many concurrent retain requests can be processed at the same time (default 5)
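
For reference, this is roughly what it would look like in config.yaml, assuming the usual layout (the value of 1 here is just an example to stop the retain walks overlapping, not a general recommendation):

    # how many concurrent retain requests can be processed at the same time (default 5)
    retain.concurrency: 1

It can also be passed as a command-line flag (--retain.concurrency=1) if you run the node with flags rather than a config file.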

Thanks. I have enabled the setting and adjusted the retain.concurrency value to 1. I will wait for the next automatic/scheduled restart to see how this affects the node.

How can I get this output?

I’m looking at the monkit ps and funcs output…
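
In case it’s useful, this is roughly how the pages can be fetched once you know the debug port; the address and port below are just placeholders for whatever your node’s debug.addr resolves to:

    curl http://127.0.0.1:5999/mon/ps
    curl http://127.0.0.1:5999/mon/funcs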

For those with Windows nodes: I hadn’t set the port for debug.addr, but I found it using TCPView. Considering how much I’m checking it, it would make sense to fix the port now.
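
Pinning it should just be a one-line change in config.yaml, something like the following (the address and port here are only an example, pick whatever suits your setup), followed by a node restart:

    # expose the debug/monkit endpoints on a fixed address instead of a random port
    debug.addr: 127.0.0.1:5999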


Follow-up on this point: I’m an idiot for changing the config and not testing it. My node was offline for 12 hours because I mistyped the retain.concurrency change, which only came to light when the storagenode updater tried to trigger an update. The node is up and running again, with the change, so we will see the result…

Thanks, I haven’t tried that yet.