Node hiccupped. Anything I need to do?

Hi

Noticed the audit score on one of my nodes took a very tiny hit. Not really bothered about that per se, but I went looking to see if I could find out why. Found the following in my logs:

2025-05-22T23:19:53-07:00       ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyWritability:184\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:114\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:20:09-07:00       WARN    servers service takes long to shutdown  {"Process": "storagenode", "name": "server"}
2025-05-22T23:20:09-07:00       WARN    services        service takes long to shutdown  {"Process": "storagenode", "name": "gracefulexit:chore"}
2025-05-22T23:20:12-07:00       INFO    services        slow shutdown   {"Process": "storagenode", "stack": "goroutine 1614\n\tstorj.io/storj/private/lifecycle.(*Group).logStackTrace.func1:107\n\tsync.(*Once).doSlow:78\n\tsync.(*Once).Do:69\n\tstorj.io/storj/private/lifecycle.(*Group).logStackTrace:104\n\tstorj.io/storj  ** remainder of massive line removed
2025-05-22T23:20:14-07:00       INFO    servers slow shutdown   {"Process": "storagenode", "stack": "goroutine 1550\n\tstorj.io/storj/private/lifecycle.(*Group).logStackTrace.func1:107\n\tsync.(*Once).doSlow:78\n\tsync.(*Once).Do:69\n\tstorj.io/storj/private/lifecycle.(*Group).logStackTrace:104\n\tstorj.io/storj/private  ** remainder of massive line removed
2025-05-22T23:20:23-07:00       ERROR   gracefulexit:chore      error retrieving satellites.    {"Process": "storagenode", "error": "satellitesdb: context canceled", "errorVerbose": "satellitesdb: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits.func1:201\n\tstorj.io/storj/storagenode/storagenodedb.(*satellitesDB).ListGracefulExits:213\n\tstorj.io/storj/storagenode/gracefulexit.(*Service).ListPendingExits:59\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).AddMissing:55\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/gracefulexit.(*Chore).Run:48\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:20:54-07:00       ERROR   failure during run      {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyWritability:184\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:114\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:20:54-07:00       FATAL   Unrecoverable error     {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyWritability:184\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:114\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

At this point, it looks like the node automatically restarted, based on the log messages:

2025-05-22T23:22:21-07:00       INFO    Configuration loaded    {"Process": "storagenode", "Location": "/mnt/storagenode/data/config.yaml"}
2025-05-22T23:22:21-07:00       INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2025-05-22T23:22:22-07:00       INFO    Operator email  {"Process": "storagenode", "Address": "you.don't.need.to.know"}
2025-05-22T23:22:22-07:00       INFO    Operator wallet {"Process": "storagenode", "Address": "0x................................"}
2025-05-22T23:22:42-07:00       INFO    server  existing kernel support for server-side tcp fast open detected  {"Process": "storagenode"}
2025-05-22T23:22:44-07:00       INFO    hashstore       hashstore opened successfully   {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "open_time": "234.844547ms"}
2025-05-22T23:22:44-07:00       INFO    hashstore       hashstore opened successfully   {"Process": "storagenode", "satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "open_time": "49.344599ms"}
2025-05-22T23:22:48-07:00       INFO    Telemetry enabled       {"Process": "storagenode", "instance ID": "1R9kwKBqceNdCvY6sDmjNDvsDjsjPXnYkHM8P4qNZmYt4sQeMd"}

2025-05-22T23:22:48-07:00       INFO    Event collection enabled        {"Process": "storagenode", "instance ID": "1R9kwKBqceNdCvY6sDmjNDvsDjsjPXnYkHM8P4qNZmYt4sQeMd"}
2025-05-22T23:22:48-07:00       INFO    db.migration    Database Version        {"Process": "storagenode", "version": 62}
2025-05-22T23:22:51-07:00       INFO    preflight:localtime     start checking local system clock with trusted satellites' system clock.        {"Process": "storagenode"}
2025-05-22T23:22:51-07:00       INFO    preflight:localtime     local system clock is in sync with trusted satellites' system clock.    {"Process": "storagenode"}
2025-05-22T23:22:51-07:00       INFO    trust   Scheduling next refresh {"Process": "storagenode", "after": "7h33m8.323930054s"}
2025-05-22T23:22:51-07:00       INFO    Node 1R9kwKBqceNdCvY6sDmjNDvsDjsjPXnYkHM8P4qNZmYt4sQeMd started {"Process": "storagenode"}
2025-05-22T23:22:51-07:00       INFO    Public server started on [::]:28967     {"Process": "storagenode"}
2025-05-22T23:22:51-07:00       INFO    Private server started on 127.0.0.1:7778        {"Process": "storagenode"}
2025-05-22T23:22:51-07:00       INFO    bandwidth       Persisting bandwidth usage cache to db  {"Process": "storagenode"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "elapsed": "235.996µs"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2025-05-22T23:22:51-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2025-05-22T23:22:51-07:00       INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {}, "interval": "10m0s"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "elapsed": "56.441226ms"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "elapsed": "119.254µs"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2025-05-22T23:22:51-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "elapsed": "70.654µs"}
2025-05-22T23:23:10-07:00       INFO    orders  finished        {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "count": 9}
2025-05-22T23:23:12-07:00       INFO    orders  finished        {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "count": 1010}
2025-05-22T23:23:12-07:00       INFO    orders  finished        {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "count": 1312}
2025-05-22T23:23:24-07:00       INFO    orders  finished        {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "count": 53596}
2025-05-22T23:27:29-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Total Audits": 5513, "Successful Audits": 5497, "Audit Score": 1, "Online Score": 0.9938142770719903, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-22T23:27:29-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Total Audits": 41470, "Successful Audits": 41407, "Audit Score": 1, "Online Score": 0.9988376844503829, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-22T23:27:30-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Total Audits": 55688, "Successful Audits": 55642, "Audit Score": 1, "Online Score": 0.9995757646082547, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-22T23:27:30-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Total Audits": 5832, "Successful Audits": 5820, "Audit Score": 1, "Online Score": 0.9985889479905438, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-22T23:32:51-07:00       INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {}, "interval": "10m0s"}
2025-05-22T23:42:51-07:00       INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {}, "interval": "10m0s"}
2025-05-22T23:52:51-07:00       INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {}, "interval": "10m0s"}
2025-05-22T23:57:01-07:00       ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyStorageDir:157\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:111\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:11-07:00       ERROR   filewalker      failed to store the last batch of prefixes      {"Process": "storagenode", "error": "filewalker: context canceled", "errorVerbose": "filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:78\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:129\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:11-07:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "error": "filewalker: filewalker: context canceled; used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: filewalker: context canceled; used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:181\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:11-07:00       ERROR   piecestore:cache        encountered error while computing space used by satellite       {"Process": "storagenode", "error": "filewalker: filewalker: context canceled; used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: filewalker: context canceled; used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:181\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2025-05-22T23:57:11-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2025-05-22T23:57:11-07:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": false, "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:11-07:00       ERROR   piecestore:cache        encountered error while computing space used by satellite       {"Process": "storagenode", "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2025-05-22T23:57:11-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2025-05-22T23:57:11-07:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:11-07:00       ERROR   piecestore:cache        encountered error while computing space used by satellite       {"Process": "storagenode", "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2025-05-22T23:57:11-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2025-05-22T23:57:11-07:00       ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:11-07:00       ERROR   piecestore:cache        encountered error while computing space used by satellite       {"Process": "storagenode", "error": "filewalker: used_space_per_prefix_db: context canceled", "errorVerbose": "filewalker: used_space_per_prefix_db: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*usedSpacePerPrefixDB).Get:81\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatelliteWithWalkFunc:96\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:83\n\tstorj.io/storj/storagenode/pieces.(*Store).WalkAndComputeSpaceUsedBySatellite:783\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:85\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "SatelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2025-05-22T23:57:11-07:00       ERROR   piecestore:cache        error getting current used space for trash:     {"Process": "storagenode", "error": "filestore error: failed to walk trash namespace af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000: context canceled", "errorVerbose": "filestore error: failed to walk trash namespace af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000: context canceled\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).SpaceUsedForTrash:302\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run.func1:104\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:12-07:00       ERROR   failure during run      {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyStorageDir:157\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:111\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-05-22T23:57:12-07:00       FATAL   Unrecoverable error     {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyStorageDir:157\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:111\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Again, it appears the node restarted itself, but this time there were no more errors thrown:

2025-05-22T23:57:16-07:00       INFO    Configuration loaded    {"Process": "storagenode", "Location": "/mnt/storagenode/data/config.yaml"}
2025-05-22T23:57:16-07:00       INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2025-05-22T23:57:16-07:00       INFO    Operator email  {"Process": "storagenode", "Address": "you.don't.need.to.know"}
2025-05-22T23:57:16-07:00       INFO    Operator wallet {"Process": "storagenode", "Address": "0x................................"}
2025-05-22T23:57:19-07:00       INFO    server  existing kernel support for server-side tcp fast open detected  {"Process": "storagenode"}
2025-05-22T23:57:20-07:00       INFO    hashstore       hashstore opened successfully   {"Process": "storagenode", "satellite": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "open_time": "115.352969ms"}
2025-05-22T23:57:20-07:00       INFO    hashstore       hashstore opened successfully   {"Process": "storagenode", "satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "open_time": "54.173687ms"}
2025-05-22T23:57:21-07:00       INFO    Telemetry enabled       {"Process": "storagenode", "instance ID": "1R9kwKBqceNdCvY6sDmjNDvsDjsjPXnYkHM8P4qNZmYt4sQeMd"}
2025-05-22T23:57:21-07:00       INFO    Event collection enabled        {"Process": "storagenode", "instance ID": "1R9kwKBqceNdCvY6sDmjNDvsDjsjPXnYkHM8P4qNZmYt4sQeMd"}
2025-05-22T23:57:21-07:00       INFO    db.migration    Database Version        {"Process": "storagenode", "version": 62}
2025-05-22T23:57:22-07:00       INFO    preflight:localtime     start checking local system clock with trusted satellites' system clock.        {"Process": "storagenode"}
2025-05-22T23:57:22-07:00       INFO    preflight:localtime     local system clock is in sync with trusted satellites' system clock.    {"Process": "storagenode"}
2025-05-22T23:57:22-07:00       INFO    trust   Scheduling next refresh {"Process": "storagenode", "after": "6h47m26.34787941s"}
2025-05-22T23:57:22-07:00       INFO    Node 1R9kwKBqceNdCvY6sDmjNDvsDjsjPXnYkHM8P4qNZmYt4sQeMd started {"Process": "storagenode"}
2025-05-22T23:57:22-07:00       INFO    Public server started on [::]:28967     {"Process": "storagenode"}
2025-05-22T23:57:22-07:00       INFO    Private server started on 127.0.0.1:7778        {"Process": "storagenode"}
2025-05-22T23:57:22-07:00       INFO    bandwidth       Persisting bandwidth usage cache to db  {"Process": "storagenode"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2025-05-22T23:57:22-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2025-05-22T23:57:22-07:00       INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {}, "interval": "10m0s"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "elapsed": "26.84673ms"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "elapsed": "156.963µs"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "elapsed": "57.282µs"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2025-05-22T23:57:22-07:00       INFO    pieces:trash    emptying trash finished {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "elapsed": "68.309µs"}
2025-05-23T00:00:48-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Total Audits": 5513, "Successful Audits": 5497, "Audit Score": 1, "Online Score": 0.9938142770719903, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-23T00:00:49-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Total Audits": 41470, "Successful Audits": 41407, "Audit Score": 1, "Online Score": 0.9988376844503829, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-23T00:00:49-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Total Audits": 55787, "Successful Audits": 55736, "Audit Score": 1, "Online Score": 0.9995757646082547, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-23T00:00:49-07:00       INFO    reputation:service      node scores updated     {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Total Audits": 5840, "Successful Audits": 5828, "Audit Score": 1, "Online Score": 0.9985889479905438, "Suspension Score": 1, "Audit Score Delta": 0, "Online Score Delta": 0, "Suspension Score Delta": 0}
2025-05-23T00:07:22-07:00       INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {}, "interval": "10m0s"}
2025-05-23T00:15:03-07:00       INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "Total Pieces Size": 3864819676762, "Total Pieces Content Size": 3857872854618, "Total Pieces Count": 13568012, "Duration": "17m41.12852924s"}
2025-05-23T00:15:03-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2025-05-23T00:17:22-07:00       INFO    piecemigrate:chore      all enqueued for migration; will sleep before next pooling      {"Process": "storagenode", "active": {}, "interval": "10m0s"}
2025-05-23T00:18:31-07:00       INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Lazy File Walker": false, "Total Pieces Size": 76142548992, "Total Pieces Content Size": 76046517248, "Total Pieces Count": 187562, "Duration": "3m27.191898822s"}
2025-05-23T00:18:31-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2025-05-23T00:23:21-07:00       INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Lazy File Walker": false, "Total Pieces Size": 225347145472, "Total Pieces Content Size": 224998125824, "Total Pieces Count": 681679, "Duration": "4m50.210442834s"}
2025-05-23T00:23:21-07:00       INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2025-05-23T00:24:12-07:00       INFO    pieces  used-space-filewalker completed {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "Total Pieces Size": 1853449216, "Total Pieces Content Size": 1844134912, "Total Pieces Count": 18192, "Duration": "50.825243664s"}

Is there anything else I need to do at this point, or was everything cleaned up correctly?

Cheers.

Nice digging! :+1:

This certainly seems to be the trouble.
Usually it's an indication that the storage is busy and cannot complete its built-in write test within 1 minute.

The most plausible underlying causes:

  • Other workloads consuming I/O on the same disk (this may be acceptable, in which case just accept the reputation hit)
  • A hardware issue with the disk or controller (check the disk for errors)
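
For context, that write test is conceptually nothing fancy: the node writes a small file into the storage directory and gives up if the round trip doesn't finish within the configured timeout (1m0s by default, which is the value in your error). A minimal sketch in Go of such a probe, purely illustrative and not the actual storj code:

package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// verifyWritable writes and syncs a small temp file inside dir,
// failing if the whole round trip does not finish before ctx expires.
func verifyWritable(ctx context.Context, dir string) error {
	done := make(chan error, 1)
	go func() {
		path := filepath.Join(dir, ".write-test")
		f, err := os.Create(path)
		if err != nil {
			done <- err
			return
		}
		_, err = f.Write([]byte("test"))
		if err == nil {
			err = f.Sync() // make sure it actually reaches the disk
		}
		f.Close()
		os.Remove(path)
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("timed out while verifying writability of %s", dir)
	}
}

func main() {
	// the storage path here is just an example; use your node's actual data directory
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if err := verifyWritable(ctx, "/mnt/storagenode/data"); err != nil {
		fmt.Println("writability check failed:", err)
		return
	}
	fmt.Println("storage directory is writable")
}

On a healthy disk this completes in milliseconds; if the storage can't manage it within a full minute, the node treats the directory as unusable and shuts down, which is the FATAL you see in the log.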

If you are satisfied the disk/hardware is fine, no action required. It fixed itself with the restarts.

Also, a filewalker runtime of ~30 minutes for a 4-5 TB node is quite nice, so in general the disk's performance is good.


When the load is really high (like during last year's test data flood) I can get lots of "context canceled" errors for things like filewalkers, or even garbage collection failing. Restarting the node makes it try again, and eventually everything finishes.

The standard explanation around here for the context canceled message is that the storage system is too slow to keep up. Personally I've seen the message less than a second after a task starts, so my own belief is that something in either storj or Go is being too picky, but I'm not smart enough to debug it.

Now the unrecoverable error where writability couldn't be confirmed… well, that means your storage directory disappeared or went read-only for a bit. That could be a hardware or OS problem. But if it doesn't repeat, then it's not really a problem.

This morning I had another occurrence of this failure:

2025-06-19T10:11:46-07:00       ERROR   services        unexpected shutdown of a runner {"Process": "storagenode", "name": "piecestore:monitor", "error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyStorageDir:157\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:111\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T10:12:52-07:00       WARN    servers service takes long to shutdown  {"Process": "storagenode", "name": "server"}
2025-06-19T10:12:54-07:00       INFO    servers slow shutdown   {"Process": "storagenode", "stack": "goroutine 936\n\tstorj.io/storj/private/lifecycle.(*Group).logStackTrace.func1:107\n\tsync.(*Once).doSlow:78\n\tsync.(*Once).Do:69\n\tstorj.io/storj/private/lifecycle.(*Group).logStackTrace:104\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func1:77\n\truntime/pprof.Do:51\n\ngoroutine 1\n\tsync.runtime_SemacquireWaitGroup:110\n\tsync.(*WaitGroup).Wait:118\n\tgolang.org/x/sync/errgroup.(*Group).Wait:56\n\tstorj.io/storj/storagenode.(*Peer).Run:1099\n\ *** massive line truncated ***
2025-06-19T10:14:26-07:00       ERROR   failure during run      {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyStorageDir:157\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:111\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2025-06-19T10:14:26-07:00       FATAL   Unrecoverable error     {"Process": "storagenode", "error": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying readability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).verifyStorageDir:157\n\tstorj.io/common/sync2.(*Cycle).Run:163\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func1:111\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

This time the node restarted cleanly, but I still did some more digging, as I'd like to understand what's causing this in case I'm not so fortunate with the restart next time.

For reference, the storj datastore lives on a RAIDZ2 pool on my TrueNAS SCALE server. The connection is iSCSI.

The storj logs only show issues starting at around 10:11, but something was brewing before that.

Here’s the CPU load:

And the IOPS:

Even after the IOPS drop back down, the iowait stays ridiculously high.

But there's no corresponding increase in network ingress to account for the rising write load:

Nothing looked out of place on the TrueNAS side while this was going on, other than the increase in data writes, which was well within manageable proportions. The CPU numbers there showed no change in iowait during this period.

So, does anyone have any thoughts on what's causing this spike?

Cheers.

I do. That's expected. I'm surprised it survived as far into the madness as it did. See below.

Ouch!

So, with raidz2 your array's IOPS capacity is barely that of a single HDD. A ceiling of 400 IOPS is actually quite amazing.
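
To put rough numbers on it (back-of-envelope, assuming a typical 7200 rpm HDD is good for on the order of ~200 random IOPS): ZFS serves random I/O per vdev, so a raidz2 vdev behaves roughly like one disk, and the pool scales with vdev count rather than disk count:

random IOPS (pool)       ≈ number of vdevs × IOPS per disk
single raidz2 vdev:      ≈ 1 × ~200 ≈ ~200 IOPS
e.g. three mirror vdevs: ≈ 3 × ~200 ≈ ~600 IOPS

So no matter how many disks are in that one raidz2 vdev, you should only count on a couple of hundred random IOPS from it.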

Now, when you hit the IOPS ceiling, transactions start piling up, the node tries to cache them, and ultimately it gets killed by the OOM watcher.

Most importantly, with iSCSI you are removing any and all ability of the filesystem to optimize access. iSCSI + raidz2 == fail. It's never an appropriate combination.

You can do a number of things:

  • Stop using iSCSI; use NFS instead if you have to run the node on another host (don't! see below). ZFS is amazing if you don't cripple its ability to be a filesystem by hiding everything behind an opaque block-storage protocol. NFS gives the filesystem a chance to optimize access and at least cache metadata in ARC. You do need a lot of free RAM; 64 GB would be a start.
  • Instead of even NFS, run the node directly on TrueNAS. There is no reason not to, and many reasons to do so.
  • Rebuild your pool with a number of raidz1 vdevs; you don't need raidz2. If your other workloads need snazzy iSCSI, you'll have to use mirrors.
  • Add a special device to your pool, or at the very least an L2ARC of at least 500GB.
  • Ensure the databases are fully accelerated (set the special small blocks size to the recordsize for the dataset where the databases live).
  • Read this: Notes on storage node performance optimization on ZFS

Yeah, but it's a case of using the hardware I already have that is otherwise underused.

storj wasn't killed by anything external; it committed suicide.

I didn’t think NFS was recommended/supported.

Shhhhhh, there’s already one running there as well.

Too far down the path to be able to offload/rebuild/reload.

Already in place.

But I'd still like to know why writes/IOPS suddenly started to rise when there was no major change in network ingress leading up to the situation.

Cheers.

Your hardware is fine. Your software configuration is not. Even before you consider storj: if you need to use iSCSI, you should not have raidz2 in that pool.

Yes, it was an accident, it fell on the knife 48 times.

Look, you need to consider the whole system. iSCSI would be drastically better than NFS if your server were configured for iSCSI. It isn't. So NFS is the next best thing.

So? I have five on the same TrueNAS. Especially since you already run one, run the second one the same way.

Note, this will do nothing for iSCSI. It will only help with NFS or with running the node directly.

Network activity, filewalker, whatever: it does not matter, and it is not the point. You have a hard limit of 300-400 IOPS. 1 IOPS below that, all is great. 1 IOPS above, everything explodes.

Properly configured, you should see 20-30 IOPS reaching the disks. That's what I see with 5 nodes (same IP, so ingress is shared, but egress is not).

Deletes are writes too. If you want to ignore the problem a little longer, defrag the block device. Or, easier, just increase the read and write timeouts; maybe an extra minute or two can get you through the extended periods of I/O that your system can't handle.
Specifically, edit config.yaml:

# how frequently to verify the location and readability of the storage directory
storage2.monitor.verify-dir-readable-interval: 5m0s
storage2.monitor.verify-dir-readable-timeout: 2m30s

# how frequently to verify writability of storage directory
storage2.monitor.verify-dir-writable-interval: 5m0s
storage2.monitor.verify-dir-writable-timeout: 2m30s
Do not exceed 5m on the timeout or the interval.
Also consider your network quality and other workloads; storj generates a lot of jitter over iSCSI.

2 cents,
Julio

To do either of these would mean converting the zvol to a dataset. Is there any easy way to do this on the TrueNAS box, given that there appears to be no way to mount the zvol locally, or am I looking at copying/rsyncing the data from the iSCSI client back over the network to TrueNAS, which is going to take quite some time?

Cheers.

No, these are fundamentally different entities. Even if you had ZFS on the iSCSI volume, you would still need to send/receive it. But you likely don't, so the point is moot. For all intents and purposes the data is only available on the initiator machine, so you would need to copy the files from there to the NAS over the network.

Use the node migration workflow (How do I migrate my node to a new device? - Storj Docs) to minimize node downtime; then it won't matter how long it takes. It will eventually finish.


Think I've identified the culprit for the increase in IOPS: a retain job.

Cheers.


The problem is not the retain job. The problem is that your pool cannot handle more than 400 IOPS. Fixing that will drastically improve performance for your users and, as a side effect, improve node performance.
