HDD failing, filewalker / bloom filter issue? What am I looking at?

I have run my node for about 3 months with no issues, and it has suddenly started to fail. It has been offline so much lately that I’m approaching the suspension threshold… I have browsed the forum, but the errors I’m seeing in the logs have been reported in connection with several different issues in the past, so I really don’t know where to start to try to fix this. My HDD is only 4 months old.

Here is my log:

2024-06-29T13:25:40Z    INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-06-29T13:25:40Z    ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-06-29T13:25:40Z    ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-06-29T13:25:40Z    INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-29T13:25:40Z    ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "context canceled"}
2024-06-29T13:25:40Z    ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-29T13:25:40Z    INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-29T13:25:40Z    ERROR   lazyfilewalker.used-space-filewalker    failed to start subprocess      {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-29T13:25:40Z    ERROR   pieces  failed to lazywalk space used by satellite      {"Process": "storagenode", "error": "lazyfilewalker: context canceled", "errorVerbose": "lazyfilewalker: context canceled\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*process).run:73\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*Supervisor).WalkAndComputeSpaceUsedBySatellite:130\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:707\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-29T13:25:40Z    ERROR   piecestore:cache        error getting current used space:       {"Process": "storagenode", "error": "filewalker: context canceled; filewalker: context canceled; filewalker: context canceled; filewalker: context canceled", "errorVerbose": "group:\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n--- filewalker: context canceled\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces:74\n\tstorj.io/storj/storagenode/pieces.(*FileWalker).WalkAndComputeSpaceUsedBySatellite:79\n\tstorj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite:716\n\tstorj.io/storj/storagenode/pieces.(*CacheService).Run:58\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
2024-06-29T13:27:59Z    INFO    Downloading versions.   {"Process": "storagenode-updater", "Server Address": "https://version.storj.io"}
2024-06-29T13:27:59Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode", "Version": "v1.105.4"}
2024-06-29T13:27:59Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode"}
2024-06-29T13:27:59Z    INFO    Current binary version  {"Process": "storagenode-updater", "Service": "storagenode-updater", "Version": "v1.105.4"}
2024-06-29T13:27:59Z    INFO    Version is up to date   {"Process": "storagenode-updater", "Service": "storagenode-updater"}

You need to search for a FATAL error somewhere earlier in the log; these entries look like a consequence of the node being stopped due to a FATAL error.
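
For example, assuming a Docker setup with the container named storagenode (adjust the container name or log path to your setup), something like this will pull out the FATAL entries together with the lines just before them:

docker logs storagenode 2>&1 | grep -B 5 "FATAL"
# or, if the node's log is redirected to a file:
grep -B 5 "FATAL" /path/to/node.log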


The only FATAL error I can find is this one (in the entries before and after it, all upload races are systematically lost):

2024-06-30T22:33:47Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"Process": "storagenode", "Piece ID": "OUES577QXYKK2JC2TLUZ46T7KOMD7GNSFPS7SBS2O4U5K3RJDLWA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "79.127.219.34:33140"}
2024-06-30T22:33:47Z    INFO    piecestore      upload canceled {"Process": "storagenode", "Piece ID": "LAL3RAHNGQGAE5UQGLFA6EFIQBS5ZELKBJ7HBWXKWHL7JV3FIOZQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT_REPAIR", "Remote Address": "128.140.109.185:36774", "Size": 1114112}
2024-06-30T22:33:49Z    INFO    piecestore      upload canceled {"Process": "storagenode", "Piece ID": "L6GLP6TYGUYDVS5OGNTJY6UDNGYTCKJOIOVN6VFGRXBQKQIYTTRQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "109.61.92.81:43602", "Size": 196608}
2024-06-30T22:33:49Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"Process": "storagenode", "Piece ID": "FM236SLQVP5VAC6M5SXNOC4HFOCKNPLV26GKRQQZ7OCAKWWZZGPA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "79.127.219.42:41832"}
2024-06-30T22:33:50Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"Process": "storagenode", "Piece ID": "KGKFUHDNMOHCUE75TBNFBAW5IB4XIB4APEU6JH6E2DFBZSRGIJMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.205.226:44006"}
2024-06-30T22:33:50Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"Process": "storagenode", "Piece ID": "477O3ONTDHULJXEPB4W6HG4ZUXUFRWRNVSWHMHYHUCIFSS3PCFPA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "79.127.205.242:56298"}
2024-06-30T22:33:50Z    INFO    piecestore      upload canceled (race lost or node shutdown)    {"Process": "storagenode", "Piece ID": "N3WHYA5O4L6SSSJZSZEEZTSSKXHCAFL45PIHFY5YRKENWLEJ7EMQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "109.61.92.84:52922"}
2024-06-30T22:33:50Z    INFO    piecestore      upload canceled {"Process": "storagenode", "Piece ID": "ZB6IH4HU2265JK6A6XHLG4A6NHETTRNYSTSUGMWBICR6GEFZHPLA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT_REPAIR", "Remote Address": "5.161.61.255:38192", "Size": 196608}
2024-06-30T22:33:50Z    INFO    piecestore      upload canceled {"Process": "storagenode", "Piece ID": "AAHJW7FDRHHYR6GSWK3JDXHYPEXDA2GUNZNUQBL7OEOTATQYNTGA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "109.61.92.70:35504", "Size": 196608}
2024-06-30T22:33:50Z    INFO    piecestore      upload canceled {"Process": "storagenode", "Piece ID": "HQ4B2NF4POAXXNKXKGVPUM2XRUGJQWTKC36IOZF3KQI52LSFEAAQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT_REPAIR", "Remote Address": "5.161.251.58:38328", "Size": 1376256}
2024-06-30T22:33:51Z    INFO    piecestore      upload canceled {"Process": "storagenode", "Piece ID": "VTMTWE4XPJC6JNXXAE7WEGEHVWNDJEWRTZMBS5A4O2U5RPVDS3WQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT_REPAIR", "Remote Address": "116.203.134.162:36308", "Size": 1376256}
2024-06-30T22:33:57Z    ERROR   failure during run      {"Process": "storagenode", "error": "database is locked"}
2024-06-30T22:33:57Z    FATAL   Unrecoverable error     {"Process": "storagenode", "error": "database is locked"}
2024-06-30T22:34:05Z    INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-06-30T22:34:05Z    INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2024-06-30T22:34:05Z    INFO    Operator email  {"Process": "storagenode", "Address": "********"}
2024-06-30T22:34:05Z    INFO    Operator wallet {"Process": "storagenode", "Address": "********"}
2024-06-30T22:34:14Z    INFO    server  kernel support for server-side tcp fast open remains disabled.  {"Process": "storagenode"}
2024-06-30T22:34:14Z    INFO    server  enable with: sysctl -w net.ipv4.tcp_fastopen=3  {"Process": "storagenode"}
2024-06-30T22:34:15Z    INFO    Telemetry enabled       {"Process": "storagenode", "instance ID": "123NZkXddWMygQaDcMcu6hduatK2UXyGuPW74g1wRBfWtaj62L6"}
2024-06-30T22:34:15Z    INFO    Event collection enabled        {"Process": "storagenode", "instance ID": "123NZkXddWMygQaDcMcu6hduatK2UXyGuPW74g1wRBfWtaj62L6"}
2024-06-30T22:34:19Z    INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
2024-06-30T22:34:19Z    INFO    Anonymized tracing enabled      {"Process": "storagenode"}
2024-06-30T22:34:19Z    INFO    Operator email  {"Process": "storagenode", "Address": "storj@sl.joumail.ca"}
2024-06-30T22:34:19Z    INFO    Operator wallet {"Process": "storagenode", "Address": "0xfc4595f635f57F72243D76E7565A83Abb9766405"}
2024-06-30T22:34:21Z    INFO    server  kernel support for server-side tcp fast open remains disabled.  {"Process": "storagenode"}
2024-06-30T22:34:21Z    INFO    server  enable with: sysctl -w net.ipv4.tcp_fastopen=3  {"Process": "storagenode"}
2024-06-30T22:34:21Z    INFO    Telemetry enabled       {"Process": "storagenode", "instance ID": "123NZkXddWMygQaDcMcu6hduatK2UXyGuPW74g1wRBfWtaj62L6"}
2024-06-30T22:34:21Z    INFO    Event collection enabled        {"Process": "storagenode", "instance ID": "123NZkXddWMygQaDcMcu6hduatK2UXyGuPW74g1wRBfWtaj62L6"}
2024-06-30T22:34:21Z    INFO    db.migration    Database Version        {"Process": "storagenode", "version": 57}
2024-06-30T22:39:06Z    INFO    preflight:localtime     start checking local system clock with trusted satellites' system clock.        {"Process": "storagenode"}
2024-06-30T22:39:07Z    INFO    preflight:localtime     local system clock is in sync with trusted satellites' system clock.    {"Process": "storagenode"}
2024-06-30T22:39:07Z    INFO    Node 123NZkXddWMygQaDcMcu6hduatK2UXyGuPW74g1wRBfWtaj62L6 started        {"Process": "storagenode"}
2024-06-30T22:39:07Z    INFO    Public server started on [::]:28967     {"Process": "storagenode"}
2024-06-30T22:39:07Z    INFO    Private server started on 127.0.0.1:7778        {"Process": "storagenode"}
2024-06-30T22:39:07Z    INFO    pieces:trash    emptying trash started  {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-30T22:39:07Z    INFO    bandwidth       Persisting bandwidth usage cache to db  {"Process": "storagenode"}

So, with "database is locked" errors, you likely have a situation where the drive is too slow for the amount of traffic it is handling. You can move the databases to an SSD if that is an option. It can also depend a lot on your configuration: you don't want the storage to be on a network drive or some unusual RAID setup. Can you tell us a bit about your config?
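
A minimal sketch of what that change looks like, assuming a Docker deployment with the SSD mounted into the container at /app/dbs (the paths are just examples, and double-check the storage2.database-dir option against the official docs for your version):

# extra mount added to the docker run command:
#   --mount type=bind,source=/mnt/ssd/storj-dbs,destination=/app/dbs
# then, in config.yaml, keep the databases there:
storage2.database-dir: /app/dbs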


Thanks for that, this is really helpful. I do have an SSD available, so I will try to transfer the databases to that one and see if it works (before I get suspended…). My HDD is only 5400 RPM and the databases are stored there. I will let you know, regardless of the outcome.

Edit: the databases have been moved to an SSD. Fingers crossed that this solves the issue.
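
For reference, the move itself was roughly as follows (example paths; the node has to be stopped while the *.db files are copied):

docker stop -t 300 storagenode
docker rm storagenode
cp -a /mnt/hdd/storagenode/storage/*.db /mnt/ssd/storj-dbs/
# re-create the container with the extra SSD mount and storage2.database-dir set as above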


Update: the node has now been running successfully for over 12 hours, with only a couple of interruptions before that (nothing compared to what it was). The average ingress bandwidth has picked up again (twice what it was before).

The vitals are definitely improving; I think the patient will make it!
