Two weeks working for free in the waste storage business :-(

There might be additional reasons too:

Not really, no. Yes, there have been other issues, and some haven’t been picked up with much priority. The big difference is that those issues were either being actively worked on or affected a much smaller subset of nodes, with far lower impact. This issue affects all nodes and has caused immense space overuse from garbage data for everyone.

I appreciate you keeping them sharp on other things, though. But I don’t share your general view that these issues are being mismanaged, which is why my post says this is uncharacteristically bad.

2 Likes

Where can I get this script?

1 Like

Thanks for the fast reply;
Here we go guys:
–node1:

July 2024 (Version: 14.0.0)                                             [snapshot: 2024-07-07 12:56:51Z]
REPORTED BY     TYPE      METRIC                PRICE                     DISK  BANDWIDTH        PAYOUT
Node            Ingress   Upload                -not paid-                      705.68 GB
Node            Ingress   Upload Repair         -not paid-                       19.91 GB
Node            Egress    Download              $  2.00 / TB (avg)               50.43 GB       $  0.10
Node            Egress    Download Repair       $  2.00 / TB (avg)               29.72 GB       $  0.06
Node            Egress    Download Audit        $  2.00 / TB (avg)               17.44 MB       $  0.00
Node            Storage   Disk Current Total    -not paid-             6.13 TB
Node            Storage              ├ Blobs    -not paid-             4.87 TB
Node            Storage              └ Trash  ┐ -not paid-             1.26 TB
Node+Sat. Calc. Storage   Uncollected Garbage ┤ -not paid-           870.91 GB
Node+Sat. Calc. Storage   Total Unpaid Data <─┘ -not paid-             2.13 TB
Satellite       Storage   Disk Last Report      -not paid-             4.00 TB
Satellite       Storage   Disk Average So Far   -not paid-             3.99 TB
Satellite       Storage   Disk Usage Month      $  1.49 / TBm (avg)  866.34 GBm                 $  1.29
________________________________________________________________________________________________________+

–node2:

July 2024 (Version: 14.0.0)                                             [snapshot: 2024-07-07 13:00:16Z]
REPORTED BY     TYPE      METRIC                PRICE                     DISK  BANDWIDTH        PAYOUT
Node            Ingress   Upload                -not paid-                      714.00 GB
Node            Ingress   Upload Repair         -not paid-                       19.87 GB
Node            Egress    Download              $  2.00 / TB (avg)               50.33 GB       $  0.10
Node            Egress    Download Repair       $  2.00 / TB (avg)               18.04 GB       $  0.04
Node            Egress    Download Audit        $  2.00 / TB (avg)                6.41 MB       $  0.00
Node            Storage   Disk Current Total    -not paid-             4.39 TB
Node            Storage              ├ Blobs    -not paid-             4.13 TB
Node            Storage              └ Trash  ┐ -not paid-           255.45 GB
Node+Sat. Calc. Storage   Uncollected Garbage ┤ -not paid-           591.25 GB
Node+Sat. Calc. Storage   Total Unpaid Data <─┘ -not paid-           846.70 GB
Satellite       Storage   Disk Last Report      -not paid-             3.54 TB
Satellite       Storage   Disk Average So Far   -not paid-             3.53 TB
Satellite       Storage   Disk Usage Month      $  1.49 / TBm (avg)  733.07 GBm                 $  1.09
________________________________________________________________________________________________________+
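
In case it helps to read these two reports: the calculated rows look like simple arithmetic on the lines above them (my reading of the output, not an official definition):

Uncollected Garbage ≈ Blobs - Disk Last Report      (node1: 4.87 TB - 4.00 TB ≈ 870.91 GB)
Total Unpaid Data   = Trash + Uncollected Garbage   (node1: 1.26 TB + 870.91 GB ≈ 2.13 TB)

Node2 works out the same way: 4.13 TB - 3.54 TB ≈ 591.25 GB of uncollected garbage, and 255.45 GB + 591.25 GB = 846.70 GB unpaid.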

I didn’t know that we might pause BF generation like this. Perhaps there is a bug. I pinged the team as well.

My nodes are “only” about half garbage as well: 5.27 TB out of 9.28 TB, to be precise… :man_facepalming:

5 Likes

@Alexey - I want to avoid any negativity - but this is what we have been calling out for weeks now, having ruled out all other causes.

Why are we only now approaching developers?

Thanks
CC

Even with no bloom filter, SLC should still delete the pieces by TTL.
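
If you want to sanity-check the TTL backlog yourself, something along these lines should work against the piece expiration database (a rough sketch, not an official procedure: the path, table, and column names are from memory and may differ per version, and querying a database that a running node is using can hit locks, so ideally run it against a copy):

# count pieces whose TTL has already passed but that are still tracked
sqlite3 /path/to/storage/piece_expiration.db \
  "SELECT count(*) FROM piece_expirations WHERE piece_expiration < datetime('now');"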

Can you maybe check /mon/ps output? I noticed that some of my nodes are hours behind. For example this one:

[792702298643105108,590377817016879825] storj.io/storj/storagenode/collector.(*Service).Collect() (elapsed: 9h5m52.90537853s)
 [995026780269330392,590377817016879825] storj.io/storj/storagenode/pieces.(*Store).GetExpired() (elapsed: 9h5m52.905361937s)
  [1197351261895555675,590377817016879825] storj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired() (elapsed: 9h5m52.905363011s)
   [8327577142405913580,590377817016879825] storj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0() (elapsed: 110.590232ms)
    [8529901624032138863,590377817016879825] storj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat() (elapsed: 110.576307ms)
     [8732226105658364147,590377817016879825] storj.io/storj/storagenode/blobstore/filestore.(*Dir).Stat() (elapsed: 110.573696ms)

Because there are a few very loud people in the community who call every bug highest priority. If everything gets highest priority, then nothing gets fixed. It is as simple as that. It would help if we could downgrade a few bugs and just tolerate them for now, to make sure the developers have more time to work on the important problems. At some point it is just too many context switches.

The other problem is that the community is becoming a hostile place. We are fixing bugs, but the same loud people keep insisting that the developers are doing something wrong. What do you think is going to happen? Every human being will just stop reading such demotivating speech, so the developers will simply stay away from the forum at some point, and we lose the healthy communication line we had before.

8 Likes

I think the devs are doing a great job! :heart_eyes:

When a disk fills: I expand. If some of that disk is trash: don’t care: that’s a problem that will be fixed: still expanding.

2 Likes

We could still spend time on filing a good bug report with as many details as possible. There is some middle ground.

1 Like

And that is exactly why I pick my battles and don’t turn one larger issue into “everything is horrible and you suck”. I don’t think that’s the case at all and I recognize and see the progress being made. But I think this is one worth fighting.

I don’t currently have my debug port open in the docker container and I don’t want to restart my node right now. But here is what I can tell you: my node is hard at work deleting expired data (I still have debug logging on, which is probably a bad idea for IO right now, but that’s how I know). I also know this node already had 5.5 TB of uncollected garbage before the TTL cleanup even kicked in. So the vast majority of it is not from being behind on TTL cleanup, but up to about 2.5 TB might be.

It also shows quite significant gaps where the collector doesn’t run. Is this expected behavior?

2024-06-30T02:21:22Z    INFO    collector       collect {"Process": "storagenode", "count": 1910}
2024-06-30T03:25:55Z    INFO    collector       collect {"Process": "storagenode", "count": 1915}
2024-06-30T04:17:29Z    INFO    collector       collect {"Process": "storagenode", "count": 942}
2024-06-30T07:21:49Z    INFO    collector       collect {"Process": "storagenode", "count": 1676}
2024-06-30T08:14:58Z    INFO    collector       collect {"Process": "storagenode", "count": 474}
2024-06-30T09:36:13Z    INFO    collector       collect {"Process": "storagenode", "count": 7556}
2024-06-30T10:49:50Z    INFO    collector       collect {"Process": "storagenode", "count": 12996}
2024-06-30T11:43:41Z    INFO    collector       collect {"Process": "storagenode", "count": 5698}
2024-06-30T12:38:14Z    INFO    collector       collect {"Process": "storagenode", "count": 6844}
2024-06-30T13:35:05Z    INFO    collector       collect {"Process": "storagenode", "count": 7032}
2024-06-30T14:37:16Z    INFO    collector       collect {"Process": "storagenode", "count": 7437}
2024-06-30T19:33:53Z    INFO    collector       collect {"Process": "storagenode", "count": 80966}
2024-07-01T14:57:10Z    INFO    collector       collect {"Process": "storagenode", "count": 333967}
2024-07-05T13:08:21Z    INFO    collector       collect {"Process": "storagenode", "count": 1453569}
2024-07-05T14:37:29Z    INFO    collector       collect {"Process": "storagenode", "count": 144170}
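
For reference, a list like the one above can be pulled with something along these lines (assuming a docker node named storagenode and the default log format; adjust the container name or log path to your setup):

# show the collector's completed runs and how many expired pieces each removed
docker logs storagenode 2>&1 | grep collector | grep '"count"'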

By default the storage node will open a random debug port. I don’t run a docker node. Is it possible to exec into it? That way you can still get the output even with no port forwarding.
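
For a docker node, something like this should work without a restart or a published port (a sketch, assuming the image ships a shell and wget, and that you can find the random port the debug server picked, e.g. from the startup log; if not, you can pin a fixed port via the debug address config, debug.addr if I recall the key correctly, but that needs a restart):

# PORT is a placeholder for whatever port the debug listener picked inside the container
docker exec -it storagenode /bin/sh -c 'wget -qO- http://127.0.0.1:PORT/mon/ps'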

@littleskunk - I totally get what you’re saying; I run a dev team. I don’t think I have been loud, hostile, or demotivating, in case that’s what you feel - all my posts have been respectful.

Equally - this may not be a bug, but a lack of bloom filters, right? Either way, this is pretty crucial to the whole Storj mantra, or reason for being, i.e. usable disk space. It also weakens any upgrade or expansion discussion if space is being knowingly wasted.

It kinda feels important to at least triage, don’t you think? I am at your disposal to help too; just tell me what other steps or information I can furnish you with.

Thanks
CC

1 Like

Bugs/features that have a direct effect on customers are the highest priority. You can’t feed the company with no customers.
Bugs/features that have a direct effect on SNOs are the 2nd highest priority. You can’t get more customers if there are no SNOs to serve them.
Bugs/features that have an indirect effect on customers are the 3rd highest priority. Sure, it would be nice to fix them as soon as possible, but we can kick the logo change down the line a bit.
Bugs/features that have an indirect effect on SNOs are the 4th highest priority. Good to fix them, not directly affecting the network performance/operation.
Every bug/feature that doesn’t fall into any of those categories (e.g. the satellite reporting how many active nodes there are) goes to the end of the line.

Currently this isn’t happening, which is why the loud people are getting louder and louder. Every single thing we have reported in the past year gets a “low priority because we have other things to deal with”. At some point one does have to wonder: if none of this is being worked on, what exactly is being worked on?

Allow me to correct you there. You are fixing bugs that should have been fixed a long time ago; that’s not the same as actively fixing bugs. This, as I said, is why communication is breaking down. SNOs will simply give up on reporting bugs if everything gets pushed down the line more and more, and just let the network collapse, with Storj wondering “why is the network full?”.

Sorting the GitHub open issues page by oldest shows that the oldest open bug is from 2019. In 5 years nobody could find the time to either close it or work on it?

But I have the bad habit of giving people the benefit of the doubt. Let’s say the devs are so busy that they barely even find the time to sleep. Doesn’t that raise a few red flags? It sounds to me like more devs should be hired, but that’s what I would personally do; I’m not saying you should do it.

3 Likes

Time to stop following this thread myself. Have fun complaining.

2 Likes

See ya around, take care.

2 Likes

What is the actual solution?

The actual solution is to get the storagenode part working as it should, then get the satellites to report the correct data (with respect to tracked pieces), which should fix timely bloom filter generation, which loops back to the storagenode part. Uncollected garbage starts getting collected as it should, space is freed and tracked, and everyone is happy again.

My $0.02. I’m sure others will disagree on the order of these events, and I’m cool with that.

1 Like

Need a fix. Soon. People around here are starting to think that you are not able to run the storage node side of the business… word is spreading :slight_smile:
It is not acceptable, and it compromises the credibility of the project.