Retain process is not running on my us1.storj.io node

Greetings,

The retain process is not currently running on my US Storage node.

I have been monitoring the process run cycle on my nodes, and I noted that the process runs for the other three satellites, but not the US one.

Upon reviewing the logs, it looks like the 3/1/2024 run hit errors: it moved pieces to trash but failed to delete six of them, and the retain process has not run since.

I restarted the node on 3/10/2024, but retain did not run as expected on 3/15/2024. See the attached screenshot summarizing the previous two weeks of retain process runs. Below are the relevant errors from the 3/1/2024 run; there are no retain errors for the US node dated after 3/1/2024.

The node runs on Windows, with storage on an iSCSI drive backed by a NAS; the data on the NAS is in Synology's SHR-2 RAID configuration.

I would appreciate your advice on debugging this issue.

Thanks in advance.

2024-03-01T11:23:11-06:00 INFO retain Prepared to run a Retain request. {“Created Before”: “2024-02-21T17:59:59Z”, “Filter Size”: 4100003, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
2024-03-01T11:23:11-06:00 INFO lazyfilewalker.gc-filewalker starting subprocess {“satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
2024-03-01T11:23:11-06:00 INFO lazyfilewalker.gc-filewalker subprocess started {“satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
2024-03-01T11:23:11-06:00 INFO piecestore download started {“Piece ID”: “7X7Y5MSVCUJNRHKRAPR5ICOZDOM47BCI3ZJVMLVOSYFDV4ISCNHQ”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Action”: “GET”, “Offset”: 0, “Size”: 217600, “Remote Address”: “79.127.223.129:52630”}
2024-03-01T11:23:12-06:00 INFO lazyfilewalker.gc-filewalker.subprocess Database started {“satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “process”: “storagenode”}
2024-03-01T11:23:12-06:00 INFO lazyfilewalker.gc-filewalker.subprocess gc-filewalker started {“satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “process”: “storagenode”, “createdBefore”: “2024-02-21T17:59:59Z”, “bloomFilterSize”: 4100003}
2024-03-01T12:02:28-06:00 INFO lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {“satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “piecesCount”: 34948349, “piecesSkippedCount”: 0, “process”: “storagenode”}
2024-03-01T12:02:29-06:00 INFO lazyfilewalker.gc-filewalker subprocess finished successfully {“satelliteID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
2024-03-01T12:10:25-06:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable
2024-03-01T12:10:25-06:00 WARN retain failed to delete piece {“Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “4DK27CWN232HFKYML3OVV76OAAPLI545KDQB46DDFXJBLA3YD5GA”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:373\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2024-03-01T12:43:24-06:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable
2024-03-01T12:43:24-06:00 WARN retain failed to delete piece {“Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “HFLK3WGVOCC7TTXNRCC3CICYP2MFXQG3XUCY7A5KTWJFZ334Q5ZA”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:373\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2024-03-01T12:53:48-06:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable
2024-03-01T12:53:48-06:00 WARN retain failed to delete piece {“Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “KTKX4KVTA6TWM2RXPPEDEVH33DM6UR2JONH3H232P7MDU5HWEY3A”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:373\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2024-03-01T12:54:07-06:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable
2024-03-01T12:54:07-06:00 WARN retain failed to delete piece {“Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “KWSRNMTO4VIZ7OC2R2NDWMGYVRP3BBH5N625T3SZPLKHXERDVAUA”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:373\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2024-03-01T13:03:31-06:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable
2024-03-01T13:03:31-06:00 WARN retain failed to delete piece {“Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “NP33MQY6MYXGWNWVJFXXFW2ZH5SZUW3DEWU4RR5YE2WGPOR22DYA”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:373\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2024-03-01T13:20:04-06:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable
2024-03-01T13:20:04-06:00 WARN retain failed to delete piece {“Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “SX2PXXJHZBJ3DDKAEOLH6JJYBTIXQCDPPPQNZCWJVCLNM66MF7UQ”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:373\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75”}
2024-03-01T13:42:37-06:00 INFO retain Moved pieces to trash during retain {“Deleted pieces”: 1736436, “Failed to delete”: 6, “Pieces failed to read”: 0, “Pieces count”: 34948349, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Duration”: “2h19m25.3197935s”, “Retain Status”: “enabled”}


It seems these pieces were already removed by the collector process earlier. Could you please search for those piece IDs in your logs to see their lifecycle?
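On a Linux node, tracing one piece ID through the log can be sketched like this (the sample log lines and piece ID below are placeholders for illustration only; in real use, point the search at your actual storagenode.log — on Windows, `sls`/`Select-String` serves the same purpose):

```shell
# Build a tiny placeholder log to demonstrate the search.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-03-01T11:55:27-06:00 INFO collector deleted expired piece {Piece ID: 4DK27CWN232HFKYML3OVV76OAAPLI545KDQB46DDFXJBLA3YD5GA}
2024-03-01T12:10:25-06:00 WARN retain failed to delete piece {Piece ID: 4DK27CWN232HFKYML3OVV76OAAPLI545KDQB46DDFXJBLA3YD5GA}
EOF

# Every event for this piece, oldest first: here the collector deleted it
# before retain tried to trash it, which explains "file does not exist".
lifecycle=$(grep "4DK27CWN232HFKYML3OVV76OAAPLI545KDQB46DDFXJBLA3YD5GA" "$LOG")
echo "$lifecycle"
rm -f "$LOG"
```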

Here is the log information for those six pieces:

2024-03-01T11:55:27-06:00 INFO collector deleted expired piece {Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: 4DK27CWN232HFKYML3OVV76OAAPLI545KDQB46DDFXJBLA3YD5GA}
2024-03-01T11:55:27-06:00 INFO collector deleted expired piece {Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: HFLK3WGVOCC7TTXNRCC3CICYP2MFXQG3XUCY7A5KTWJFZ334Q5ZA}
2024-03-01T11:55:27-06:00 INFO collector deleted expired piece {Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: KTKX4KVTA6TWM2RXPPEDEVH33DM6UR2JONH3H232P7MDU5HWEY3A}
2024-03-01T11:55:27-06:00 INFO collector deleted expired piece {Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: KWSRNMTO4VIZ7OC2R2NDWMGYVRP3BBH5N625T3SZPLKHXERDVAUA}
2024-03-01T11:55:27-06:00 INFO collector deleted expired piece {Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: NP33MQY6MYXGWNWVJFXXFW2ZH5SZUW3DEWU4RR5YE2WGPOR22DYA}
2024-03-01T11:55:27-06:00 INFO collector deleted expired piece {Satellite ID: 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, Piece ID: SX2PXXJHZBJ3DDKAEOLH6JJYBTIXQCDPPPQNZCWJVCLNM66MF7UQ}

So they were already deleted automatically due to expiration. Did retain finish for this satellite?

sls retain "$env:ProgramFiles\Storj\Storage Node\storagenode.log" | sls "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S" | sls "started|finished" | select -last 4

It looks like it finished:

2024-03-01T13:42:37-06:00 INFO retain Moved pieces to trash during retain {“Deleted pieces”: 1736436, “Failed to delete”: 6, “Pieces failed to read”: 0, “Pieces count”: 34948349, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Duration”: “2h19m25.3197935s”, “Retain Status”: “enabled”}


My apologies, I missed your code. I will run it.

It will give the same result :slight_smile:

I ran it for all four nodes to see if there was any difference in the results.
I didn’t see any new retain commands in the log.

Oh, you are correct. For retain you need to search not for started|finished, but for Prepare|Move:

sls retain "$env:ProgramFiles\Storj\Storage Node\storagenode.log" | sls "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S" | sls "Prepare|Move" | select -last 4
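For operators running on Linux/docker instead of Windows, the equivalent check can be sketched with a grep pipeline (the sample log lines below are placeholders; point it at your own storagenode.log path):

```shell
# Placeholder log with one Prepare/Move pair, for illustration.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-03-01T11:23:11-06:00 INFO retain Prepared to run a Retain request. {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-03-01T13:42:37-06:00 INFO retain Moved pieces to trash during retain {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
EOF

# Last four Prepare/Move retain events for the US1 satellite.
result=$(grep retain "$LOG" \
  | grep 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S \
  | grep -E "Prepare|Move" \
  | tail -4)
echo "$result"
rm -f "$LOG"
```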

That generated a result. The output below looks identical to the 3/1/2024 data I snipped previously:

C:\Program Files\Storj\Storage Node\storagenode.log:3708959:2024-03-01T11:23:11-06:00 INFO retain Prepared to run a Retain request. {“Created Before”: “2024-02-21T17:59:59Z”, “Filter Size”: 4100003, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”}
C:\Program Files\Storj\Storage Node\storagenode.log:3784584:2024-03-01T13:42:37-06:00 INFO retain Moved pieces to trash during retain {“Deleted pieces”: 1736436, “Failed to delete”: 6, “Pieces failed to read”: 0, “Pieces count”: 34948349, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Duration”: “2h19m25.3197935s”, “Retain Status”: “enabled”}


@DanielTB80 has your retain issue been fixed?

I am experiencing a similar issue. I have three large nodes of about 10 TB each, while being paid for around 7 TB. Some of my smaller nodes have received the filters from US1 without issue.

I might not have received a bloom filter on Mar 17 (I forgot to check), but I am likely in the same boat as you: Mar 1 was the last time I got one from US1. The week of Mar 24 was an instance where garbage collection did not run on any nodes. The week of Mar 31 these nodes did not receive a filter, and now it is repeating again for the week of Apr 7.

Sample logs from one of my nodes:

2024-03-30T23:48:14-07:00	INFO	lazyfilewalker.gc-filewalker.subprocess	gc-filewalker started	{"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "bloomFilterSize": 1078086, "process": "storagenode", "createdBefore": "2024-03-26T17:59:59Z"}
2024-04-04T12:55:35-07:00	INFO	lazyfilewalker.gc-filewalker.subprocess	gc-filewalker started	{"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode", "createdBefore": "2024-02-17T13:02:55Z", "bloomFilterSize": 30330}
2024-04-05T05:39:13-07:00	INFO	lazyfilewalker.gc-filewalker.subprocess	gc-filewalker started	{"process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "process": "storagenode", "createdBefore": "2024-04-01T17:59:59Z", "bloomFilterSize": 283992}
2024-04-06T23:00:59-07:00	INFO	lazyfilewalker.gc-filewalker.subprocess	gc-filewalker started	{"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode", "createdBefore": "2024-04-02T17:59:59Z", "bloomFilterSize": 1092462}

It receives bloom filters from all satellites but US1. I will note that all three of these nodes have been online since last weekend, due to a node update.

No change; I have not seen a us1.storj.io retain run since March 1st.

The other satellites continue to show retain runs completing successfully.

root@datengrab:~# grep retain /home/storagenode/logs/storagenode1.log.1 | grep 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S | grep -v 'About to move piece to trash'
2024-04-06T06:14:24+02:00       INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "/mnt/sn1/storagenode1/storagenode/retain", "Created Before": "2024-03-27T17:59:59Z", "Filter Size": 2277668, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-04-06T06:31:55+02:00       INFO    retain  Moved pieces to trash during retain     {"Process": "storagenode", "cachePath": "/mnt/sn1/storagenode1/storagenode/retain", "Deleted pieces": 1353374, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 8400999, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "17m31.453934572s", "Retain Status": "enabled"}
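To see at a glance which satellites' retain runs have completed, a per-satellite summary can be sketched like this (the sample log lines, timestamps, and log path below are placeholders for illustration; point LOG at your real storagenode log):

```shell
# Placeholder log with completed retain runs for two satellites.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2024-04-06T06:31:55+02:00 INFO retain Moved pieces to trash during retain {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-04-05T05:40:02-07:00 INFO retain Moved pieces to trash during retain {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-03-30T23:59:01-07:00 INFO retain Moved pieces to trash during retain {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
EOF

# One line per satellite with its count of completed retain runs;
# a satellite missing from this list has had no retain run in the log.
summary=$(grep 'Moved pieces to trash' "$LOG" \
  | grep -o '"Satellite ID": "[^"]*"' \
  | sort | uniq -c)
echo "$summary"
rm -f "$LOG"
```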