Us1.storj.io disk average 30% lower than actual disk usage

After the recent announcements of payout cuts, I took a look at the dashboard of my node.
I quickly noticed that the “Average Disk Space Used This Month” was around 20% lower than the “Total Disk Space”.
After checking last month's payouts and the actual disk usage of the storage folder on my HDD, I concluded that this was not just a display issue: there was around 2.2 TB of unaccounted data on my node.

After looking through the forum, I found this post, which describes a similar issue: 4 TB node has been full the past few months, still only 2.17 TBm storage paid - #48 by Ottetal
However, that seems to have been solved by a satellite update months ago, so there is nothing I can or need to do about it.

I also found this post about forgetting the recently removed test satellites: How To Forget Untrusted Satellites
I followed the steps described there, and indeed around 1 TB of data was removed for those satellites. I also repeated the steps to remove any possible data from the decommissioned stefan-benten satellite, as my node is old enough to have used it.

That still left me with >1 TB of unaccounted data. I used ncdu and the satellite info post to check how much disk space each satellite was using and compared that to the averages these satellites reported in the dashboard. I also used the earnings.py script with the debug lines enabled to verify my findings.
The numbers are correct for all satellites except us1.storj.io, which consistently reports 1.1 TB less than it actually uses:

├────────────────────────────────┼─────────────┼──────────────────────────┼─────────────────────────────────────────────────────────────┤
│ us1.storj.io:7777 (OK)         │             │                          │  $  1.49/TBm $  6.00/TB  $  6.00/TB        0%        100%   │
│              2019-05-14    54  │   $   0.03  │   0.00%   0.00%   0.05%  │  $  3.2232   $  0.7501   $  0.0855  -$  0.0000   $  4.0588  │
Disk Average So Far (debug):   2.29 TB  >> 68% of expected   3.38 TB <<
├────────────────────────────────┼─────────────┼──────────────────────────┼─────────────────────────────────────────────────────────────┤ +
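
For anyone wanting to reproduce the on-disk side of that comparison, a rough sketch (this assumes the default layout, where each subfolder of blobs is named after the base32-encoded satellite ID, and /path/to/storage is a placeholder for your storage directory):

du -sh /path/to/storage/blobs/*

Each folder name can then be matched to a satellite via the satellite info post mentioned above.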

I run the node on a Linux host (Ubuntu 20.04) with Docker and Watchtower. The file system is ext4, and there is still more than enough free space on the disk as well as within the configured maximum storage for the node.

Looking through the logs, I only found one retain request from that satellite, and it failed to delete a piece:

storagenode    | 2023-10-25T12:47:55Z	INFO	retain	Prepared to run a Retain request.	{"process": "storagenode", "Created Before": "2023-10-18T17:59:59Z", "Filter Size": 2097155, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storagenode    | 2023-10-25T14:01:25Z	WARN	retain	failed to delete piece	{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "NFYNEPEQ23263EBAUG3W5CL7JCVMAY3JJ3BTD4L7PFOCVU7I4VYA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:364\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

Retain requests for all other satellites work as expected.

Is there anything I can do to fix this? Is there any way to tell the storage node to reindex what is there and compare it to what should be there?

Make sure your docker run command allows the node to run the filewalker on startup; this tends to resolve it once the filewalk finishes.

You can check the logs, or run the following outside of docker

ps aux | grep filew

to see if it is still running.

I think the filewalker works:

$ ps aux | grep filew
root      480817  4.3  0.0 743120 27076 ?        Sl   14:54   0:34 /app/storagenode used-space-filewalker --storage config/storage --info config/storage/piecestore.db --info2 config/storage/info.db --pieces config/storage --driver  --filestore.write-buffer-size 4.0 MiB --log.output stderr --log.encoding json --lower-io-priority=true

I can also see entries for the filewalker for the us1 satellite in the logs:

storagenode    | 2023-10-29T07:59:38Z	INFO	lazyfilewalker.used-space-filewalker	starting subprocess	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storagenode    | 2023-10-29T07:59:38Z	INFO	lazyfilewalker.used-space-filewalker	subprocess started	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
storagenode    | 2023-10-29T07:59:38Z	INFO	lazyfilewalker.used-space-filewalker.subprocess	Database started	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
storagenode    | 2023-10-29T07:59:38Z	INFO	lazyfilewalker.used-space-filewalker.subprocess	used-space-filewalker started	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
storagenode    | 2023-10-29T11:36:30Z	INFO	lazyfilewalker.used-space-filewalker.subprocess	used-space-filewalker completed	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode", "piecesTotal": 3591495481591, "piecesContentSize": 3578144538359}
storagenode    | 2023-10-29T11:36:30Z	INFO	lazyfilewalker.used-space-filewalker	subprocess finished successfully	{"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}

Looking at the value of piecesContentSize, it seems the filewalker also counted 3+ TB of data.
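
For reference, the byte counts in those log lines convert roughly like this (numfmt is part of GNU coreutils; the first line uses decimal TB, the second binary TiB):

numfmt --to=si 3578144538359    # ~3.6T (decimal)
numfmt --to=iec 3578144538359   # ~3.3T (binary, TiB)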

There is indeed a problem. On the US1 satellite, several of my nodes have around 0.6 TB of unpaid data, while there is no significant difference on the other satellites.

Another issue is that garbage collection for the US1 satellite never completes successfully, and it reports that a file does not exist.

Maybe the retain process is ending prematurely upon encountering a missing file, because when that happens I can’t find the log message with the summary like this one:

INFO	retain	Moved pieces to trash during retain	{"process": "storagenode", "num deleted": 56785, "Retain Status": "enabled"}

On my nodes I have a lot of missing-file errors, but all of them are for pieces that had expired earlier. For this big satellite the process hardly ever finishes:

2023-10-25T12:46:51Z	INFO	collector	deleted expired piece	{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "ADWMEIZXPT2PKWOGAJYD2G5KNNA5NVGH7OUEAVKH7QGPBEQQCCQA"}
2023-10-25T15:53:35Z	WARN	retain	failed to delete piece	{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "ADWMEIZXPT2PKWOGAJYD2G5KNNA5NVGH7OUEAVKH7QGPBEQQCCQA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:364\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

I can confirm this. The piece that is “missing” in my logs was deleted as expired just hours earlier:

storagenode    | 2023-10-25T13:03:59Z	INFO	collector	deleted expired piece	{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "NFYNEPEQ23263EBAUG3W5CL7JCVMAY3JJ3BTD4L7PFOCVU7I4VYA"}

I don’t see a message confirming that the retain request finished successfully, so my guess is this: the piece expired and was deleted, but when the retain request reached it, the node still tried to delete the file and the process crashed, so the rest of the pieces were never deleted.

You could probably turn on the debug port and check whether the filewalker is still running by using the /mon/ps endpoint.

I recall there were some calls to have this visible in the node’s UI; it would be nice for debugging…
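
A minimal sketch of that check, assuming the node's debug server has been pinned to a known address via debug.addr (127.0.0.1:5999 here is an arbitrary choice, not a default; adjust to your config):

curl -s http://127.0.0.1:5999/mon/ps | grep -i walk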

Hello @mdw,
Welcome to the forum!

Please try to create this piece using this instruction:

You may find the mapping between the satellite ID and the blobs folder name there:

Then monitor your logs. If your assumption is correct, the retain process should continue and move all deleted data to the trash. Please note: the retain process (also known as the Garbage Collector) needs several loops to move all deleted data to the trash due to how the Bloom filter works.
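
For illustration only, here is roughly what creating such a placeholder piece can look like for the piece from the log above, assuming the usual blobs layout (lowercased base32 piece ID, two-character prefix subfolder, .sj1 extension); please follow the linked instruction for the exact path, and make sure ownership matches the rest of the blobs:

cd /path/to/storage/blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa   # assumed us1 folder, per the mapping post
mkdir -p nf                                                                      # first two characters of the lowercased piece ID
touch nf/ynepeq23263ebaug3w5cl7jcvmay3jj3btd4l7pfocvu7i4vya.sj1                  # NFYNEPEQ... lowercased, remainder as the filename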

Hello!

Do not worry about the “retain - failed to delete piece” messages. This does not mean the satellite expects you to have that piece! It means that the filewalker found the piece while traversing your storage directory and identified it as garbage, but it had already been deleted by the time the garbage collector tried to delete it. We log this situation as a warning because it could be a sign of something wrong in the code if it happens very frequently, but it is fine if it only happens once in a while.

How did you identify that the process crashed? Do you have a log of what was output when it crashed? If it crashes when failing to delete a piece, that is definitely a bug we’ll want to fix. It should instead just log a warning and carry on.

The trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa folder has 1024 subfolders, but only a few of them contain files amounting to tens of megabytes; most are empty. In the past, after garbage collection completed normally, almost all 1024 folders contained tens of megabytes of files.
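
A quick way to see which of those trash subfolders actually hold data (the path is assumed to be under your storage directory; adjust as needed):

du -sh /path/to/storage/trash/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/* | sort -h | tail -n 20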

The retain requests received from the US1 satellite have a “Filter Size” of 2097155, and there is a significant difference between the disk space reported by the satellite and the actual disk space used locally. I have several newer nodes whose received filter size is not 2097155; on those nodes, the disk space reported by the satellite basically matches the actual local usage.

I never saw a success message for that retain request or for that satellite. That is why I assumed it must not have finished.

It could also just be that the piece was deleted “early” yesterday because my country switched to winter time, and the real reason is something else. I will check the logs again today and then try the debugging steps suggested in this thread.


I have looked through the logs again. Creating the “missing” piece didn’t help; instead I now see the same issue with a different piece:

First the node started a retain request:

storagenode    | 2023-11-01T13:34:47Z	INFO	retain	Prepared to run a Retain request.	{"process": "storagenode", "Created Before": "2023-10-25T17:59:59Z", "Filter Size": 2097155, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}

Then the collector deleted an expired piece with the id TQZ7KXLDAENJSJLTXCPAIGRD6MYZODAR6YQRGJRJD3424TPIL2GA:

storagenode    | 2023-11-02T01:55:09Z	INFO	collector	deleted expired piece	{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "TQZ7KXLDAENJSJLTXCPAIGRD6MYZODAR6YQRGJRJD3424TPIL2GA"}

A few hours later, the retain request failed to delete the same piece:

storagenode    | 2023-11-02T05:53:24Z	WARN	retain	failed to delete piece	{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "TQZ7KXLDAENJSJLTXCPAIGRD6MYZODAR6YQRGJRJD3424TPIL2GA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:364\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

Looking through the logs since last Sunday, I found three retain requests but only two summaries (“Moved pieces to trash during retain”). The summaries don’t contain the satellite ID, but they appear close to the retain requests from the other satellites (within less than 45 minutes).
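
For anyone who wants to repeat this check, something like the following lists both events side by side (assuming a docker container named storagenode; adjust to wherever your logs go):

docker logs storagenode 2>&1 | grep -E 'Prepared to run a Retain request|Moved pieces to trash during retain'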

It might not necessarily be the missing piece itself that is the issue; there are other things I noticed:

  • The filter size for US1 and EU1 is always "Filter Size": 2097155. There is probably a limit for that.
  • The EU1 retain request finished within 45 minutes, but the US1 retain request produced its error message more than half a day after it started.
  • The piece I recreated last time has not been touched or removed.

So the US1 retain request is very slow, always produces exactly one piece error, and never finishes. The pieces that are reported missing were expired either shortly before the retain request started or while it was still running.

Is it BTRFS somehow?

No, it’s EXT4 on a LUKS/dm-crypt encrypted RAID6.

If it were the hardware or file system, I would expect to have issues with more than just one satellite.
The EU1 satellite stores twice as much data, but has no issues.

This is related exclusively to the filesystem. If the garbage collector is unable to finish its work, the deleted data will remain on the disk.
I also see an error on my node:

2023-11-01T14:36:45Z   INFO    retain  Prepared to run a Retain request. {"process": "storagenode", "Created Before": "2023-10-25T17:59:59Z", "Filter Size": 2097155, "Satellite ID":"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-11-02T05:03:31Z   WARN    retain  failed to delete piece  {"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "23AJFTOZUDJCEUQRQRDG7FHIM63FKPYTO6XVLUMXBCQCKZ4PSMMA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:364\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

but not a finishing message like for the other satellites, e.g.:

2023-11-02T15:19:50Z   INFO    retain  Prepared to run a Retain request. {"process": "storagenode", "Created Before": "2023-09-19T17:00:07Z", "Filter Size": 364472, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-11-02T15:23:54Z   INFO    retain  Moved pieces to trash during retain {"process": "storagenode", "num deleted": 0, "Retain Status": "enabled"}

The full sequence:

2023-10-28T22:25:29Z   INFO    retain  Prepared to run a Retain request. {"process": "storagenode", "Created Before": "2023-10-24T17:59:59Z", "Filter Size": 1835488, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-10-29T00:53:37Z   INFO    retain  Moved pieces to trash during retain {"process": "storagenode", "num deleted": 140893, "Retain Status": "enabled"}
2023-11-01T14:36:45Z   INFO    retain  Prepared to run a Retain request. {"process": "storagenode", "Created Before": "2023-10-25T17:59:59Z", "Filter Size": 2097155, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-11-02T05:03:31Z   WARN    retain  failed to delete piece  {"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "23AJFTOZUDJCEUQRQRDG7FHIM63FKPYTO6XVLUMXBCQCKZ4PSMMA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:364\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-11-02T15:19:50Z   INFO    retain  Prepared to run a Retain request. {"process": "storagenode", "Created Before": "2023-09-19T17:00:07Z", "Filter Size": 364472, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-11-02T15:23:54Z   INFO    retain  Moved pieces to trash during retain {"process": "storagenode", "num deleted": 0, "Retain Status": "enabled"}
2023-11-03T09:15:06Z   INFO    retain  Prepared to run a Retain request. {"process": "storagenode", "Created Before": "2023-10-30T17:59:58Z", "Filter Size": 554450, "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2023-11-03T09:36:32Z   INFO    retain  Moved pieces to trash during retain {"process": "storagenode", "num deleted": 16467, "Retain Status": "enabled"}
2023-11-04T22:49:08Z   INFO    retain  Prepared to run a Retain request. {"process": "storagenode", "Created Before": "2023-10-31T17:59:59Z", "Filter Size": 1847037, "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-11-05T01:21:20Z   INFO    retain  Moved pieces to trash during retain {"process": "storagenode", "num deleted": 95503, "Retain Status": "enabled"}

So, I believe we have a bug in the retain process: it silently crashes if a file is not found.

And another one:

2023-10-24T20:29:13Z    INFO    piecestore      upload started  {"process": "storagenode", "Piece ID": "23AJFTOZUDJCEUQRQRDG7FHIM63FKPYTO6XVLUMXBCQCKZ4PSMMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 588397721856, "Remote Address": "172.18.0.1:42534"}
2023-10-24T20:29:13Z    INFO    piecestore      uploaded        {"process": "storagenode", "Piece ID": "23AJFTOZUDJCEUQRQRDG7FHIM63FKPYTO6XVLUMXBCQCKZ4PSMMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 768, "Remote Address": "172.18.0.1:42534"}
2023-11-01T21:22:35Z    INFO    collector       deleted expired piece   {"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "23AJFTOZUDJCEUQRQRDG7FHIM63FKPYTO6XVLUMXBCQCKZ4PSMMA"}
2023-11-02T05:03:31Z    WARN    retain  failed to delete piece  {"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "23AJFTOZUDJCEUQRQRDG7FHIM63FKPYTO6XVLUMXBCQCKZ4PSMMA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:404\n\tstorj.io/storj/storagenode/retain.(*Service).trash:364\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces:341\n\tstorj.io/storj/storagenode/retain.(*Service).Run.func2:221\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

The expired piece had already been deleted; the retain process then tried to delete it again and crashed.
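
For reference, the full lifecycle of any single piece can be pulled out of the logs the same way, e.g. for the piece above (zgrep also reads uncompressed files; adjust the log path to your own setup):

zgrep -h "23AJFTOZUDJCEUQRQRDG7FHIM63FKPYTO6XVLUMXBCQCKZ4PSMMA" /var/log/storj.log*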


I have some nodes with the same situation, but I am too lazy to do further investigation.

Thank you everyone for the logs! There was indeed a bug where we stopped processing files after failing to move one piece file to trash.

I have a fix in and hopefully it will be in the next release.


Can this maybe be checked for the used-space filewalker as well?
I have a couple of nodes where the actual drive space usage is lower than what the node is reporting, causing the disk not to be utilised fully and leaving 0.5 TB or so unused.
I tried running the lazy filewalker at startup on these affected nodes: it starts, it completes for some satellites, and on some nodes it completes successfully, but on other nodes the additional storagenode process is apparently still running in the background (two PIDs) with no apparent results.
At the moment I have two nodes where this has been running on what looks to be the US1 blobs for the third day already, with no apparent I/O load, so I suspect it might be stuck.

# zgrep walk /var/log/storj.log-20231105.gz 
2023-11-04T12:13:09Z    INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-11-04T12:13:09Z    INFO    lazyfilewalker.used-space-filewalker    subprocess started      {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-11-04T12:13:10Z    INFO    lazyfilewalker.used-space-filewalker.subprocess Database started        {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}
2023-11-04T12:13:10Z    INFO    lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started   {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode"}
2023-11-04T13:03:15Z    INFO    lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker completed {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "process": "storagenode", "piecesTotal": 410644946944, "piecesContentSize": 410540187136}
2023-11-04T13:03:15Z    INFO    lazyfilewalker.used-space-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2023-11-04T13:03:15Z    INFO    lazyfilewalker.used-space-filewalker    starting subprocess     {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-11-04T13:03:15Z    INFO    lazyfilewalker.used-space-filewalker    subprocess started      {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2023-11-04T13:03:16Z    INFO    lazyfilewalker.used-space-filewalker.subprocess Database started        {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
2023-11-04T13:03:16Z    INFO    lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started   {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "process": "storagenode"}
2023-11-04T21:35:22Z    INFO    lazyfilewalker.gc-filewalker    starting subprocess     {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-11-04T21:35:22Z    INFO    lazyfilewalker.gc-filewalker    subprocess started      {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2023-11-04T21:35:22Z    INFO    lazyfilewalker.gc-filewalker.subprocess Database started        {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode"}
2023-11-04T21:35:22Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "process": "storagenode", "createdBefore": "2023-10-31T17:59:59Z", "bloomFilterSize": 1616843}
2023-11-04T22:09:42Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker completed {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "piecesCount": 2739470, "piecesSkippedCount": 0, "process": "storagenode"}
2023-11-04T22:09:42Z    INFO    lazyfilewalker.gc-filewalker    subprocess finished successfully        {"process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}

# pidof storagenode
3139 2306

Tasks: 126 total,   1 running, 125 sleeping,   0 stopped,   0 zombie
%Cpu0  :  1.0 us,  1.0 sy,  0.0 ni, 92.9 id,  0.7 wa,  0.0 hi,  0.3 si,  4.1 st
%Cpu1  :  1.7 us,  1.0 sy,  0.0 ni, 92.7 id,  1.7 wa,  0.0 hi,  0.3 si,  2.6 st
%Cpu2  :  1.3 us,  2.0 sy,  0.0 ni, 94.6 id,  0.0 wa,  0.0 hi,  0.0 si,  2.0 st
%Cpu3  :  1.0 us,  1.3 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  1.3 st
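
To tell which of the two PIDs is the lazy filewalker subprocess (its parent should be the main storagenode process), something like this helps:

ps -o pid,ppid,etime,args -p 3139,2306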

@zip: Try attaching to the subprocess with strace. Assuming the subprocess is pid 2306:

sudo strace -p 2306 -f -e trace=file -tt

That should show output anytime any of the threads touches any file. If it’s really stuck, there should be little to no output (maybe just some SIGURGs getting sent, since the Go runtime uses that to schedule threads). If it’s calling stat() on files (which probably shows up as calls to fstatat64) then it’s making progress.
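
A lighter-weight alternative, if the strace output is too noisy (same assumed pid; reading another user's /proc/<pid>/io usually needs root), is to sample the kernel's per-process I/O counters twice and compare read_bytes/syscr:

sudo cat /proc/2306/io
sleep 60; sudo cat /proc/2306/io   # the counters should keep increasing if the walker is progressing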
