1.107.3 Error: collector unable to delete pieces

As far as I know, it’s not new: Search results for 'WARN collector unable to delete piece' - Storj Community Forum (official). It means that the GC has been faster than the TTL collector, or that the customer deleted the object before its expiration.
And you can set a custom log verbosity:

But 100 GB log files in less than 24 hours filled with this started only on 1.114.5, so I think there is something more wrong here. The previous version worked for two weeks without a problem.
So how can we investigate it? For now I have switched the log to ERROR level, but even if the node goes offline, no error appears in the log, only failed downloads or uploads, and even those are not many.

You may search for the PieceID from the log in the trash folder:
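For example, a rough sketch (not official tooling): it assumes the usual filestore layout, where the first two characters of the base32 piece ID become a subdirectory and the remainder the file name; the storage path is a placeholder and the Piece ID is the one quoted later in this thread.

# adjust STORAGE to your node's data directory (placeholder path)
STORAGE=/mnt/storagenode/storage
# example Piece ID taken from a log line quoted later in this thread
PIECE=PK5X3TSOGMOFUULENPT7V2REBVWPW6A5KDDZGQK5DT23XN5VDT2Q

# search case-insensitively for the remainder of the piece ID (bash/zsh substring)
find "$STORAGE/trash" -iname "${PIECE:2}*"
# ...and check whether it is still among the live blobs as well
find "$STORAGE/blobs" -iname "${PIECE:2}*"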

It’s just that someone deleted pieces before the expiration date, so they were collected by GC.

We understand that. The point is that this is not an error, not even a warning, and it definitely does not deserve 1 kB worth of stack traces in the log file every time this happens. Which is multiple times per second, resulting in tens of gigabytes of text garbage on storage nodes.

This has been brought up before, and such disregard for storage resources does not look good; it also makes finding anything remotely important in that log file entirely pointless.

The fix requires a one-line change to reduce the log severity from error to info, and does not deserve lengthy discussion.

Let’s split this thread to its own topic please.

Edit: the last few log sizes on my node (rotated daily at midnight):

storj# ls -hl storj.log*(om[1,6])
-rw-------  1 root  wheel   5.6M Oct 10 00:17 storj.log
-rw-------  1 root  wheel    46M Oct 10 00:00 storj.log.0.bz2
-rw-------  1 root  wheel   2.0G Oct  9 00:00 storj.log.1.bz2
-rw-------  1 root  wheel    48M Oct  8 00:00 storj.log.2.bz2
-rw-------  1 root  wheel    46M Oct  7 00:00 storj.log.3.bz2
-rw-------  1 root  wheel    50M Oct  6 00:00 storj.log.4.bz2

See that 2.0 GiB bz2-compressed, highly redundant text file? I’m afraid to decompress it to check the actual size…


Yes, I know, but it is reported as a new issue in that version (v1.114.6).


Yeah my other node updated 10 hours ago:

Same story…

My suspicion is that bz2 is probably really, really good at compressing the always byte-for-byte identical stack trace, but then it hits that random 50-character-long piece ID which never repeats and is impossible to predict.

This might be a strange request, but could you actually try recompressing that file into xz? Something like bzcat storj.log.1.bz2 | xz -9e > storj.log.1.xz will do. I would be really curious whether you’d see the same as what I usually see with data I’ve recompressed this way, i.e., a 10× decrease in size going from bzip2 to xz.

It’s not about Storj at all, I just like collecting data on efficiency of different compression methods for different use cases.

compression    size   bytes        time
raw            58G    62440008373  -
xz -0 -T 48    2.4G   2582107212   2m 7s
xz -6 -T 48    2.0G   2133895724   8m 20s
xz -9e -T 48   1.9G   2078745676   2h 16m 8s
bzip2          2.0G   2121659531   4h 33m 25s
xz -0          2.4G   2582107212   33m 50s
xz -6          2.0G   2133895724   2h 58m 35s

CPU: 2× Intel Xeon E5-2680 v3 @ 2.5 GHz, 12 cores each (24 cores total), clamped to 1.2 GHz.


Is there a fix incoming? Otherwise I’ll play with the log level for a while, but I’d rather just keep everything running; this is just a huge I/O waste and extra wear and tear.

What kind of fix are you expecting?

This is normal behavior when the client deletes TTL data before it expires and the garbage collector moves it to the trash. The TTL collector then complains that the piece it should delete is missing.
The opposite situation can also happen: the TTL collector was faster, and then the GC will complain.

I understand the behaviour in general. But something must have changed, as the log files have been getting spammed MASSIVELY since the last update, which wasn’t the case before. So my read is that something changed on the node side, not in client delete behaviour. It started on both nodes after the update to v1.114.6, and it hasn’t recovered since; it just blasts out logs continuously.

And I was hoping that at some point, once it had run through everything it wants to delete but which is no longer there, it would stop. But it just looks like it logs continuously…

The fix is to the log level: make it INFO and do not print the stack trace.

It sounds like the increase in log activity may be related to the new TTL flat file storage that was automatically enabled in v1.113. With the recent update (v1.114), the system now checks both the new flat file storage and the previous TTL database, which is likely why it’s finding many expired pieces that were not collected earlier and ended up going through GC instead.

This surge in log messages is probably due to the system catching up on collecting these overdue pieces. Fortunately, this should be a temporary situation, and once all the pieces are properly collected, the log activity should return to normal.

I agree it might be worth considering adjusting the log level for these specific messages, as suggested by many already, but a more extensive fix doesn’t appear to be needed at this stage.

Yeah, I thought the same… hence I also let it run.
It has actually stopped now, it just took way, way longer than expected:

More than 3 days, almost 80 hours. Back to normal on one node. The other one is still going through it…

Two nodes produced 100 GB of WARN logs in a short period of time. This happened after the update to 1.114.6:

{"L":"WARN","T":"2024-10-18T22:00:54Z","N":"collector","M":"unable to delete piece","Process":"storagenode","Satellite ID":"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S","Piece ID":"PK5X3TSOGMOFUULENPT7V2REBVWPW6A5KDDZGQK5DT23XN5VDT2Q","error":"pieces error: filestore error: file does not exist","errorVerbose":"pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:126\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:350\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:329\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:375\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:111\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

I’m switching to ERROR-level logging only; this huge amount of logging is not helpful at all.

Storagenode file deletion worker threads:


Thank you! That’s clear. I used to have a 60 GB dataset in bzip2 that, after recompressing into xz, went well below 10 GB. Not here though; I strongly suspect you store 2 GB of piece IDs in that file :rofl:


Please submit a PR or report an issue on our GitHub, because I didn’t see a change for that.

You may use a custom log level for this chore, see --log.custom-level, or change it temporarily via the debug port: Guide to debug my storage node, uplink, s3 gateway, satellite.
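For example, a hedged sketch: the chore name collector is taken from the N field of the log line quoted above, and the name=level syntax follows that guide, so adjust it to your own setup.

# as a command-line flag for the storagenode process
storagenode run ... --log.custom-level=collector=ERROR

# or the equivalent line in config.yaml (restart required)
# log.custom-level: collector=ERROR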


This is an old problem. I use this custom log level to get rid of it.
