1.107.3 Error: collector unable to delete pieces

As far as I know, it’s not new: Search results for 'WARN collector unable to delete piece' - Storj Community Forum (official). It means that the GC has been faster than the TTL collector, or that the customer deleted the object before its expiration.
And you can set a custom log verbosity:

But 100 GB log files in less than 24 hours filled with this started only on 1.114.5, so I think there is something more wrong here. The previous version worked for two weeks without a problem.
So how can we investigate it? For now I have switched the log to ERROR level, but even if the node goes offline, no error appears in the log, only failed downloads or uploads, and even those are not many.

You may search for the PieceID from the log in the trash folder:
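For example, a rough sketch (not official tooling): it assumes the usual filestore layout, where the first two characters of the base32 piece ID become a subdirectory and the remainder the file name; the storage path is a placeholder and the Piece ID is the one quoted later in this thread.

# adjust STORAGE to your node's data directory (placeholder path)
STORAGE=/mnt/storagenode/storage
# example Piece ID taken from a log line quoted later in this thread
PIECE=PK5X3TSOGMOFUULENPT7V2REBVWPW6A5KDDZGQK5DT23XN5VDT2Q

# search case-insensitively for the remainder of the piece ID (bash/zsh substring)
find "$STORAGE/trash" -iname "${PIECE:2}*"
# ...and check whether it is still among the live blobs as well
find "$STORAGE/blobs" -iname "${PIECE:2}*"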

It’s just that someone deleted pieces before the expiration date, so they were collected by GC.

We understand that. The point is that this is not an error, not even a warning, and it definitely does not deserve 1 kB worth of stack traces in the log file every time this happens. Which is multiple times per second, resulting in tens of gigabytes of text garbage on storage nodes.

This has been brought up before, and such disregard for storage resources does not look good; it also makes finding anything remotely important in that log file entirely pointless.

The fix requires a one-line change to reduce the log severity from error to info, and does not deserve lengthy discussion.

Let’s split this thread to its own topic please.

Edit: the last few log sizes on my node (rotated daily at midnight):

storj# ls -hl storj.log*(om[1,6])
-rw-------  1 root  wheel   5.6M Oct 10 00:17 storj.log
-rw-------  1 root  wheel    46M Oct 10 00:00 storj.log.0.bz2
-rw-------  1 root  wheel   2.0G Oct  9 00:00 storj.log.1.bz2
-rw-------  1 root  wheel    48M Oct  8 00:00 storj.log.2.bz2
-rw-------  1 root  wheel    46M Oct  7 00:00 storj.log.3.bz2
-rw-------  1 root  wheel    50M Oct  6 00:00 storj.log.4.bz2

See that 2.0 GiB bz2-compressed, highly redundant text file? I’m afraid to decompress it to check the actual size…


Yes, I know, but it is reported as a new issue in that version (v1.114.6).


Yeah my other node updated 10 hours ago:

Same story…

My suspicion is that bz2 is probably really, really good at compressing the always byte-for-byte identical stack trace, but then it hits that random 50-character-long piece ID which never repeats and is impossible to predict.

This might be a strange request, but could you actually try recompressing that file into xz? Something like bzcat storj.log.1.bz2 | xz -9e > storj.log.1.xz will do. I would be really curious whether you’d see the same as what I usually see with data I’ve recompressed this way, i.e., a 10× decrease in size going from bzip2 to xz.

It’s not about Storj at all, I just like collecting data on efficiency of different compression methods for different use cases.

compression    size   bytes        time
raw            58G    62440008373  -
xz -0 -T 48    2.4G   2582107212   2m 7s
xz -6 -T 48    2.0G   2133895724   8m 20s
xz -9e -T 48   1.9G   2078745676   2h 16m 8s
bzip2          2.0G   2121659531   4h 33m 25s
xz -0          2.4G   2582107212   33m 50s
xz -6          2.0G   2133895724   2h 58m 35s

CPU: 2× Intel Xeon E5-2680 v3 @ 2.5 GHz, 12 cores each (24 cores total), clamped to 1.2 GHz.


Is there a fix incoming? Otherwise I’ll play with the log level for a while, but I’d rather just keep everything running; this is just a huge I/O waste and extra wear and tear.

What kind of fix are you expecting?

This is normal behavior when the client deletes TTL data before it expires and the garbage collector moves it to the trash. The TTL collector then complains that the piece it should delete is missing.
The opposite situation can also happen: the TTL collector was faster, and then the GC will complain.

I understand the behaviour in general. But something must have changed, as the log files have been getting spammed MASSIVELY since the last update, which wasn’t the case before. So my read is that something changed on the node side, not in client delete behaviour. It started on both nodes after the update to v1.114.6, and it hasn’t recovered since; it just blasts out logs continuously.

And I was hoping that at some point, once it had run through everything it wants to delete but which is no longer there, it would stop. But it just looks like it logs continuously…

The fix is to the log level: make it INFO and do not print the stack trace.

It sounds like the increase in log activity may be related to the new TTL flat file storage that was automatically enabled in v1.113. With the recent update (v1.114), the system now checks both the new flat file storage and the previous TTL database, which is likely why it’s finding many expired pieces that were not collected earlier and ended up going through GC instead.

This surge in log messages is probably due to the system catching up on collecting these overdue pieces. Fortunately, this should be a temporary situation, and once all the pieces are properly collected, the log activity should return to normal.

I agree it might be worth considering adjusting the log level for these specific messages, as suggested by many already, but a more extensive fix doesn’t appear to be needed at this stage.

Yeah, I thought the same… hence I also let it run.
It has actually stopped now, it just took way, way longer than expected:

More than 3 days, almost 80 hours. Back to normal on one node. The other one is still going through it…

Two nodes produced 100 GB of WARN logs in a short period of time. This happened after the update to 1.114.6:

{"L":"WARN","T":"2024-10-18T22:00:54Z","N":"collector","M":"unable to delete piece","Process":"storagenode","Satellite ID":"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S","Piece ID":"PK5X3TSOGMOFUULENPT7V2REBVWPW6A5KDDZGQK5DT23XN5VDT2Q","error":"pieces error: filestore error: file does not exist","errorVerbose":"pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:126\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:350\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:329\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:375\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:111\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

I’m switching to ERROR-level logging only; this huge amount of logging is not helpful at all.

Storagenode file deletion worker threads:


Thank you! That’s clear. I used to have a 60 GB dataset in bzip2 that, after recompressing into xz, went well below 10 GB. Not here though; I strongly suspect you store 2 GB of piece IDs in that file :rofl:


Please submit a PR or report an issue on our GitHub, because I didn’t see a change for that.

You may use a custom log level for this chore, see --log.custom-level, or change it temporarily via the debug port: Guide to debug my storage node, uplink, s3 gateway, satellite.
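For example, a hedged sketch: the chore name collector is taken from the N field of the log line quoted above, and the name=level syntax follows that guide, so adjust it to your own setup.

# as a command-line flag for the storagenode process
storagenode run ... --log.custom-level=collector=ERROR

# or the equivalent line in config.yaml (restart required)
# log.custom-level: collector=ERROR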


This is an old problem. I use this custom log level to get rid of it.
