Internal errors - unexpected EOF, Collector can't find files, and untrusted satellites

Help, I'm getting a ton of these errors, 500+ of them just yesterday alone. Only a few untrusted satellite errors though. The node is stable on v111.0-rc; this was also happening on v110.3. I'm running three striped 8TB CMR drives with 2.1TB of space left, yet the dashboard says I'm only using 14TB. Audit/Suspension is 100% for all satellites.

Thanks in advance.

2024-08-17T16:40:58-04:00 ERROR piecestore upload failed {"Piece ID": "5PTJTDJVRKPCNS67NNW2W5UVFXIYBSOCRSN6EHI3H4ASKB74H6YQ", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address": "79.127.205.241:58124", "Size": 1376256, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:230"}

2024-08-17T16:25:45-04:00 ERROR piecestore upload failed {"Piece ID": "KCC72GPQTYZVNQCK7E7HAQVZK43OTYD57HP5ZHF2OYXX3HDPGUNQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address": "79.127.226.98:50730", "Size": 851968, "error": "manager closed: unexpected EOF", "errorVerbose": "manager closed: unexpected EOF\n\tgithub.com/jtolio/noiseconn.(*Conn).readMsg:225\n\tgithub.com/jtolio/noiseconn.(*Conn).Read:171\n\tstorj.io/drpc/drpcwire.(*Reader).read:68\n\tstorj.io/drpc/drpcwire.(*Reader).ReadPacketUsing:113\n\tstorj.io/drpc/drpcmanager.(*Manager).manageReader:230"}

2024-08-17T01:49:04-04:00 WARN console:service unable to get Satellite URL {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "error": "console: trust: satellite is untrusted", "errorVerbose": "console: trust: satellite is untrusted\n\tstorj.io/storj/storagenode/trust.init:29\n\truntime.doInit:6527\n\truntime.doInit:6504\n\truntime.doInit:6504\n\truntime.doInit:6504\n\truntime.doInit:6504\n\truntime.main:233"}

2024-08-16T14:33:36-04:00 WARN collector unable to delete piece {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "OZUG4WKHCUXZHWCUFSU6VHLXGZA26K5SZLAZETV6XRUJR5DVM5AQ", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:127\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:340\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:320\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:362\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:112\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

unexpected EOF is normal. It means the client aborted, or finished the upload or download without your node (long-tail cancellation), so the connection was simply closed from their end.
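
For the curious, here's a minimal Go sketch of why that shows up as "unexpected EOF" on your side. This is not the actual storagenode code, and the reader and sizes are made up; it just demonstrates the short-read behavior:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

func main() {
	// Pretend the client promised a 1 MiB piece but dropped the
	// connection after only a few bytes (e.g. long-tail cancellation:
	// enough other nodes already finished the upload).
	partial := strings.NewReader("only part of the piece arrived")

	buf := make([]byte, 1<<20) // we expected 1 MiB
	_, err := io.ReadFull(partial, buf)

	// ReadFull reports the short read as io.ErrUnexpectedEOF; the node
	// did nothing wrong, the sender simply stopped early.
	if errors.Is(err, io.ErrUnexpectedEOF) {
		fmt.Println("upload failed: unexpected EOF (sender went away)")
	}
}
```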

"Unable to delete piece" is also very common. We now have test data with a TTL (time to live) that auto-deletes after, say, 30 days. However, the same piece can get deleted even sooner by the normal garbage collection process, so the TTL cleanup then tries to delete a file that's already gone. The spam in the log is annoying but harmless.
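
Conceptually it's just this race. The sketch below is made up (the deleteExpiredPiece helper and the path are invented, not the real collector code), but it shows why "file does not exist" during TTL cleanup is effectively a success:

```go
package main

import (
	"fmt"
	"os"
)

// deleteExpiredPiece is an invented helper, not the real collector API:
// remove a piece whose TTL has passed, and treat "already gone" as fine,
// because garbage collection may have removed it first.
func deleteExpiredPiece(path string) error {
	err := os.Remove(path)
	if os.IsNotExist(err) {
		fmt.Printf("piece %s already deleted (probably by GC), nothing to do\n", path)
		return nil
	}
	return err
}

func main() {
	// Illustrative path only.
	_ = deleteExpiredPiece("/tmp/storage/blobs/example/aa/piece.sj1")
}
```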

I don't know about the untrusted satellite... any chance your node is really old, from back when there were more operational satellites? Leftover history for a satellite that has since been decommissioned would show up as untrusted.

As for the used space not being accurate: common problem, probably a bug. The best fix is to restart your node and let it finish the used-space filewalkers (one for each satellite). It could take days, but that will get you the most accurate estimate.
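
To give a sense of why it takes so long: the used-space filewalker essentially has to stat every piece file on disk and sum the sizes per satellite. A rough Go sketch of that idea (the blobs path and folder layout here are assumptions, not the exact storagenode layout):

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// usedSpace walks every file under blobsDir and sums sizes per top-level
// (satellite) folder. Just a sketch of the idea, not the real filewalker,
// but it shows why the scan is so IO-heavy: every piece file gets stat'ed.
func usedSpace(blobsDir string) (map[string]int64, error) {
	perSatellite := map[string]int64{}
	err := filepath.WalkDir(blobsDir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() {
			return walkErr
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(blobsDir, path)
		if err != nil {
			return err
		}
		satellite := strings.SplitN(rel, string(filepath.Separator), 2)[0]
		perSatellite[satellite] += info.Size()
		return nil
	})
	return perSatellite, err
}

func main() {
	totals, err := usedSpace("/mnt/storj/storage/blobs") // assumed path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for satellite, bytes := range totals {
		fmt.Printf("%s: %.2f TB\n", satellite, float64(bytes)/1e12)
	}
}
```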

Also, if your node is 24TB then you may have trouble, since you're above the maximum recommended node size. It means the garbage collection bloom filters are not going to do a good job of finding all your trash.
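
The reason is bloom filter math: the satellite sends a fixed-size bloom filter of the pieces to keep, and any garbage piece that falsely matches the filter gets skipped and stays on disk. The more pieces you hold, the higher the false-positive rate. Here's a back-of-the-envelope calculator; the 2 MiB filter size and k = 3 hashes are assumed numbers for illustration, not Storj's actual parameters:

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate for a bloom filter with m bits, k hash functions and
// n inserted elements: (1 - e^(-k*n/m))^k.
func falsePositiveRate(mBits, k, n float64) float64 {
	return math.Pow(1-math.Exp(-k*n/mBits), k)
}

func main() {
	const filterBytes = 2 << 20 // assume a 2 MiB filter (made-up number)
	mBits := float64(filterBytes * 8)
	k := 3.0

	for _, pieces := range []float64{5e6, 20e6, 60e6} {
		fpr := falsePositiveRate(mBits, k, pieces)
		fmt.Printf("%.0f million pieces kept -> roughly %.0f%% of garbage survives each GC pass\n",
			pieces/1e6, fpr*100)
	}
}
```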

3 separate 8TB nodes would probably perform better.


Yes, it's very old; it shows May 2019 in my payouts. So other people are beating me to the punch? Did something change? That used to be an info-level event showing the transfer was canceled, if I remember correctly, but my log level is now set to warn to ignore those. How about the collector error? I have hundreds of those too, trying to delete a file that no longer exists, and I didn't touch any files.

This is a disaster waiting to happen. If one disk fails, all data is gone. You're better off with one node per disk.


I'm using enterprise-level HDDs; I trust these drives 3x more than a consumer-grade one.

okay, but still…

Three independent nodes, one per drive, would give you both better Storj-relevant performance (RAID 0 is good for sequential bandwidth, but Storj needs random IOPS) and less damage if you had a hard drive crash. (I almost had a dead drive yesterday and was sweating bullets.)
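
To put a rough number on the "less damage" part, here's a tiny sketch. The 3% annual failure probability per drive is just an illustrative figure, not a spec for your drives:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const p = 0.03     // assumed annual failure probability per drive (illustrative)
	const drives = 3.0

	// RAID 0: if any one of the three drives dies, everything on the array is gone.
	arrayLossChance := 1 - math.Pow(1-p, drives)

	// Independent nodes: each drive still fails with probability p, but a
	// failure only takes out a third of the data, so the expected yearly
	// loss is 3 * p * (1/3) = p of the total.
	expectedLossFraction := p

	fmt.Printf("striped array: %.1f%% chance per year of losing ALL data\n", arrayLossChance*100)
	fmt.Printf("independent nodes: expected loss of about %.1f%% of the data per year\n", expectedLossFraction*100)
}
```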

Plus you're over the current maximum practical size of a Storj node.

Admittedly, you may not be able to practically migrate from the current setup to an independent disk setup. But don’t add more disks to the array!
