Audits failing / ERROR in log

Hi there,

This doesn't seem to be a new issue according to my search, but I could not find a solution yet.

One of my nodes has been reporting “Audits failing” for 4 satellites in the earning.py script for a while now, and it is not getting better over time. A search in the logfile shows “ERROR piecedeleter could not send delete piece to trash” for multiple entries (recurring).

This node reports lower egress compared to my two other nodes, and I'm wondering if I should simply shut it down, since I cannot do a (partial) graceful exit (it is younger than 15 months).

The node runs in a Docker container on TrueNAS SCALE.

Thanks in advance for any hint which might help solve this before I decide to shut down the node.

TMB

The line reading “ERROR piecedeleter could not send delete piece to trash” should include the specific OS error that was encountered when the move was attempted, and the satellite and piece IDs for that piece. As far as I know, the only things that can cause this error are filesystem errors and disk errors.

Consider doing a filesystem check on the volume holding your storage node data.


The more detailed error description is:

{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "VRXQYTG2J4GAZVTJYAAKXFKGO4F5JO4FCKCI7ZZYOYV7BCNJRCNA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:396\n\tstorj.io/storj/storagenode/pieces.(*Deleter).deleteOrTrash:185\n\tstorj.io/storj/storagenode/pieces.(*Deleter).work:135\n\tstorj.io/storj/storagenode/pieces.(*Deleter).Run.func1:72\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

This is a Docker container mounting a ZFS dataset, and a traditional filesystem check is not available for ZFS as far as I know. This reads more like a database inconsistency, which could have occurred due to some downtime when I had to reboot TrueNAS SCALE after an upgrade that took a bit longer than expected.

You’re suffering data loss. It clearly says the file doesn’t exist.
So, are you sure the mounts are right and so on?

The error only affects some data chunks. There are many successful downloads and uploads in the logfile, so the node seems to work, but it is not free of errors and the audits are failing.

It’s unlikely to be a database inconsistency. The only database involved is the satellite’s metadata database, and there would have to be some sort of corruption for it to tell your node to delete a piece that your node never had. (We’ve had no indication of corruption like that.)

Here’s the likely sequence of events that would lead to that error:

  • Some user U uploaded an object O to satellite S
  • S selected your node to hold one of the pieces of O
  • U uploaded the piece to your node
  • Your node produced a signed message indicating it had received the piece in full and stored it successfully
  • S verified that signature and included your node in the list of pieces and their locations for the segments of O
  • User U told S to delete O
  • S looked up the pieces and locations for the segments of O, and found your node
  • S told your node to delete the piece
  • Your node could not find the piece

To be clear, the message “could not send delete piece to trash” does not get back to the satellite, and even if it did, it would not affect your node's reputation. It might be something to worry about, in that it may indicate data loss, but it's not affecting your reputation directly and it doesn't have anything to do with audits.
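
To make that concrete, here is a heavily simplified sketch, not the actual storagenode code; the directory layout, paths, and names are made up for illustration. The point it shows: when the delete request arrives and the piece file is already gone, the node only writes a local log line, and nothing about that failure is reported back to the satellite.

```go
// Heavily simplified sketch of the behaviour described above -- NOT the real
// storj code. When the piece file is already gone, the node only writes a
// local log line; nothing about the failure is sent back to the satellite,
// and audit scoring is an entirely separate process.
package main

import (
	"errors"
	"log"
	"os"
	"path/filepath"
)

// deleteOrTrash moves a piece file into a trash directory instead of
// deleting it outright. Paths and layout here are made up for illustration.
func deleteOrTrash(storageDir, satelliteDir, pieceFile string) {
	src := filepath.Join(storageDir, "blobs", satelliteDir, pieceFile)
	dst := filepath.Join(storageDir, "trash", satelliteDir, pieceFile)

	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		log.Printf("ERROR piecedeleter could not prepare trash dir: %v", err)
		return
	}
	if err := os.Rename(src, dst); err != nil {
		// errors.Is(err, os.ErrNotExist) is the "file does not exist" case
		// from the log: the piece is simply not on disk any more. The error
		// goes to the local log and nowhere else.
		log.Printf("ERROR piecedeleter could not send delete piece to trash: %v (missing on disk: %v)",
			err, errors.Is(err, os.ErrNotExist))
		return
	}
	log.Printf("INFO piecedeleter moved piece to trash: %s", pieceFile)
}

func main() {
	// Made-up example call; the real node derives these names from the
	// satellite ID and piece ID in the delete request.
	deleteOrTrash("/tmp/storagenode-example", "satellite-dir", "piece-file.sj1")
}
```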


You need to search for errors related to GET_AUDIT/GET_REPAIR; all others don't affect the audit score.
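
For example, a rough sketch in Go (assuming the usual log format, where each line ends with a JSON object of fields, as in the error you posted above) that prints only the failed GET_AUDIT/GET_REPAIR downloads could look like this:

```go
// Rough sketch: read a storagenode log on stdin and print only the failed
// downloads whose Action is GET_AUDIT or GET_REPAIR -- the ones that matter
// for the audit score. Assumes each log line ends with a JSON object of
// fields, as in the errors quoted in this thread.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// fields holds the few JSON keys we care about from a log line.
type fields struct {
	Action      string `json:"Action"`
	PieceID     string `json:"Piece ID"`
	SatelliteID string `json:"Satellite ID"`
	Error       string `json:"error"`
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // log lines can be long

	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "download failed") {
			continue
		}
		start := strings.Index(line, "{")
		if start < 0 {
			continue
		}
		var f fields
		if err := json.Unmarshal([]byte(line[start:]), &f); err != nil {
			continue // not a parseable fields object; skip it
		}
		if f.Action == "GET_AUDIT" || f.Action == "GET_REPAIR" {
			fmt.Printf("%-10s sat=%s piece=%s err=%s\n", f.Action, f.SatelliteID, f.PieceID, f.Error)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
	}
}
```

You could pipe the log into it, e.g. `docker logs storagenode 2>&1 | go run auditerrors.go` (the container name and file name here are just examples for your setup).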

By the way, does your ZFS dataset reside on a pool with redundancy (mirror, raidz, raidz2, etc.)?

I pulled another log this morning after a reboot of the machine.

I see 4 Error messages:

1xPUT
2xGET_REPAIR
1xGET

My ZFS dataset resides on a single-disk vdev, since a mirror is discouraged for Storj given the redundancy in its architecture. I installed on ZFS since i) it is the NAS I run and ii) it makes it possible to transfer a node to another drive via a snapshot if needed, which is definitely faster than rsync.

I had 4 spare 1TB WD RED 2.5" drives which I added to my HP DL380 Gen9, one node per drive, adding them step by step after the first node was full. So if I now trash this specific node, only 850GB of Storj data is lost (the node is 6 months old and fully vetted).

Maybe I'll shut down Storj completely, since I currently feel that I'm wasting time maintaining the nodes. The cash return of this specific small setup of net 3.4 TB isn't material, and the server can't easily be expanded with additional or higher-capacity drives.

Could you please show them?

2023-06-14T09:32:02.404473999Z 2023-06-14T09:32:02.404Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "YUS7PYCCDXJKAGVNK75FZT6ASPQEJCJUNA552JJFQTIDGB5U6VAQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 0, "Remote Address": "5.161.74.111:33874", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}

2023-06-14T09:37:18.009051056Z 2023-06-14T09:37:18.008Z ERROR piecestore download failed {"process": "storagenode", "Piece ID": "SPYAEJCEQJSWO6YGE3WK6Q33LINDSPSP2LRI2TRRGMFDTRTSLWZA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "Offset": 0, "Size": 0, "Remote Address": "5.161.44.25:59820", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:655\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:251\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35"}

One way or another, you lost some files. Any idea how this happened? No other errors in the log?

I don't know what filesystem it's using, but the first priority is checking/scrubbing the filesystem. If your audit score is really low, or there's a big discrepancy (>5%) between the node size according to Storj and the real disk usage, you should consider restarting your node.

So it’s partial data loss. No idea how much…?

Since it's without redundancy, it's not better than ext4.
A scrub is unlikely to help much in this case, but at least it will check the integrity.

Thanks, all, for your thoughts and recommendations.

Finally, this leads me to the conclusion that the data loss probably happened because I made a couple of mistakes when I installed my first node and moved it around on my systems. There seems to be no way to fix this, and therefore I decided to shut down this specific node. The other nodes on my system seem to be OK.

As a side effect, I can now use the spare disk for my own storage needs and also free up memory.
