This doesn’t seem to be a new issue according to my search, but I could not find a solution yet.
For a while now, one of my nodes has been reporting “Audits failing” for 4 satellites in the earning.py script, and it is not getting better over time. A search in the logfile shows “ERROR piecedeleter could not send delete piece to trash” for multiple entries (recurring).
This node reports lower egress compared to my two other nodes, and I’m wondering if I should simply shut it down, since I cannot do a (partial) graceful exit (the node is younger than 15 months).
The node runs in a Docker container on TrueNAS SCALE.
Thanks in advance for any hint that might help solve this before I decide to shut down the node.
The line reading “ERROR piecedeleter could not send delete piece to trash” should include the specific OS error that was encountered when the move was attempted, along with the satellite and piece IDs for the affected piece. As far as I know, the only things that can cause this error are filesystem errors and disk errors.
Consider doing a filesystem check on the volume holding your storage node data.
{"process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "VRXQYTG2J4GAZVTJYAAKXFKGO4F5JO4FCKCI7ZZYOYV7BCNJRCNA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:110\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:245\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Trash:290\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:396\n\tstorj.io/storj/storagenode/pieces.(*Deleter).deleteOrTrash:185\n\tstorj.io/storj/storagenode/pieces.(*Deleter).work:135\n\tstorj.io/storj/storagenode/pieces.(*Deleter).Run.func1:72\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
This is a Docker container mounting a ZFS dataset; a filesystem check is not available as far as I know. This reads more like a database inconsistency, which could have occurred due to some downtime when I had to reboot TrueNAS SCALE after an upgrade that took a bit longer.
The error affects only some pieces. There are many successful downloads and uploads in the logfile, so the node seems to work, but it is not free of errors and audits are failing.
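For what it’s worth, here is a rough sketch of how one could tally these errors per satellite from a saved copy of the log. It assumes the container log was dumped to a file called node.log (e.g. via docker logs) and that each matching line ends with a JSON payload like the one above; both the filename and that layout are assumptions, not an official tool.

```go
// Sketch: count "could not send delete piece to trash" errors per satellite
// from a dumped storagenode log. Assumes the log was saved to ./node.log and
// that matching lines end with a JSON object like the entry quoted above.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("node.log") // assumed path to the dumped container log
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long lines (errorVerbose)
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "could not send delete piece to trash") {
			continue
		}
		// The JSON payload starts at the first '{' on the line.
		if i := strings.Index(line, "{"); i >= 0 {
			var fields map[string]any
			if json.Unmarshal([]byte(line[i:]), &fields) == nil {
				if sat, ok := fields["Satellite ID"].(string); ok {
					counts[sat]++
				}
			}
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
	for sat, n := range counts {
		fmt.Printf("%s: %d pieces could not be trashed\n", sat, n)
	}
}
```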
It’s unlikely to be a database inconsistency. The only database involved is the satellite’s metadata database, and there would have to be some sort of corruption for it to tell your node to delete a piece that your node never had. (We’ve had no indication of corruption like that.)
Here’s the likely sequence of events that would lead to that error:
Some user U uploaded an object O to satellite S
S selected your node to hold one of the pieces of O
U uploaded the piece to your node
Your node produced a signed message indicating it had received the piece in full and stored it successfully
S verified that signature and included your node in the list of pieces and their locations for the segments of O
User U told S to delete O
S looked up the pieces and locations for the segments of O, and found your node
S told your node to delete the piece
Your node could not find the piece
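To make that last step concrete, here is a simplified, illustrative sketch (not the actual storj.io/storj code) of the Trash path visible in the stack trace above: before moving a piece to trash, the deleter stats the blob on disk, and if the file is gone that Stat fails with the “file does not exist” error you’re seeing. The directory layout and identifiers below are placeholders.

```go
// Illustrative only: why a delete request for a piece the node no longer has
// surfaces as "file does not exist" before the move to trash is attempted.
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// trashPiece sketches the Trash step: stat the piece file, then move it into
// a trash directory. The blobs/ and trash/ layout and the IDs are placeholders.
func trashPiece(storageDir, satelliteID, pieceID string) error {
	src := filepath.Join(storageDir, "blobs", satelliteID, pieceID)
	if _, err := os.Stat(src); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			// The piece was lost (or never written), so the move cannot happen.
			return fmt.Errorf("pieces error: filestore error: file does not exist")
		}
		return err // e.g. an I/O error from a failing disk
	}
	dst := filepath.Join(storageDir, "trash", satelliteID, pieceID)
	if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
		return err
	}
	return os.Rename(src, dst)
}

func main() {
	// The piece file does not exist, so Stat fails and the deleter would log
	// "could not send delete piece to trash".
	err := trashPiece(os.TempDir(), "satellite-id", "piece-id")
	fmt.Println(err)
}
```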
To be clear, the message “could not send delete piece to trash” does not get back to the satellite, and even if it did, it does not affect your node’s reputation. It might be something to worry about, in that it may indicate data loss, but it’s not affecting your reputation directly and it doesn’t have anything to do with audits.
I pulled another log this morning after a reboot of the machine.
I see 4 Error messages:
1xPUT
2xGET_REPAIR
1xGET
My ZFS dataset resides on a single-disk vdev, since mirroring is discouraged for Storj given the redundancy in its architecture. I installed on ZFS because i) it is what my NAS runs, and ii) it makes it easy to transfer a node to another drive via a snapshot if needed, which is definitely faster than rsync.
I had 4 spare 1 TB WD Red 2.5-inch drives which I added to my HP DL380 Gen9, one node per drive, adding them step by step after the first node was full. So if I now trash this specific node, only 850 GB of Storj data is lost (the node is 6 months old and fully vetted).
Maybe I will shut down Storj completely, since I currently feel that I am wasting time maintaining the nodes. The cash return of this small setup of net 3.4 TB is not material, and the server cannot easily be expanded with additional or higher-capacity drives.
One way or another, you lost some files. Any idea how this happened? Are there no other errors in the log?
I don’t know what filesystem it’s using, but the first priority is checking/scrubbing the filesystem. If you’re really low on audits, or there’s a big discrepancy (>5%) between the node size according to Storj and the real disk usage, you should consider restarting your node.
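As a rough way to get the “real disk usage” side of that comparison, one could sum the on-disk size of the blobs directory and compare it with what the dashboard reports. The path and the reported value below are placeholders you would substitute yourself; this is only a sketch, not a supported tool.

```go
// Sketch: measure on-disk usage of the blobs directory and compute the
// percentage discrepancy against the usage reported by the node dashboard.
package main

import (
	"fmt"
	"io/fs"
	"math"
	"path/filepath"
)

func main() {
	storageDir := "/mnt/storagenode/storage/blobs" // placeholder path
	var reportedBytes float64 = 850e9              // placeholder: usage shown by the dashboard

	var onDisk int64
	err := filepath.WalkDir(storageDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			info, err := d.Info()
			if err != nil {
				return err
			}
			onDisk += info.Size()
		}
		return nil
	})
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}

	diff := math.Abs(float64(onDisk)-reportedBytes) / reportedBytes * 100
	fmt.Printf("on disk: %d bytes, reported: %.0f bytes, discrepancy: %.1f%%\n", onDisk, reportedBytes, diff)
}
```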
Finally, this leads me to the conclusion that the data loss probably happened because I made a couple of mistakes when I installed my first node and moved it around between my systems. There seems to be no way to fix this, so I decided to shut down this specific node. The other nodes on my system seem to be fine.
As a side effect, I can now use the spare disk for my own storage needs and also free up memory.