There are some valid concerns here about how garbage collection should function. Perhaps the conversation should move to a technical post on the topic. While this topic does touch on part of the payout process, it may serve its purpose better if it were kept separate from the other questions regarding payouts.
I'd definitely be interested in running your tests, as long as the test scripts don't risk node disqualification.
I have a feeling some people may be significantly overestimating the impact of this trash folder.
Let's make one thing clear first. Garbage collection has moved data to the trash folder that was already unpaid to begin with. See my posts on payout feedback for this discrepancy. So garbage collection is the solution to this problem. Yes, the actual deletion of these files happens 7 days later, so let's call it a delayed solution.
The data in the trash folder has accumulated over the lifetime of your node, so the size you see right now is much bigger than what you will see once garbage collection runs continually from now on.
In the past there were bugs that caused more garbage data to stay behind on your node. These bugs are for the most part long since solved, but the resulting data is only being cleaned up now. I'm not suggesting there can't still be bugs causing this, but if there are, the priority should be fixing those bugs, not worrying about the 7-day delay. And as @littleskunk already suggested, we can be a part of pointing out those bugs, whether that's reporting back on findings or diving into the code and finding and fixing possible issues.
The remaining way to end up with garbage data would be if your node was offline when a delete was sent. That's your node failing to hold up its end of the bargain, and I would say it's reasonable that that node then bears the cost of ensuring the wrong data doesn't get deleted by accident.
So…
- Unpaid garbage data is now being cleaned up (sounds like a good thing)
- There is a 7-day delay in order to ensure a mistake in the code doesn't accidentally wipe the entire network (this is what we're arguing about, but keep the following in mind)
- Prior bugs may have caused more garbage data in the past, but won't do so in the future (limits the impact)
- With these bugs solved, the only remaining cause of garbage data is the storage node's own fault for being offline (the SNO is to blame for the remaining impact)
So let's get this argument back on the rails. First off, we have a lot of old mess to clean up. This kind of comes with the territory when participating in betas. But the expectation is that if your node is online 24/7, the size of this trash folder drops to 0 or very close to it. If that happens, there is no more argument here, because nobody would care anymore. If there is still data in there in about a week's time, perhaps we can argue about that then. There could be two causes: remaining bugs or offline nodes. In the first case, let's try to be constructive and give as much information as we can find about those bugs so the good devs can squash them. If it's the second case… I don't know what to tell you, it seems you kind of brought it on yourself by being offline. So let's see how big of a deal this thing really is before claiming it's a massive problem.
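If you want to check how big a deal it actually is on your own node, something like this should give an idea (a rough sketch, assuming the default layout where the trash/ folder sits next to blobs/ in your storage directory; adjust the path for your setup):
# total size of the trash folder
du -sh trash/
# per-satellite breakdown (each subfolder of trash/ belongs to one satellite)
du -sh trash/*/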
My trash folder has dropped to 0 bytes as I was already expecting. (Though I did notice the empty folders stay behind)
I think we can all retract claws.
I'm sure everything will work just fine…
However, my trash folder still has 97GB of trash. Based on the 7-day trash empty time, I should see stuff disappear in the next 24 hours…
A little one-liner for checking 14 days of trash…
for i in {1..14}; do c=$(find trash/ -mindepth 1 -mtime -$i |wc -l); echo "Day $i: $c"; done
Like most things I write for my own purposes, it's not exactly pretty. The 8th day is really the 7th day based on the way find calculates the mtime option.
Here's my output for my node:
Day 1: 0
Day 2: 0
Day 3: 0
Day 4: 988
Day 5: 988
Day 6: 988
Day 7: 25066
Day 8: 61032
Day 9: 61032
Day 10: 61032
Day 11: 61033
Day 12: 61033
Day 13: 61033
Day 14: 61033
Based on this output, there have been no recent additions to my trash folder and the oldest files should be purged within 24 hours of this post. The total number of files is, of course, the largest number, which includes all newer files as well. So I expect 61032 - 25066 = 35966 files to be removed in the next 24 hours.
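If you just want a quick check of whether anything is already overdue for purging, something along these lines should do it (same assumptions as the one-liner above; note that find counts whole 24-hour periods, so -mtime +6 matches entries that are at least 7 full days old):
# count entries in trash/ that are past the 7-day retention window
find trash/ -mindepth 1 -mtime +6 | wc -l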
If you add -type f to find, it won't count directories.
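For reference, the adjusted one-liner would then look something like this (untested sketch, same assumptions as the original):
for i in {1..14}; do c=$(find trash/ -mindepth 1 -type f -mtime -$i | wc -l); echo "Day $i: $c"; done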
Thanks! Nice catch… 22 dots…
My node just finished collecting 30GB of garbage again. I'm curious where this garbage came from though; the node has never been offline since the last time garbage was collected, apart from updates which take only a second or two. It can't have missed 30GB worth of deletes. Was this just left over from last time, or did it actually accumulate in the meantime? (The latter would worry me a little.)
In your logfile you should see which satellite it was. Keep in mind that bloom filters have a false positive rate. On the first run the storage node will not catch all the garbage pieces. On the second run the bloom filter starts with a different random seed. It might or might not catch the pieces it missed on the first run. We have to wait for a few executions before we can see what the baseline will be.
Ahh, that would explain it. It was the stefan-benten satellite. I guess that one had a lot to clean up. Last time it was almost 20x this amount.
I knew I should have read up more about these bloom filters. Wasn't aware of the false positive rate, but this is clearly intended behavior.
If anyone else wants to read up:
I would think this would only become an issue if, say, garbage collection holds onto 500GB to 1TB of data for 7 days, as that would cause the node operator to lose income if their node were full because of the garbage collector. Maybe there should be a way to be paid for holding a certain amount of trash data. If the garbage collector holds a lot of data for 7 days and then finds more data, that clogs your node for even longer. It would be like your storage node is full but not really full, so it prevents your node from getting any more data. This is the worst case, but even if it were 250GB or 300GB of data, it would still prevent you from getting any new data, which prevents you from making any income running the node.
That will never happen again after this big batch is cleaned up. Because of the false positives in the bloom filter it takes a few passes to clean everything up, but the remaining data gets a LOT smaller with every iteration. Give it a month and see where we are at that point. If this thing is running as it should, I doubt we'll ever see big cleanups like this again unless your node has been offline for a significant amount of time.
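To give a feel for how quickly that shrinks, here's a back-of-the-envelope sketch (the 10% false positive rate and 100GB starting point are purely illustrative assumptions, not the actual values the satellites use): if a fraction p of the garbage survives each run, roughly p^k of it is left after k runs.
# illustrative only: 100 GB of garbage, assuming 10% of it survives each GC run
awk 'BEGIN { left = 100; for (run = 1; run <= 4; run++) { left *= 0.10; printf "after run %d: %.1f GB left\n", run, left } }'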
And in production, this would cause disqualification anyway, so it's only even a potential concern during the beta.
Put another way, having your node offline is shooting yourself in the foot, but the effect is different during beta and production. In beta, you'll miss deletes and the GC will have to clean up later. In production, you'll just get kicked out of the network.
I magically have 450GB of extra space free after a long garbage collection process.
I'm not entirely sure this is related, but I noticed in my log a delete that failed with a context canceled and pieceinfodb error. I assume this one will be caught by the next garbage collection cycle. I just wonder why this would happen. The error kind of lacks detail.
2020-01-03T15:51:17.880Z ERROR piecestore delete failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "AASZJ4S6UEURWI4DXVQECFS6QW7OPLGUB7OMOZOUYQHP5ET3QHOA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:161\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
Edit: I have 177 of these since Jan 1st. I had 24 in all of December. But I'm repairing my RAID array right now, so additional load may play a role in the higher numbers.
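If anyone else wants to see how their errors are spread over time, something along these lines should work (a sketch; it assumes your log lines start with the timestamp, as in the samples here, and that your logs are redirected to files in the current directory; with docker you could pipe docker logs <container> 2>&1 into the same filters instead):
# count the "delete failed" / "context canceled" errors per day
# (-h drops the filename prefix so the date stays in the first 10 characters)
grep -h "delete failed" *.log | grep "context canceled" | cut -c1-10 | sort | uniq -c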
Also got a bunch of those failed deletes, from all four satellites, and just after the piece has been uploaded:
root@server030:/disk002/logs/storagenode# grep "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA" *.log
2020-01-03T08:09:33.298+0100 INFO piecestore upload started {"Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-01-03T08:09:33.784+0100 INFO piecestore uploaded {"Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-01-03T08:09:51.246+0100 ERROR piecestore delete failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
root@server030:/disk002/logs/storagenode# grep "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA" *.log
2020-01-03T07:53:32.197+0100 INFO piecestore upload started {"Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT"}
2020-01-03T07:53:38.736+0100 INFO piecestore uploaded {"Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT"}
2020-01-03T07:54:27.680+0100 ERROR piecestore delete failed {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
root@server030:/disk002/logs/storagenode# grep "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA" *.log
2020-01-02T06:43:07.909+0100 INFO piecestore upload started {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT"}
2020-01-02T06:43:08.128+0100 INFO piecestore uploaded {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT"}
2020-01-02T06:43:09.562+0100 INFO piecestore download started {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2020-01-02T06:43:10.006+0100 INFO piecestore downloaded {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2020-01-02T06:43:11.531+0100 ERROR piecestore delete failed {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
root@server030:/disk002/logs/storagenode# grep "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA" *.log
2020-01-03T07:25:56.521+0100 INFO piecestore upload started {"Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2020-01-03T07:26:01.630+0100 INFO piecestore uploaded {"Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2020-01-03T07:26:45.852+0100 ERROR piecestore delete failed {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
88 delete errors since 2020-01-01, running the latest version and no RAID is rebuilding:
root@server030:/disk002/logs/storagenode# grep "delete failed" *.log | grep -c 'pieces error: context canceled; v0pieceinfodb error'
88
root@server030:/disk002/logs/storagenode# head -q *.log | grep "running on version" | sort | head -n1
2020-01-01T18:34:36.126+0100 INFO version running on version v0.28.4
I got 115 of those errors too since 2019-12-18; not sure when the first one occurred.
I got this too:
ERROR piecestore delete failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "DXGWUFUJYNENLF4AES66GGZD2NQWY4XAJS3GREWJMPXTBIAFN7UQ", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:161\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
Context canceled means my node was not among the fastest set of nodes, right?
Is this applicable to delete messages?
Thanks!
Logically speaking, it is not. With uploads and downloads, the action is intentionally over-provisioned so that if a few transfers fail or are very slow, it doesn't impact the transfer as a whole. For deletes this really doesn't apply, as the data should be deleted from all nodes. Now I'm sure garbage collection will clean up these pieces eventually, but that's a fallback and doesn't mean the initial problem doesn't need to be solved.
Yeah… that is exactly how I thought it should work.
So I am surprised by this message.
Deletions should be network-wide.
I know a put can fail now and then; that is not such a big deal.
Is there anything I can do to help with this? Logs?
Thanks!