Testing Garbage Collector

There are some valid concerns here about how garbage collection should best function. Perhaps the conversation should move to a technical post on the topic. While this topic does touch on part of the payout process, it may serve its purpose better if it were kept separate from the other questions regarding payouts.

I'd definitely be interested in running your tests, as long as the test scripts don't risk node disqualification.

I have a feeling some people may be significantly overestimating the impact of this trash folder.

Let's make one thing clear first. Garbage collection has moved data to the trash folder that was already unpaid to begin with. See my posts on payout feedback for this discrepancy. So garbage collection is the solution to this problem. Yes, the actual deletion of these files happens 7 days later, so let's call it a delayed solution.

The data in the trash folder has accumulated throughout the lifetime of your node. So the size you see right now is much bigger than what you will see once garbage collection runs regularly from now on.

In the past there were bugs that caused more garbage data to stay behind on your node. These bugs are for the most part long since solved, but the resulting data is only being cleaned up now. I'm not suggesting there can't still be bugs causing this, but if there are, the priority should be fixing those bugs, not worrying about the 7-day delay. And as @littleskunk already suggested, we can be a part of pointing out those bugs, whether it's reporting back on findings or diving into the code and finding and fixing possible issues.

The remaining way of collecting garbage data would be if your node was offline when a delete was sent. That's your node failing to hold up its end of the bargain, and I would say it's reasonable that the node then bears the cost of ensuring the wrong data doesn't get deleted by accident.

So…

  • Unpaid garbage data is now being cleaned up (sounds like a good thing)
  • There is a 7-day delay in order to ensure a mistake in the code doesn't accidentally wipe the entire network (this is what we're arguing about, but keep the following in mind)
  • Prior bugs may have caused more garbage data in the past, but won't do so in the future (limits the impact)
  • With these bugs solved, the only remaining cause of garbage data is the storage node's own fault for being offline (the SNO is to blame for the remaining impact)

So let's get this argument back on the rails. First off, we have a lot of old mess to clean up. This kind of comes with the territory when participating in betas. But the expectation is that if your node is online 24/7, the size of this trash folder drops to 0 or very close to it. If that happens, there is no more argument here, because nobody would care anymore. If there is still data in there in about a week's time, perhaps we can argue about that then. There could be two causes: remaining bugs or offline nodes. In the first case, let's try to be constructive and give as much information as we can find about those bugs so the good devs can squash them. If it's the second case… I don't know what to tell you; it seems you kind of brought it on yourself by being offline. So let's see how big of a deal this thing really is before claiming it's a massive problem.
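
For anyone who wants to check the size for themselves, a single du call does it (the path below is just an example; point it at your own storage location):

du -sh /path/to/storagenode/storage/trash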

9 Likes

My trash folder has dropped to 0 bytes, as I was already expecting. (Though I did notice that the empty folders stay behind.)
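
In case anyone wants to check for the same leftovers, this lists the empty directories without touching anything (run from the storage directory):

find trash/ -mindepth 1 -type d -empty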

I think we can all retract claws.

1 Like

I'm sure everything will work just fine…

However, my trash folder still has 97GB of trash. Based on the 7-day trash empty time, I should see stuff disappear in the next 24 hours…

A little one-liner for checking 14 days of trash…

for i in {1..14}; do c=$(find trash/ -mindepth 1 -mtime -$i |wc -l); echo "Day $i: $c"; done

Like most things I write for my own purposes, it's not exactly pretty. The 8th day is really the 7th day based on the way find calculates the -mtime option.

Here's the output for my node:

Day 1: 0
Day 2: 0
Day 3: 0
Day 4: 988
Day 5: 988
Day 6: 988
Day 7: 25066
Day 8: 61032
Day 9: 61032 
Day 10: 61032
Day 11: 61033
Day 12: 61033  
Day 13: 61033
Day 14: 61033

Based on this output, there have been no recent additions to my trash folder, and the oldest files should be purged within 24 hours of this post. The total number of files is, of course, the largest number, which includes all newer files as well. So I expect 61032 - 25066 = 35966 files to be removed in the next 24 hours.
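
As a follow-up check, this counts everything older than the 7-day window, so it should drop to (or close to) 0 once the purge has run; directories and the same off-by-one in how find rounds mtime may leave a few entries behind:

find trash/ -mindepth 1 -mtime +7 | wc -l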

If you add -type f to find, it won't count directories.
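
For reference, the same one-liner with that tweak folded in (still assuming it's run from the storage directory):

for i in {1..14}; do c=$(find trash/ -mindepth 1 -type f -mtime -$i | wc -l); echo "Day $i: $c"; done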

1 Like

Thanks! Nice catch… 22 dots…

My node is just done collecting 30GB of garbage again. I'm curious where this garbage came from though; the node has never been offline since the last time garbage was collected, apart from updates which take only a second or two. It can't have missed 30GB worth of deletes. Was this just left over from last time, or did it actually accumulate in the meantime? (The latter would worry me a little.)

In your logfile you should see which satellite it was. Keep in mind that bloom filters have a false positive rate. On the first run the storage node will not catch all the garbage pieces. On the second run the bloom filter starts with a different random seed. It might or might not catch the pieces it missed on the first run. We have to wait for a few executions before we can see what the baseline will be.
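
The garbage collection runs should show up under the retain component in the log, so something along these lines ought to list them, satellite ID included (log path and container name are just examples, adjust to your setup):

grep -i "retain" /path/to/storagenode.log
# or, when running in docker:
docker logs storagenode 2>&1 | grep -i "retain"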

2 Likes

Ahh, that would explain it. It was the stefan-benten satellite. I guess that one had a lot to clean up. Last time it was almost 20x this amount.

I knew I should have read up more about these bloom filters. I wasn't aware of the false positive rate, but this is clearly intended behavior.
If anyone else wants to read up:

2 Likes

I would think this would only become an issue if, say, the garbage collector holds onto 500GB to 1TB of data for 7 days, as that would cause the node operator to lose income if their node were full because of the garbage collector. Maybe there should be a way to be paid for holding a certain amount of that data. The garbage collector could also hold a lot of data for 7 days and then find more, which clogs your node for even longer. It would be like your storage node is full but not really full, so it prevents your node from getting any more data. This is the worst case, but even 250GB or 300GB of data would still prevent you from getting any new data, which prevents you from making any income running the node.

That will never happen again after this big batch is cleaned up. Because of the false positives in the bloom filter it takes a few passes to clean everything, but the remaining data gets a LOT smaller with every iteration. Give it a month and see where we are at that point. If this thing is running as it should, I doubt we'll ever see big cleanups like this again unless your node has been offline for a significant amount of time.
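
If you want to keep an eye on it over that month, something like this appended daily (via cron or otherwise, run from the storage directory) makes the trend easy to see:

echo "$(date +%F) $(du -sh trash/ | cut -f1)" >> trash-size.log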

2 Likes

And in production this would cause disqualification anyway, so it's only a potential concern during the beta.

Put another way, having your node offline is shooting yourself in the foot, but the effect is different during beta and production. In beta, you'll miss deletes and the GC will have to clean up later. In production, you'll just get kicked out of the network.

I magically have an extra 450 gigs free after a long garbage collection process.

I'm not entirely sure this is related, but I noticed a delete in my log that failed with context canceled and a pieceinfodb error. I assume this one will be caught by the next garbage collection cycle. I just wonder why this would happen; the error kind of lacks detail.

2020-01-03T15:51:17.880Z ERROR piecestore delete failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "AASZJ4S6UEURWI4DXVQECFS6QW7OPLGUB7OMOZOUYQHP5ET3QHOA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:161\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Edit: I have 177 of these since Jan 1st. I had 24 in all of December. But I'm repairing my RAID array right now, so the additional load may play a role in the higher numbers.

Also got a bunch of those delete failed errors, from all four satellites, and just after the piece was uploaded:

root@server030:/disk002/logs/storagenode# grep "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA" *.log     
2020-01-03T08:09:33.298+0100        INFO    piecestore      upload started  {"Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-01-03T08:09:33.784+0100        INFO    piecestore      uploaded        {"Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-01-03T08:09:51.246+0100        ERROR   piecestore      delete failed   {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

root@server030:/disk002/logs/storagenode# grep "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA" *.log
2020-01-03T07:53:32.197+0100        INFO    piecestore      upload started  {"Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT"}
2020-01-03T07:53:38.736+0100        INFO    piecestore      uploaded        {"Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT"}
2020-01-03T07:54:27.680+0100        ERROR   piecestore      delete failed   {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

root@server030:/disk002/logs/storagenode# grep "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA" *.log
2020-01-02T06:43:07.909+0100        INFO    piecestore      upload started  {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT"}
2020-01-02T06:43:08.128+0100        INFO    piecestore      uploaded        {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT"}
2020-01-02T06:43:09.562+0100        INFO    piecestore      download started        {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2020-01-02T06:43:10.006+0100        INFO    piecestore      downloaded      {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2020-01-02T06:43:11.531+0100        ERROR   piecestore      delete failed   {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

root@server030:/disk002/logs/storagenode# grep "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA" *.log
2020-01-03T07:25:56.521+0100        INFO    piecestore      upload started  {"Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2020-01-03T07:26:01.630+0100        INFO    piecestore      uploaded        {"Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2020-01-03T07:26:45.852+0100        ERROR   piecestore      delete failed   {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

88 delete errors since 01-01-2020, running the latest version and no RAID rebuilding :wink: :

root@server030:/disk002/logs/storagenode# grep "delete failed" *.log | grep -c 'pieces error: context canceled; v0pieceinfodb error'
88
root@server030:/disk002/logs/storagenode# head -q  *.log | grep "running on version" | sort | head -n1
2020-01-01T18:34:36.126+0100    INFO    version running on version v0.28.4

I got 115 of those errors too since 2019-12-18, not sure when the first one occurred.
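
If the logs reach back far enough, the first occurrence is easy to pull out, since the timestamps at the start of each line sort cleanly (assuming rotated .log files like the ones above):

grep -h "v0pieceinfodb error: context canceled" *.log | sort | head -n1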

I got this too:

ERROR   piecestore      delete failed   {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "DXGWUFUJYNENLF4AES66GGZD2NQWY4XAJS3GREWJMPXTBIAFN7UQ", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:161\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Context canceled means my node was not among the fastest set of nodes, right?
Is this applicable to delete messages?

Thanks!

Logically speaking, it is not. With uploads and downloads, the action is intentionally over-provisioned so that if a few transfers fail or are very slow, it doesn't impact the transfer as a whole. For deletes this really doesn't apply, as the data should be deleted from all nodes. Now I'm sure garbage collection will clean up these pieces eventually, but that's a fallback and doesn't mean the initial problem doesn't need to be solved.

Yeah… that is exactly how I thought it should work.
So I am surprised by this message.

Deletions should be network-wide.

I know a PUT can fail now and then; that is not such a big deal.

Is there anything I can do to help with this? Logs?

Thanks!