Testing Garbage Collector

There are some valid concerns here about how garbage collection should best function. Perhaps the conversation should move to a technical post on the topic. While this topic does touch on part of the payout process, it may serve its purpose better if it were kept separate from the other questions regarding payouts.

I'd definitely be interested in running your tests, as long as the test scripts don't risk node disqualification.

I have a feeling some people may be significantly overestimating the impact of this trash folder.

Let's make one thing clear first. Garbage collection has moved data to the trash folder that was already unpaid to begin with. See my posts on payout feedback for this discrepancy. So garbage collection is the solution to this problem. Yes, the actual deletion of these files happens 7 days later, so let's call it a delayed solution.

The data in the trash folder has accumulated throughout the lifetime of your node. So the size you see right now is much bigger than what you will see once garbage collection runs regularly from now on.

In the past there were bugs that caused more garbage data to stay behind on your node. These bugs are for the most part long since solved, but the resulting data is only being cleaned up now. I'm not suggesting there can't still be bugs causing this, but if there are, the priority should be fixing those bugs, not worrying about the 7-day delay. And as @littleskunk already suggested, we can be a part of pointing out those bugs, whether it's reporting back on findings or diving into the code and finding and fixing possible issues.

The remaining way of collecting garbage data would be if your node was offline when a delete was sent. That's your node failing to hold up its end of the bargain, and I would say it's reasonable that the node then bears the cost of ensuring the wrong data doesn't get deleted by accident.

So…

  • Unpaid garbage data is now being cleaned up (sounds like a good thing)
  • There is a 7-day delay in order to ensure a mistake in the code doesn't accidentally wipe the entire network (this is what we're arguing about, but keep the following in mind)
  • Prior bugs may have caused more garbage data in the past, but won't do so in the future (limits the impact)
  • With these bugs solved, the only remaining cause of garbage data is the storage node's own fault for being offline (the SNO is to blame for the remaining impact)

So let's get this argument back on the rails. First off, we have a lot of old mess to clean up. This kind of comes with the territory when participating in betas. But the expectation is that if your node is online 24/7, the size of this trash folder drops to 0 or very close to it. If that happens, there is no more argument here, because nobody would care anymore. If there is still data in there in about a week's time, perhaps we can argue about that then. There could be two causes: remaining bugs or offline nodes. In the first case, let's try to be constructive and give as much information as we can find about those bugs so the good devs can squash them. If it's the second case… I don't know what to tell you; it seems you kind of brought it on yourself by being offline. So let's see how big of a deal this thing really is before claiming it's a massive problem.
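
For anyone who wants to check the size for themselves, a single du call does it (the path below is just an example; point it at your own storage location):

du -sh /path/to/storagenode/storage/trash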

9 Likes

My trash folder has dropped to 0 bytes, as I was already expecting. (Though I did notice that the empty folders stay behind.)
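
In case anyone wants to check for the same leftovers, this lists the empty directories without touching anything (run from the storage directory):

find trash/ -mindepth 1 -type d -empty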

I think we can all retract claws.

1 Like

I'm sure everything will work just fine…

However, my trash folder still has 97GB of trash. Based on the 7-day trash empty time, I should see stuff disappear in the next 24 hours…

A little one-liner for checking 14 days of trash…

for i in {1..14}; do c=$(find trash/ -mindepth 1 -mtime -$i |wc -l); echo "Day $i: $c"; done

Like most things I write for my own purposes, it's not exactly pretty. The 8th day is really the 7th day based on the way find calculates the -mtime option.

Here's the output for my node:

Day 1: 0
Day 2: 0
Day 3: 0
Day 4: 988
Day 5: 988
Day 6: 988
Day 7: 25066
Day 8: 61032
Day 9: 61032 
Day 10: 61032
Day 11: 61033
Day 12: 61033  
Day 13: 61033
Day 14: 61033

Based on this output, there have been no recent additions to my trash folder, and the oldest files should be purged within 24 hours of this post. The total number of files is, of course, the largest number, which includes all newer files as well. So I expect 61032 - 25066 = 35966 files to be removed in the next 24 hours.
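
As a follow-up check, this counts everything older than the 7-day window, so it should drop to (or close to) 0 once the purge has run; directories and the same off-by-one in how find rounds mtime may leave a few entries behind:

find trash/ -mindepth 1 -mtime +7 | wc -l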

If you add -type f to find, it won't count directories.
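
For reference, the same one-liner with that tweak folded in (still assuming it's run from the storage directory):

for i in {1..14}; do c=$(find trash/ -mindepth 1 -type f -mtime -$i | wc -l); echo "Day $i: $c"; done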

1 Like

Thanks! Nice catch… 22 dots…

My node is just done collecting 30GB of garbage again. I'm curious where this garbage came from though; the node has never been offline since the last time garbage was collected, apart from updates which take only a second or two. It can't have missed 30GB worth of deletes. Was this just left over from last time, or did it actually accumulate in the meantime? (The latter would worry me a little.)

In your logfile you should see which satellite it was. Keep in mind that bloom filters have a false positive rate. On the first run the storage node will not catch all the garbage pieces. On the second run the bloom filter starts with a different random seed. It might or might not catch the pieces it missed on the first run. We have to wait for a few executions before we can see what the baseline will be.
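
The garbage collection runs should show up under the retain component in the log, so something along these lines ought to list them, satellite ID included (log path and container name are just examples, adjust to your setup):

grep -i "retain" /path/to/storagenode.log
# or, when running in docker:
docker logs storagenode 2>&1 | grep -i "retain"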

2 Likes

Ahh, that would explain it. It was the stefan-benten satellite. I guess that one had a lot to clean up. Last time it was almost 20x this amount.

I knew I should have read up more about these bloom filters. I wasn't aware of the false positive rate, but this is clearly intended behavior.
If anyone else wants to read up:

2 Likes

I would think this would only become an issue if, say, the garbage collector holds onto 500GB to 1TB of data for 7 days, as that would cause the node operator to lose income if their node were full because of the garbage collector. Maybe there should be a way to be paid for holding a certain amount of that data. The garbage collector could also hold a lot of data for 7 days and then find more, which clogs your node for even longer. It would be like your storage node is full but not really full, so it prevents your node from getting any more data. This is the worst case, but even 250GB or 300GB of data would still prevent you from getting any new data, which prevents you from making any income running the node.

That will never happen again after this big batch is cleaned up. Because of the false positives in the bloom filter it takes a few passes to clean everything, but the remaining data gets a LOT smaller with every iteration. Give it a month and see where we are at that point. If this thing is running as it should, I doubt we'll ever see big cleanups like this again unless your node has been offline for a significant amount of time.
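
If you want to keep an eye on it over that month, something like this appended daily (via cron or otherwise, run from the storage directory) makes the trend easy to see:

echo "$(date +%F) $(du -sh trash/ | cut -f1)" >> trash-size.log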

2 Likes

And in production this would cause disqualification anyway, so it's only a potential concern during the beta.

Put another way, having your node offline is shooting yourself in the foot, but the effect is different during beta and production. In beta, you'll miss deletes and the GC will have to clean up later. In production, you'll just get kicked out of the network.

I magically have an extra 450 gigs free after a long garbage collection process.

I'm not entirely sure this is related, but I noticed a delete in my log that failed with context canceled and a pieceinfodb error. I assume this one will be caught by the next garbage collection cycle. I just wonder why this would happen; the error kind of lacks detail.

2020-01-03T15:51:17.880Z ERROR piecestore delete failed {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "AASZJ4S6UEURWI4DXVQECFS6QW7OPLGUB7OMOZOUYQHP5ET3QHOA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:161\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Edit: I have 177 of these since Jan 1st. I had 24 in all of December. But I'm repairing my RAID array right now, so the additional load may play a role in the higher numbers.

Also got a bunch of those delete failed errors, from all four satellites, and just after the piece was uploaded:

root@server030:/disk002/logs/storagenode# grep "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA" *.log     
2020-01-03T08:09:33.298+0100        INFO    piecestore      upload started  {"Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-01-03T08:09:33.784+0100        INFO    piecestore      uploaded        {"Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT"}
2020-01-03T08:09:51.246+0100        ERROR   piecestore      delete failed   {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "TYD43JBPMQ4MIEOHX5MJBKLB6FAX7HLO25RLB5JGHAEWGO6XCIHA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

root@server030:/disk002/logs/storagenode# grep "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA" *.log
2020-01-03T07:53:32.197+0100        INFO    piecestore      upload started  {"Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT"}
2020-01-03T07:53:38.736+0100        INFO    piecestore      uploaded        {"Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT"}
2020-01-03T07:54:27.680+0100        ERROR   piecestore      delete failed   {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "DKTAXU2BI7FSS3AYVYH4VXBNLPIHUF4EKFXW5KWR3O63Z4QQ55GA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

root@server030:/disk002/logs/storagenode# grep "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA" *.log
2020-01-02T06:43:07.909+0100        INFO    piecestore      upload started  {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT"}
2020-01-02T06:43:08.128+0100        INFO    piecestore      uploaded        {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "PUT"}
2020-01-02T06:43:09.562+0100        INFO    piecestore      download started        {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2020-01-02T06:43:10.006+0100        INFO    piecestore      downloaded      {"Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2020-01-02T06:43:11.531+0100        ERROR   piecestore      delete failed   {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Piece ID": "DDH742SHUQSQKVZFVWF6MJLSA7ZCK7LSEF2RVV7MKMOMXQYYLWZA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

root@server030:/disk002/logs/storagenode# grep "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA" *.log
2020-01-03T07:25:56.521+0100        INFO    piecestore      upload started  {"Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2020-01-03T07:26:01.630+0100        INFO    piecestore      uploaded        {"Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT"}
2020-01-03T07:26:45.852+0100        ERROR   piecestore      delete failed   {"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Piece ID": "4NS2KR2Z3RNUB3V7GTX44A2ARONRTADCMUD4WRFJI5SUGDP34PIA", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:162\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

88 delete errors since 01-01-2020, running the latest version and no RAID rebuilding :wink: :

root@server030:/disk002/logs/storagenode# grep "delete failed" *.log | grep -c 'pieces error: context canceled; v0pieceinfodb error'
88
root@server030:/disk002/logs/storagenode# head -q  *.log | grep "running on version" | sort | head -n1
2020-01-01T18:34:36.126+0100    INFO    version running on version v0.28.4

I got 115 of those errors too since 2019-12-18, not sure when the first one occurred.
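
If the logs reach back far enough, the first occurrence is easy to pull out, since the timestamps at the start of each line sort cleanly (assuming rotated .log files like the ones above):

grep -h "v0pieceinfodb error: context canceled" *.log | sort | head -n1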

I got this too:

ERROR   piecestore      delete failed   {"Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Piece ID": "DXGWUFUJYNENLF4AES66GGZD2NQWY4XAJS3GREWJMPXTBIAFN7UQ", "error": "pieces error: context canceled; v0pieceinfodb error: context canceled", "errorVerbose": "pieces error: context canceled; v0pieceinfodb error: context canceled\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:283\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Delete:161\n\tstorj.io/storj/pkg/pb.DRPCPiecestoreDescription.Method.func3:1080\n\tstorj.io/drpc/drpcserver.(*Server).doHandle:175\n\tstorj.io/drpc/drpcserver.(*Server).HandleRPC:153\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:114\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:147\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Context canceled means my node was not among the fastest set of nodes, right?
Is this applicable to delete messages?

Thanks!

Logically speaking, it is not. With uploads and downloads, the action is intentionally over-provisioned so that if a few transfers fail or are very slow, it doesn't impact the transfer as a whole. For deletes this really doesn't apply, as the data should be deleted from all nodes. Now I'm sure garbage collection will clean up these pieces eventually, but that's a fallback and doesn't mean the initial problem doesn't need to be solved.

Yeah… that is exactly how I thought it should work.
So I am surprised by this message.

Deletions should be network-wide.

I know a PUT can fail now and then; that is not such a big deal.

Is there anything I can do to help with this? Logs?

Thanks!