Help us to test garbage collection

In the past few days I have tested garbage collection to make sure it deletes only garbage and nothing else. If you want to know more about garbage collection you should take a look at https://github.com/storj/storj/blob/538e8168a524a82ef63c6afd2b6d93bb044dca81/docs/design/garbage-collection.md
The next step is to enable it on one or two satellites to catch possible performance impacts and to double check which pieces it would delete.

Round 1:

Please help us to verify that the garbage collection works as expected.
What do you need?

  1. A big logfile that covers a few weeks of your node's lifetime.
  2. Add storage2.retain-status: "debug" to your config file and set log.level: debug as well. Debug mode means the storage node will execute garbage collection without deleting anything; it will only create log messages with the pieceIDs it identified as garbage. Please use only debug mode for now!
  3. Restart your storage node and wait until the satellite sends you a garbage collection message. I will let you know when this should happen.
  4. At some point you will see About to delete piece id or Deleted 0 pieces during retain in your logfile. Do you get either of these messages?
  5. Copy the pieceIDs and search for them in the logfile. A regular expression like grep 'ID1\|ID2\|ID3' is useful here. We don’t want to see any audit requests for these pieceIDs after the garbage collection message.
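Steps 4 and 5 can be sketched as a single shell pass. Everything below is illustrative: the log path, the line formats, and the GET_AUDIT marker are assumptions, so adjust them to match your node's actual log.

```shell
# Synthetic log standing in for a real node log (line formats assumed).
cat > /tmp/gc-check.log <<'EOF'
2019-08-06T15:00:00.000+0200 DEBUG piecestore About to delete piece id (AAAA1111)
2019-08-06T15:00:01.000+0200 DEBUG piecestore About to delete piece id (BBBB2222)
2019-08-06T15:10:00.000+0200 INFO piecestore download started (CCCC3333) GET_AUDIT
EOF

# Step 5: collect the flagged pieceIDs into one alternation pattern...
ids=$(grep 'About to delete piece id' /tmp/gc-check.log \
  | cut -d '(' -f 2 | cut -d ')' -f 1 | sort -u | paste -sd '|' -)

# ...and look for audit traffic touching any of those pieces.
grep 'GET_AUDIT' /tmp/gc-check.log | grep -E "$ids" || echo "no audits on flagged pieces"
```

If the final grep prints nothing, no audit hit any of the flagged pieces, which is the result we want.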

I don’t expect any performance impact in round 1 because we are starting with a smaller satellite.

2 Likes

Added the line and restarted the node. I have log files going all the way back to June 14th.

Waiting for your signal (step 3).

1 Like

My storage node is prepared as well. @stefanbenten will join us with his pi3 storage node. I think we are ready to go. I will ask devops to push the button.

It is activated on two satellites now. I received my first bloom filter. The number of garbage pieces is concerning.

2019-08-06T15:18:27.257+0200 DEBUG piecestore piecestore/endpoint.go:652 Deleted 50448 pieces during retain. RetainStatus: debug

I will have to double check that.

Looks like we need log level debug as well. That wasn’t part of the initial instructions, but I’ll switch now.

My fault and good point. I will update it.

It turns out that the 50,000 delete messages are duplicates. I can see the same set of pieceIDs 2300 times.

grep 'About to delete piece id' /storagenode/storagenode.log | cut -d '(' -f 2 | cut -d ')' -f 1 | sort | uniq -c | sort -n

Meanwhile the second satellite sent me the bloom filter as well.

Update: The storage node iterates over its database with a batch size of 1000. In debug mode it doesn’t delete the pieces, but it recalculates the offset as if it had deleted them. That is why it prints the same pieceIDs multiple times. A fix is incoming. It only affects debug mode.
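Given that explanation, you can quantify the duplication by comparing the total number of delete messages against the number of unique pieceIDs. This is a sketch against a synthetic log; the real path and duplication factor will differ on your node.

```shell
# Synthetic log mimicking the debug-mode offset bug: one pieceID printed twice.
cat > /tmp/gc-dupes.log <<'EOF'
DEBUG About to delete piece id (AAAA1111)
DEBUG About to delete piece id (AAAA1111)
DEBUG About to delete piece id (BBBB2222)
EOF

total=$(grep -c 'About to delete piece id' /tmp/gc-dupes.log)
unique=$(grep 'About to delete piece id' /tmp/gc-dupes.log \
  | cut -d '(' -f 2 | cut -d ')' -f 1 | sort -u | wc -l | tr -d ' ')
echo "total=$total unique=$unique"   # here: total=3 unique=2
```

On the node above this showed the same set of pieceIDs repeated roughly 2300 times.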

I have not seen either message yet. Should I have? I’ve been running the debug log for 5 hours now.

$ grep 'About to delete piece id' /volume1/storj/v3/data/node.log
$ grep 'pieces during retain' /volume1/storj/v3/data/node.log
$

Garbage collection gets triggered by the satellite. On the storage node side you should see one message (which we missed because of the log level) followed by a long break.

For later usage a few more commands:
grep 'About to delete piece id' /storagenode/storagenode.log | cut -d '(' -f 2 | cut -d ')' -f 1 | sort -u | tr '\n' ' ' | sed 's/ $//;s/ /\\|/g'
This will generate the regular expression that can be used to grep the logfile for these specific pieceIDs. (The trailing separator is stripped, because a pattern ending in \| would match every line.)
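The generated pattern is a basic regular expression, so it drops straight into a plain grep. A sketch with placeholder IDs; substitute the real output of the command above.

```shell
pattern='AAAA1111\|BBBB2222'                         # placeholder pattern
printf '%s\n' 'audit AAAA1111' 'upload CCCC3333' > /tmp/sample.log
grep "$pattern" /tmp/sample.log                      # matches only the audit line
```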

1 Like

Just where in the config file would the storage2. retain-status: “debug” line get placed? Please, no LOLing.

At the end of the file. Please take a closer look: there are no spaces in the parameter name

storage2.retain-status: "debug"

Don’t forget to set the log level to debug too

log.level: debug

Save the file and restart the container:

docker restart -t 300 storagenode

# log all GRPC traffic to zap logger
server.debug-log-traffic: true

Is this what you are referring to, @Alexey?

Does that match what he has written? I don’t see it.

I want to make sure that the following is correct before I make the change permanent # log all GRPC traffic to zap logger
server.debug-log-traffic: true
storage2.retain-status: “debug”

It’s not. @Alexey posted EXACTLY what you should change, in code blocks even. I’m not sure what the confusion is.

No.

Especially with curly quotes it will restart indefinitely.
Please never use Notes or other word processors for code; use a plain text editor instead.
If you want to participate, then please configure it like I posted

1 Like

Round 2:

This time we will activate garbage collection on the biggest satellite and watch performance. We might even restart the satellite a few times to repeat it.

We are aware that the behavior on the storage node side will look strange.

  1. You will see the same pieceIDs multiple times because the offset calculation is wrong. This will be fixed with the next release.
  2. You have only 20 seconds to finish garbage collection or it will be canceled by the satellite. We don’t have a fix for that issue yet. For safety reasons, that means we should keep it in debug mode to avoid any risks.
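Given the 20-second budget, a quick way to see how long your node's retain pass actually ran is to compare the first and last delete-message timestamps. A sketch on synthetic lines; the timestamp format is assumed from the node log excerpts in this thread.

```shell
cat > /tmp/gc-timing.log <<'EOF'
2019-08-06T15:18:00.100+0200 DEBUG About to delete piece id (AAAA1111)
2019-08-06T15:18:14.900+0200 DEBUG About to delete piece id (BBBB2222)
EOF

first=$(grep 'About to delete piece id' /tmp/gc-timing.log | head -n 1 | cut -d ' ' -f 1)
last=$(grep 'About to delete piece id' /tmp/gc-timing.log | tail -n 1 | cut -d ' ' -f 1)
echo "retain pass ran from $first to $last"
```

If the two timestamps are close to 20 seconds apart and the pass looks truncated, your node likely hit the satellite timeout.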

I have just noticed this thread and have added the entries to my config as well. Awaiting the About to delete message. :slight_smile:

1 Like

That was a shocking amount of messages in the log. Indeed lots of duplicates. Grepping through the pieceIDs took a long time, but I see no audits or other traffic on these pieces after the “About to delete” messages. So all looks good!

1 Like

Test results:

Performance on the satellite looks good.

[3659231273988476882] storj.io/storj/satellite/gc.(*Service).Run.func1
  parents: 5886239439356780943
  current: 0, highwater: 1, success: 1, errors: 0, panics: 0
  success times:
    0.00: 2h2m20.163072s
    0.10: 2h2m20.163072s
    0.25: 2h2m20.163072s
    0.50: 2h2m20.163072s
    0.75: 2h2m20.163072s
    0.90: 2h2m20.163072s
    0.95: 2h2m20.163072s
    1.00: 2h2m20.163072s
    avg: 2h2m20.163150448s
    ravg: 2h2m20.163072s
  failure times:
    0.00: 0s
    0.10: 0s
    0.25: 0s
    0.50: 0s
    0.75: 0s
    0.90: 0s
    0.95: 0s
    1.00: 0s
    avg: 0s
    ravg: 0s
  1. Fix for issue one is working (installed on my storage node) and will be deployed with the next release.
  2. My storage node is running into the timeout :frowning: The good news is that in these 20 seconds my node managed to get close to 50%.

Next step:
I will now enable garbage collection on my storage nodes. Please keep your nodes on debug mode! I have installed a fix on my storage node that is missing on your node. Don’t risk it!

A few PRs to change default configs:

  1. Run garbage collection every 5 days instead of 7 days, so that we get one additional run within our 14-day release cycle.
  2. Enable garbage collection on all satellites.
  3. Enable garbage collection on all storage nodes (again please don’t do it now. Wait for the fix!)
  4. Add a note to the changelog. We are aware of the timeout bug.
2 Likes