Please help us verify that garbage collection works as expected.
What do you need?
A big logfile that covers a few weeks of your node’s lifetime.
Add storage2.retain-status: "debug" to your config file and set log.level: debug as well. In debug mode the storage node executes garbage collection without deleting anything; it only writes log messages with the pieceIDs it has identified as garbage. Please use debug mode only for now!
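Put together, the relevant lines in the node’s config file would look like this (the two keys are the ones named above; plain ASCII quotes only):

```yaml
# debug mode: log the pieceIDs identified as garbage, delete nothing
storage2.retain-status: "debug"
log.level: debug
```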
Restart your storage node and wait until the satellite sends you a garbage collection message. I will let you know when this should happen.
At some point you will see About to delete piece id or Deleted 0 pieces during retain in your logfile. Do you get any of these messages?
Copy the pieceIDs and search for them in the logfile. A regular expression like grep 'ID1\|ID2\|ID3' is useful here. We don’t want to see any audit requests for these pieceIDs after the garbage collection message.
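As a self-contained sketch of that check: the logfile path, the sample log lines, and the GET_AUDIT wording below are all hypothetical stand-ins; only the “About to delete piece id” message comes from the node’s output.

```shell
# Sketch: /tmp/storagenode.log stands in for the real logfile; the sample
# lines (including the GET_AUDIT wording) and the IDs are hypothetical.
cat > /tmp/storagenode.log <<'EOF'
DEBUG retain About to delete piece id (ID1)
DEBUG retain About to delete piece id (ID2)
INFO piecestore GET_AUDIT download started (ID9)
EOF

# All suspect pieceIDs go into one pattern; \| separates the alternatives.
# No match is the result we want: no audit traffic on garbage pieces.
if grep 'ID1\|ID2' /tmp/storagenode.log | grep -qi audit; then
  result=warn
  echo "WARNING: audit traffic on garbage pieceIDs"
else
  result=ok
  echo "OK: no audit requests for the garbage pieceIDs"
fi
```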
I don’t expect any performance impact in round 1 because we are starting with a smaller satellite.
My storage node is prepared as well. @stefanbenten will join us with his pi3 storage node. I think we are ready to go. I will ask devops to push the button.
Meanwhile the second satellite sent me its bloom filter as well.
Update: The storage node iterates over its database with a batch size of 1000. In debug mode it doesn’t delete the pieces, but it still recalculates the offset as if it had deleted them. That is why it prints the same pieceIDs multiple times. A fix is incoming; this only affects debug mode.
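A rough sketch of that offset bug (not the node’s actual Go code; the piece names, the flagged set, and the batch size of 2 are all hypothetical): after a real delete the remaining rows shift down, so the delete path advances the offset by (batch - deleted). Debug mode deletes nothing but keeps the same offset math, so flagged pieces get read and logged more than once.

```shell
pieces="p1 p2 p3 p4 p5 p6"   # hypothetical piece IDs in the database
garbage=" p2 p4 "            # hypothetical pieces flagged by the bloom filter
batch=2
offset=0
log=""
set -- $pieces
while [ "$offset" -lt "$#" ]; do
  matched=0
  i=0
  for p in "$@"; do
    # read one batch starting at the current offset
    if [ "$i" -ge "$offset" ] && [ "$i" -lt $((offset + batch)) ]; then
      case "$garbage" in *" $p "*)
        echo "About to delete piece id ($p)"
        log="$log $p"
        matched=$((matched + 1)) ;;
      esac
    fi
    i=$((i + 1))
  done
  # bug: in debug mode nothing was deleted, so this should advance by the
  # full batch size instead of compensating for "deleted" rows
  offset=$((offset + batch - matched))
done
# p2 and p4 are each printed twice
```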
Garbage collection gets triggered by the satellite. On the storage node side you should see one message (which we missed because of the log level) followed by a long break.
For later usage a few more commands: grep 'About to delete piece id' /storagenode/storagenode.log | cut -d '(' -f 2 | cut -d ')' -f 1 | sort | uniq | tr '\n' ' ' | sed 's/ /\\|/g'
This will generate the regular expression that can be used to grep the logfile for these specific pieceIDs.
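To see the pipeline in action end to end, here is a self-contained sketch against a tiny stand-in logfile. One caveat: the pipeline as posted leaves a trailing \| in the pattern, and that empty alternative matches every line, so the sketch trims it with one extra sed.

```shell
# Self-contained sketch: /tmp/sn.log stands in for /storagenode/storagenode.log
cat > /tmp/sn.log <<'EOF'
About to delete piece id (AAA)
About to delete piece id (BBB)
About to delete piece id (AAA)
EOF

# Same pipeline as above, plus one extra sed trimming the trailing '\|'
PATTERN=$(grep 'About to delete piece id' /tmp/sn.log \
  | cut -d '(' -f 2 | cut -d ')' -f 1 \
  | sort | uniq | tr '\n' ' ' | sed 's/ /\\|/g' | sed 's/\\|$//')
echo "$PATTERN"   # AAA\|BBB

# Use the generated pattern to grep the logfile for those pieceIDs
grep "$PATTERN" /tmp/sn.log
```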
I want to make sure that the following is correct before I make the change permanent:
# log all GRPC traffic to zap logger
server.debug-log-traffic: true
storage2.retain-status: “debug”
Especially with curly quotes (”) the node will restart indefinitely.
Please never use Notes or other word processors for code; use a plain-text editor instead.
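A quick way to spot smuggled-in curly quotes before restarting the node (the config path below is a stand-in, and the sample file reproduces the failure case on purpose):

```shell
# Create a stand-in config containing a curly-quoted value (the failure case)
printf 'storage2.retain-status: “debug”\n' > /tmp/config.yaml

# Any hit here means a word processor replaced the plain ASCII quotes
grep -n '“\|”' /tmp/config.yaml
```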
If you want to participate, then please configure your node as I posted.
This time we will activate garbage collection on the biggest satellite and watch performance. We might even restart the satellite a few times to repeat it.
We are aware that the behavior on the storage node side will look strange. You will see the same pieceIDs multiple times because the offset calculation is wrong. This will be fixed with the next release.
The storage node has only 20 seconds to finish garbage collection, or it will be canceled by the satellite. We don’t have a fix for that issue yet. For safety reasons we should keep debug mode on to avoid any risks.
That was a shocking amount of messages in the log. Indeed lots of duplicates. Grepping through the pieceIDs took a long time, but I see no audits or other traffic on these pieces after the “About to delete” messages. So all looks good!
Fix for issue one is working (installed on my storage node) and will be deployed with the next release.
My storage node is running into the timeout. The good news is that in these 20 seconds my node managed to get close to 50%.
Next step:
I will now enable garbage collection on my storage nodes. Please keep your nodes on debug mode! I have installed a fix on my storage node that is missing on your node. Don’t risk it!
A few PRs to change default configs:
Run garbage collection every 5 days instead of 7, because I’d like one additional run within our 14-day release cycle.
Enable garbage collection on all satellites.
Enable garbage collection on all storage nodes (again: please don’t do it now, wait for the fix!).
Add a note to the changelog that we are aware of the timeout bug.