OOM Killer invoked due to slow disk I/O

I was having an issue with my rock64 (similar to a Pi 4) hitting huge I/O waits under heavy load. What seemed to happen is that some operation would hit the HDD (garbage collection, perhaps), and then requests would start piling up. When there was heavy load on the network, this backlog would eventually overwhelm the system and the OOM killer would kick in. I found that setting the max concurrent requests to 40 alleviated the problem. You would need to add this line to your config.yaml file (or un-comment it if it's already there) and then restart the node:

storage2.max-concurrent-requests: 40

With this setting I am no longer having any OOM problems, since the node stops accepting new requests when it gets overwhelmed. The rejections don't happen too often, and only for short periods. With a setting of 40 I was getting an acceptance rate of 92%; I recently upped the limit to 50 and now see about 96%. I think this setting is a good trade-off for low-powered nodes, as a stop-gap measure in case your HDD is getting thrashed. Keep in mind that 40 might not be the optimal value for you, so do some testing to figure out where the sweet spot is.
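If you want to check your own acceptance rate, here is a rough Python sketch of how I'd estimate it by counting started vs. rejected uploads in the node log. The log path and the exact message substrings ("upload started", "upload rejected") are assumptions on my part; check what your node version actually logs and adjust them to match.

```python
#!/usr/bin/env python3
# Rough sketch: estimate the upload acceptance rate from the node's log file.
# Assumptions (adjust for your setup): the log is plain text, accepted uploads
# log a line containing "upload started", and uploads refused because of the
# concurrency limit log a line containing "upload rejected".

import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "node.log"  # hypothetical path

started = 0
rejected = 0

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if "upload started" in line:
            started += 1
        elif "upload rejected" in line:
            rejected += 1

total = started + rejected
if total == 0:
    print("No matching upload lines found; check the log path and substrings.")
else:
    # Acceptance rate = accepted uploads / all upload attempts (accepted + rejected).
    print(f"started: {started}, rejected: {rejected}, "
          f"acceptance rate: {started / total:.1%}")
```

You can then nudge max-concurrent-requests up or down until the rate settles somewhere you're comfortable with.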
