Reducing log noise by changing the error tag for lost races

I try to keep logs to a minimum: space is precious, I want to reduce I/O on my drives, and I want to be able to keep an eye out for important errors. But the logs are full of errors from lost races, which are perfectly normal on healthy nodes and don't indicate a real problem requiring immediate action. This clutter makes it almost impossible to spot real problems without filters.
I propose, as many others have, changing the error tag to warn or info for lost-race errors from uploads and downloads.
This way, setting the log level to error will produce more useful logs without the unneeded clutter.
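
For context, this is the knob in question: a minimal sketch of the relevant line in the node's config.yaml (assuming the standard log.level key). With the proposed change, this setting would hide lost-race noise while still surfacing real errors:

log.level: error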

This would require implementing intelligent logic to distinguish race-loss errors from errors related to your network (like cancellations caused by software on your computer), your disk (I/O timeouts, unreadable blobs, etc.), or overload (a timeout, but you would need to detect what kind of timeout: did the client stop responding, or is your hardware overloaded or broken somewhere, like a RAM issue?).
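
To make the ambiguity concrete, here is a hypothetical classifier sketched in PowerShell for post-processing (the match patterns are assumptions, not actual storagenode log formats). The same "canceled" error string can come from a lost race or from a dropped client connection, so a simple pattern match cannot assign the right severity:

function Get-LogCategory {
    param([string]$Line)
    # 'context canceled' can mean a lost race OR a client-side network drop;
    # the log line alone does not say which - that is the ambiguity described above.
    if ($Line -match 'context canceled') { return 'lost-race-or-network' }
    if ($Line -match 'i/o timeout')      { return 'disk-or-overload' }
    return 'other'
}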

I welcome you to create a pull request on our GitHub with such intelligent logic; we will be happy to review and merge it!

Please note: such logic would require additional CPU cycles while your node is trying to serve the request, competing with 109 other nodes (38 for downloads).

I would suggest instead implementing post-processing logic using Grafana Loki, an ELK stack, or anything similar, including filtering scripts (see the sketch below), and keeping the source logs short-lived.
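
As a rough illustration of such a filtering script, a PowerShell sketch (the log file name and the match patterns are assumptions; adjust them to whatever your node actually logs for lost races):

# Keep only ERROR lines that do not look like lost races (patterns are assumptions)
Get-Content X:\storagenode2\node.log |
    Where-Object { $_ -match ' ERROR ' -and $_ -notmatch 'context canceled' } |
    Set-Content X:\storagenode2\node-errors.log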

I personally configured logrotate and keep 12 one-month archives for further investigation in case of bugs in our software, to help developers fix them more quickly.
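
A minimal sketch of a logrotate rule matching that setup (the paths and log file name are assumptions, not my exact configuration):

# /etc/logrotate.d/storagenode (hypothetical path)
/var/log/storagenode/node.log {
    monthly
    rotate 12
    compress
    missingok
    notifempty
    copytruncate
}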
The archives use only 2.8 GiB of precious space:

# Total size of the compressed log archives (PowerShell's 1GB constant is 2^30 bytes, i.e. GiB)
(ls X:\storagenode2\*.gz | Measure-Object -Property Length -Sum).Sum / 1GB
2.80363264027983