Some time ago I was setting up a new node on a ThinkPad X200 I got for pennies. Decent hardware for a node (I used to run a node on an X61 with no problems), though obviously very old at this point. The node was working for about a week, but at some point I noticed that it was slowly failing audits. First on us1, then on eu1 and ap1, the audit score very slowly started going down.
Obviously I checked the logs. Given this was a young node, I had all of its logs since the very beginning. And, well, I kept grepping again and again, and all GET_AUDITs were properly successful. At first I thought it might be just a fluke: some corrupted IP packet, a malicious neutrino hitting a chip, and so on. It was the first time in my experience as an SNO that I saw audit scores below 100%. Yet the scores kept going down. I think it took two, maybe three weeks to reach 97% on us1.
I started thinking about what could be wrong here. The node believes it is answering the audits successfully, yet the satellite counts some of them as failed. I began to suspect that the failure is reported by the final reconstruction check, after the satellite receives all the audit pieces. So maybe my node was sending incorrect data? And if so, given that the storage node code is solid, it had to be something else corrupting the data: maybe some other software, maybe hardware.
I started testing everything around the node: the USB cable the storage drive was connected with (I had a bad experience with one in the past), the network cables, the storage itself… At some point I ran memtest86 to test the memory chips. And I found it! There were some bad areas on one of the chips, roughly at the end of the physical address space. I restarted with the `memtest` kernel option, which tests memory on startup and marks bad areas, and after a few hours I started seeing the audit scores going up!
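For reference, on a Debian-style setup with GRUB this is just one extra kernel parameter (the value is the number of test passes); the exact file and update command may differ on other distributions, so treat this as a sketch rather than a recipe:

```
# /etc/default/grub — add the memtest parameter to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet memtest=4"

# then regenerate the GRUB config and reboot
sudo update-grub
```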
Faulty memory chips can affect data at two stages: while reading from a file (i.e. the stored data is fine, but egress is affected) or while writing to it (i.e. we already store incorrect data at the ingress stage). Right now I think there's no way to explicitly verify the files, so I cannot rule out the latter. Yet seeing the audit scores increase, I suspect that most data made it through the writing stage uncorrupted, especially since it's been more than a week since I added the `memtest` option and the scores only go up now. Maybe the fact that the faulty areas were at the end of the address space somehow kept the node from crashing (the laptop didn't crash even once during the whole operation!), and that area was mostly used for read caches?
I’ll keep the node alive for “science”…
Some concluding remarks:
- Not all audit failures are visible in the logs. If your monitoring is based only on checking logs with tools like logcheck, that's not enough. Though, if a node receives a statistics update from a satellite that decreases a score, maybe it should be logged explicitly at the WARN level? (See the sketch after this list.)
- Broken hardware may still work well enough to fake correct operation. But audits found the problem. Storj technology is great! I wonder though how regular downloads were affected.
- The `memtest` kernel option doesn't take much time on startup and does not affect regular operation. I think it's worth considering on any hardware that acts as a server, whether or not it actually has a known-bad RAM chip.
- It would be nice if the original hash of the piece were stored in the file header; then I could scan the pieces and verify their correctness manually (see the sketch below).