A short (xmas?) story on silently failing audits

Some time ago I was setting up a new node on a Thinkpad x200 I got for pennies. Decent hardware for a node (I used to run a node on an X61 with no problems), though obviously very old at this point. The node was working for about a week, but at some point I noticed that it was slowly failing audits. First on us1, then on eu1 and ap1, the audit score very slowly started going down.

Obviously I checked the logs. Given this was a young node, I still had all of its logs since the node’s beginning. And, well, I kept grepping again and again, and all GET_AUDITs were properly successful. At first I thought it might be just a fluke, some corrupted IP packet or a malicious neutrino hitting some chip, etc… this was the first time in my experience as an SNO that I had seen audit scores below 100%. Yet the scores kept going down. I think it took two, maybe three weeks to reach 97% on us1.

I started thinking about what could be wrong here. The node believes it is successfully answering the audits, yet the satellite counts some of them as failed. I began to suspect that the failure is reported by the final reconstruction check after the satellite receives all the audit pieces. So maybe my node was sending incorrect data? And if so, given that the storage node code is solid, it had to be something else corrupting the data: maybe some other software, maybe hardware.

I started testing everything around it: the USB cable my storage drive was connected with (I had a bad experience with one in the past), network cables, storage… at some point I ran memtest86 to test the memory chips. And I found it! There were some bad areas on one of the chips, roughly at the end of the physical address space. I restarted with the memtest kernel option, which tests memory on startup and marks bad areas, and after a few hours I started seeing the audit scores going up!

Faulty memory chips may affect data at two stages: reading from the file (i.e. the stored data is OK, but egress is affected) or writing to the file (i.e. we store incorrect data already at the ingress stage). Right now I think there’s no way to explicitly verify the files, so I cannot exclude the latter. Yet seeing the audit scores increase, I suspect that most data managed to pass the writing stage with no corruption, especially since it has already been more than a week since I added the memtest option and the scores only go up now. Maybe the fact that the faulty areas were at the end of the address space somehow kept the node from crashing (the laptop didn’t crash even once during the whole operation!), and that area was mostly used for read caches?

I’ll keep the node alive for “science”…

Some concluding remarks:

  • Not all audit failures are visible in the logs. If your monitoring is based only on checking logs with tools like logcheck, that’s not enough. Though, if a node receives an update of statistics from a satellite that decreases a score, maybe it should be explicitly logged at the WARN level? (In the meantime, an external watcher can approximate this; see the sketch after this list.)
  • Broken hardware may still work well enough to fake correct operation. But the audits found the problem. Storj technology is great! I wonder, though, how regular downloads were affected.
  • The memtest kernel option doesn’t take much time on startup and does not affect regular operation. I think it’s worth considering on all hardware that acts as a server, whether or not it actually has known-bad RAM chips.
  • It would be nice if the original hash of the piece were stored in the file header; then I could scan the pieces and verify their correctness manually.
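For the monitoring point above, a small external watcher can poll the node’s local dashboard API and log a warning whenever a score drops. Below is a minimal sketch in Go; it assumes the dashboard listens on localhost:14002 and that GET /api/sno/satellites returns an audits array with satelliteName and auditScore fields, so verify those names against your own node before relying on it.

```go
// auditwatch.go: a minimal sketch of an external audit-score watcher.
// Assumptions (verify against your own node): the dashboard listens on
// localhost:14002 and GET /api/sno/satellites returns a JSON object with
// an "audits" array containing "satelliteName" and "auditScore" fields.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type satellitesResponse struct {
	Audits []struct {
		SatelliteName string  `json:"satelliteName"`
		AuditScore    float64 `json:"auditScore"`
	} `json:"audits"`
}

func main() {
	last := map[string]float64{}
	for {
		resp, err := http.Get("http://localhost:14002/api/sno/satellites")
		if err != nil {
			log.Printf("WARN dashboard not reachable: %v", err)
		} else {
			var data satellitesResponse
			if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
				log.Printf("WARN cannot decode response: %v", err)
			}
			resp.Body.Close()
			// Warn whenever a satellite's audit score decreases between polls.
			for _, a := range data.Audits {
				if prev, ok := last[a.SatelliteName]; ok && a.AuditScore < prev {
					log.Printf("WARN audit score on %s dropped: %.4f -> %.4f",
						a.SatelliteName, prev, a.AuditScore)
				}
				last[a.SatelliteName] = a.AuditScore
			}
		}
		time.Sleep(time.Hour)
	}
}
```

Run under cron or systemd and feed the output into the same place as the node logs, and the drop becomes visible even when every GET_AUDIT in the node’s own log looks successful.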

This is a problem: the satellite should inform the node about a failed audit, along with the reason it failed; otherwise only file-not-found errors get logged.

Yes. It would make it possible to run a “scrub” on the files. For example, in your case: are those files corrupted in storage, or were they corrupted only when sending the response to the audit request, such that a repeated audit would succeed?

In a real server, ECC memory should catch these problems and possibly correct them. I wonder how long it would take to test 100 GB of RAM with this.

Properly testing memory takes time, as not all failures are easy to discover; sometimes the memory needs to reach a certain temperature, etc. This is why ECC memory is important for servers. Also, even perfectly working memory can get hit by a cosmic ray and flip a bit.

It seems they’re fine in storage, or at least enough of them are that the node no longer fails too many audits. I can’t tell more than that.

The memtest switch is not a full replacement for memtest86, but it’s enough to quickly find the most obvious problems. The biggest machine I have available right now has 16 GB of RAM, and the test takes about 2 minutes. Given that I restart it maybe once a year, it’s not long.

You’ll be pleased to know that the original hash is stored in the file header! It’s just that it’s encoded in a protobuf that isn’t readable by standard tools.

The first 2 bytes of each piece file give the length of the protobuf-encoded header that follows. If you decode that as a PieceHeader, the hash field contains the hash of the piece (which should correspond to everything in the piece file after the first 512 bytes). The hash_algorithm field tells you if the hash used was SHA256 (0) or Blake3 (1).
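Based on that description, a piece could be checked offline roughly like this. A minimal sketch in Go, under clearly stated assumptions: that the 2-byte length prefix is big-endian, that the header decodes with a pb.Unmarshal helper and the PieceHeader type from storj.io/common/pb, and that the piece uses SHA256 (Blake3 pieces would need a Blake3 hasher instead). Treat the exact package path, helper, and field names as assumptions to check against the storagenode source.

```go
// verifypiece.go: a rough sketch of checking one piece file offline.
// Assumptions to verify against the storagenode source: the 2-byte length
// prefix is big-endian, the header decodes as pb.PieceHeader via a
// pb.Unmarshal helper from storj.io/common/pb, and the stored hash covers
// everything after the first 512 bytes of the file.
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"

	"storj.io/common/pb"
)

func main() {
	f, err := os.Open(os.Args[1]) // path to a single piece (.sj1) file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// First 2 bytes: length of the protobuf-encoded PieceHeader that follows.
	var hdrLen uint16
	if err := binary.Read(f, binary.BigEndian, &hdrLen); err != nil {
		log.Fatal(err)
	}
	hdrBytes := make([]byte, hdrLen)
	if _, err := io.ReadFull(f, hdrBytes); err != nil {
		log.Fatal(err)
	}
	var header pb.PieceHeader
	if err := pb.Unmarshal(hdrBytes, &header); err != nil {
		log.Fatal(err)
	}
	if header.HashAlgorithm != 0 { // 0 = SHA256, 1 = Blake3 (per the post above)
		log.Fatal("this sketch only handles SHA256 pieces")
	}

	// The stored hash should cover everything after the first 512 bytes.
	if _, err := f.Seek(512, io.SeekStart); err != nil {
		log.Fatal(err)
	}
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}

	if bytes.Equal(h.Sum(nil), header.Hash) {
		fmt.Println("piece OK")
	} else {
		fmt.Println("piece hash MISMATCH")
	}
}
```

Looping something like this over the blobs directory would be the manual “scrub” mentioned earlier in the thread.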


:man_facepalming: I missed it! Thank you; for some reason I thought it would be part of the OrderLimit, not a separate field in PieceHeader.
