Runtime: unexpected return pc for runtime.greyobject

I just noticed my storagenode container got restarted, with "fatal error: index out of range" in the logs.
Seems like this one is new:

2019-09-09T11:28:53.471Z        INFO    piecestore      deleted {"Piece ID": "QGDKDF7DY5X2GZLQAIZ2GTEEA7UIFLCCB3PFWL5KNGWO7WHMNMWA"}
2019-09-09T11:29:04.098Z        INFO    piecestore      deleted {"Piece ID": "D3VZNIYT5ZMNSWFN6A7JZFIIJVHZVPXGV7N3JBL6UDVQ4ND52GBA"}
2019-09-09T11:29:14.221Z        INFO    piecestore      download started        {"Piece ID": "IWZWSITTYDYI33JVNNPEXRB4CHQK4PQIDPL5GG4SKAHXIGFGBK7Q", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
2019-09-09T11:29:16.581Z        INFO    piecestore      downloaded      {"Piece ID": "IWZWSITTYDYI33JVNNPEXRB4CHQK4PQIDPL5GG4SKAHXIGFGBK7Q", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
fatal error: index out of range

runtime stack:
runtime: unexpected return pc for runtime.greyobject called from 0x0
stack: frame={sp:0x7fa21268ed98, fp:0x7fa21268edc8} stack=[0x7fa21266f248,0x7fa21268ee48)
00007fa21268ec98:  0000000000eb3088  0000000000000001
...

The node seems to be running fine otherwise and I found no issues on the host. Is this something to worry about?

That would be a debug stack trace for a crash after the fatal error. If the node auto-restarted, I wouldn’t worry too much about it unless you start seeing more of them.

The container crashed again today with the same output:

2019-09-12T07:24:42.297Z        INFO    piecestore      download started        {"Piece ID": "TX6RXHJ4L3ARD4WAGECGHATNADMYSQHYWIUXAZW5YXYPVMV353OQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET"}
fatal error: index out of range

runtime stack:
runtime: unexpected return pc for runtime.greyobject called from 0x0
stack: frame={sp:0x7f55ff4b6d98, fp:0x7f55ff4b6dc8} stack=[0x7f55ff497248,0x7f55ff4b6e48)
00007f55ff4b6c98:  0000000000eb3088  0000000000000001

I recommend creating a post in the troubleshooting section of the forums, outside this thread, with the full stack trace, OS, CPU architecture, and any other info you have on these particular crashes.


Already moved. Thank you!


Ok sure, here’s my full trace log https://pastebin.com/u5cpW0CW

Here are my host/os stats:

# lscpu  | egrep "Model name|Architecture"
Architecture:          x86_64
Model name:            Intel(R) Atom(TM) CPU N2800   @ 1.86GHz

# cat /etc/*release | grep PRETTY_NAME
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3927        1649        1889          92         388        2089
Swap:          4095         121        3974

# mount | grep data
rpool/data on /rpool/data type zfs (rw,noatime,xattr,noacl)

The container data is mounted on a zfs filesystem. Could that be causing problems, maybe?

Wow, that’s a concerning one, especially happening more than once. This indicates a crash in Go’s garbage collection subsystem, which suggests heap corruption. That, in turn, points to either a relatively serious and easy-to-trigger bug in the Go runtime itself, or a host with faulty RAM. Since it looks like no one else (among storage node operators or the wider Go community) has encountered this, I’m afraid I have to suspect a hardware problem. Would you be able to take the machine offline long enough to perform a memtest86 check?
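
If taking the machine offline for a full memtest86 boot is hard to arrange, a user-space pass with memtester can sometimes catch faulty RAM too, though it’s less thorough since it can’t test memory the OS is already using. A rough sketch, assuming a Debian host and about 2 GB of free memory to lock for the test (size it to whatever "free -m" reports as available):

# apt-get install memtester
# memtester 2048M 3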


I don’t think zfs could be to blame, because if the OS and Go runtime are working correctly, the filesystem shouldn’t be able to cause heap corruption.


@thepaul thanks for checking.
I’ll keep an eye on it for now. If it keeps crashing, I’ll check the hardware. Will update here if I find anything.

I noticed my box was running at around 85% used RAM, and the ZFS ARC accounted for the majority of that. The ARC is a cache, but it seems that when it grew and hogged RAM, it squeezed the storagenode’s memory down while total used RAM stayed at ~85%. Since I limited the ARC to 20% of system RAM, it hasn’t crashed for about 10 days now.
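
For reference, this is roughly how the ARC cap can be set with ZFS on Linux (a sketch; the 800 MB value is just what ~20% of my 4 GB works out to, and on some ZFS versions the runtime change only fully takes effect once the ARC shrinks or after a reboot):

# echo 838860800 > /sys/module/zfs/parameters/zfs_arc_max
# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=838860800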

It looks like the ARC was increasing memory pressure, and Docker couldn’t allocate memory beyond a certain point (85%?), causing the container to crash. Does this contradict my trace logs?
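
For what it’s worth, here is how I could check whether the kernel actually OOM-killed the container (a sketch; it assumes the container is named storagenode and that the check runs right after a crash):

# docker inspect --format '{{.State.OOMKilled}}' storagenode
# dmesg | grep -i -E "out of memory|oom"
# docker stats --no-stream storagenode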

Sorry for the delay in responding; I think I missed a notification. I don’t think a memory allocation failure could cause that sort of crash within the GC subsystem; if it did, that would be a pretty major bug in Go (which doesn’t mean it’s impossible, by any means, just that somebody else probably would have encountered it by now if that were the case). So I don’t think the problem is just memory pressure, but it does sound like memory pressure is related. If there really is some bad RAM, then perhaps lower memory pressure results in the bad regions not being used as often, through whatever emergent dynamics are present on your system. So that’s still my chief theory. But if your node is running well now, then great!
