Facing a dying HDD

Now I see an even more interesting error: Got a signal from the OS: "terminated"
Any idea what is going on? (The drive is currently connected to a different machine than before.)

2023-12-31 17:52:12,883 WARN received SIGTERM indicating exit request
2023-12-31 17:52:12,890 INFO waiting for storagenode, processes-exit-eventlistener, storagenode-updater to die
2023-12-31T17:52:12Z INFO Got a signal from the OS: "terminated" {"Process": "storagenode-updater"}
2023-12-31 17:52:12,897 INFO stopped: storagenode-updater (exit status 0)
2023-12-31T17:52:12Z INFO Got a signal from the OS: "terminated" {"process": "storagenode"}
2023-12-31T17:52:12Z INFO piecestore downloaded {"process": "storagenode", "Piece ID": "IGMBY5JYUVXKVPIBRGVZ5XM27LCB732OLEAUN63A7BTWBH5SU74A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "Offset": 0, "Size": 290048, "Remote Address": "5.161.64.186:44440"}
2023-12-31T17:52:12Z INFO piecestore downloaded {"process": "storagenode", "Piece ID": "TRRDJZ7UB3CPJUG2TGCTMOFME77755IYR2SALDR5VC6VSXWJ4TOA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET", "Offset": 2061824, "Size": 4864, "Remote Address": "158.180.29.230:47726"}
2023-12-31T17:52:12Z INFO piecestore upload canceled {"process": "storagenode", "Piece ID": "27ZOTVSKG4L2IAX7U6RQWIWIFZFXWY5MIHGVEHXLXZNBT233CNKA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 327680, "Remote Address": "5.161.128.79:23270"}
2023-12-31T17:52:12Z INFO piecestore upload canceled {"process": "storagenode", "Piece ID": "QTHGYC2LJVCE5H2ON2G4PDRPILI3P26FJDOTKSXLOWW47473OSEA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 65536, "Remote Address": "79.127.205.225:51272"}
2023-12-31 17:52:13,431 INFO stopped: storagenode (exit status 0)
2023-12-31 17:52:13,434 INFO stopped: processes-exit-eventlistener (terminated by SIGTERM)

Currently I only have Ubuntu and macOS :slight_smile: But thank you for your suggestion :slight_smile:

You need to check the logs from before the restart. It’s either a new version installed by storagenode-updater or a FATAL error.

Alexey, could you give me more information on which exact log to check? In the previous message I showed what is visible in the log (the storagenode container log), but maybe you are talking about a different log?

I’m on Debian, which is running fine. I once did a trial with Ubuntu, which gave me a lot of trouble over and over: dying nodes, mount points turning read-only, and so on. Since I couldn’t get a grip on the root cause but knew it worked fine on Debian, I reverted back to Debian and have never left that path since.

You can search for FATAL errors in the logs:

docker logs storagenode 2>&1 | grep FATAL
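
If the grep comes back empty, it was most likely a restart triggered by storagenode-updater (see above). You can also look at what was logged right before the shutdown; a rough sketch, where the grep keywords are just guesses at relevant terms, not exact log strings:

# last lines before the restart, for context
docker logs --tail 200 storagenode 2>&1 | less
# lines that might mention the updater or a version change (keywords are guesses, adjust as needed)
docker logs storagenode 2>&1 | grep -iE "updater|version" | tail -n 50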

I have run nodes on Ubuntu for many years and never had any problem other than HDD failure.


All these backups are useless for a storagenode.
As soon as you restore it from a backup and bring it online, it will be disqualified for the data lost since the last backup, and that will happen quickly. It’s possible to run rsync more often, and there is a small chance that the missing data would be less than 4%, but I doubt it.
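
If you want to try the frequent-rsync route anyway, here is a minimal sketch; the paths are placeholders and it assumes a Docker node named storagenode:

# repeat while the node is online until the delta gets small
rsync -aHP /mnt/old-disk/storagenode/ /mnt/backup/storagenode/
# final pass with the node stopped so nothing changes mid-copy
docker stop -t 300 storagenode
rsync -aHP --delete /mnt/old-disk/storagenode/ /mnt/backup/storagenode/
docker start storagenode

Even then, everything uploaded after the last completed pass is still gone, which is why the chance of staying under the 4% is small.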

If you have questionable HDDs, then it would perhaps be better either to run a proper RAID with autocorrection, such as raidz or a mirror (ZFS works for both), or to run several nodes, each on its own HDD. Due to the node selection algorithm these nodes are treated as one big node, and the ingress traffic for your subnet is distributed between them. So if one disk dies, you lose only that one piece of the common data, not all of it as with mergerfs or other RAID0-style solutions with zero fault tolerance.
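
For example, a simple two-disk ZFS mirror can be created roughly like this (the pool name and device names are placeholders):

# mirrored pool with 4K-sector alignment; both disks are used, usable capacity of one
zpool create -o ashift=12 storjpool mirror /dev/sdb /dev/sdc
zfs create storjpool/storagenode

For the several-nodes approach there is no pooling layer at all: each node simply gets its own mount point and its own identity.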


From their manual

It would not help to recover data that was changed since the last sync, the same as with rsync or any other backup program.
Otherwise you need to run a sync as frequently as possible (see the cron sketch below).
There are other limitations, including the long-running sync while files are still being modified, so it is not as useful for storagenodes as you may think. But I would be glad to hear how you managed it, how you fixed the failure of the one disk, and whether the node(s) survived afterwards.
Also, it looks like you’re losing disk space to parity anyway, just like with a regular parity RAID.
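
To illustrate “as frequently as possible”: the tool isn’t named in this thread, so the entry below uses a placeholder command for whatever parity/sync utility is in use, scheduled hourly from cron:

# hypothetical crontab entry: hourly sync, so at most ~1 hour of changes is unprotected
0 * * * * /usr/local/bin/parity-tool sync >> /var/log/parity-sync.log 2>&1

Even at that frequency, anything changed on a disk between syncs is unrecoverable after a failure, and for a storagenode that still means failed audits for those pieces.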