Node suspended?! What do I do?

davidgoderre · January 11, 2022, 11:20pm

Got it. I'll keep an eye on it. Is the command below what I want to use to check the logs periodically?

docker logs storagenode 2>&1 | grep -E "GET_AUDIT|GET_REPAIR" | grep failed

BrightSilence · January 11, 2022, 11:21pm

Yep, that should do it. Be sure to post if you see any remaining errors.
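For periodic checks it can be easier to see a tally per action than to scroll through every matching line. A minimal sketch, assuming the log format shown in this thread; the function name `count_failed` is made up for illustration, and the container name `storagenode` is taken from the command above:

```shell
#!/bin/sh
# Hypothetical helper: tally failed audit/repair downloads per action.
# Reads storagenode log lines on stdin and prints "<count> <action>" rows.
count_failed() {
    grep -E 'GET_AUDIT|GET_REPAIR' \
        | grep 'failed' \
        | grep -oE 'GET_AUDIT|GET_REPAIR' \
        | sort \
        | uniq -c
}

# Usage (assumes the container is named "storagenode"):
#   docker logs storagenode 2>&1 | count_failed
```

An empty result means no failed audit or repair downloads were found in the log.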

davidgoderre · January 12, 2022, 12:07am

Looks like a lot fewer errors, but I am seeing some new ones come through that look like the ones below. Am I hosed? lol

2022-01-11T23:24:12.730Z	ERROR	piecestore	download failed	{"Piece ID": "4M7ICXOI7RB3DWT2MLQWXIILYCO7NPHCDWCLSIIPNEUQF6ZP3FBQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:545\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-01-11T23:49:18.237Z	ERROR	piecestore	download failed	{"Piece ID": "4N3WLCAEAVGMG6O5RTANVTBOJO6WCE5PKAVQIGNHPVI5GQTT7KLA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:545\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

BrightSilence · January 12, 2022, 12:27am

Unfortunately this does suggest permanent data loss. Your node may survive, but your scores will always fluctuate. It really depends on how much data you lost.
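To get a rough feel for how much is missing, one option is to count the distinct piece IDs that fail with "file does not exist". This is only a sketch under the assumption that the log lines look like the ones posted above; `missing_pieces` is a hypothetical name, and the count only covers pieces that have actually been requested so far, not everything that may be gone:

```shell
#!/bin/sh
# Hypothetical helper: list unique piece IDs reported as "file does not exist".
# Reads storagenode log lines on stdin, prints one piece ID per line.
missing_pieces() {
    grep 'file does not exist' \
        | sed -n 's/.*"Piece ID": "\([A-Z0-9]*\)".*/\1/p' \
        | sort -u
}

# Usage (container name "storagenode" as used earlier in the thread):
#   docker logs storagenode 2>&1 | missing_pieces | wc -l
```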

pietjebell · January 12, 2022, 12:48pm

Aren’t you supposed to run the repair on the partition (e.g. sda1) instead of the drive itself (e.g. sda)? I’m not sure the repair was performed correctly.

At least this was the problem for me a couple of days ago, when a power loss left me with a broken filesystem. I had to use xfs_repair in my case. YMMV, of course.

SGC · January 12, 2022, 3:10pm

I would think it depends on which filesystems are supported by the tool being used for the repair.

sda would be the entire drive, but whatever tool one is using might accept the whole-disk name and still fix it. For example, if you only have one partition containing an XFS filesystem, it might not matter how you write it.

A partition isn’t the same thing as the filesystem it contains. Another option for checking HDDs or SSDs is smartctl, which is good for identifying hardware issues with the disk that can cause the other problems. But like Bright said, HDDs often just produce errors, and that damages the filesystem.

The partition itself is rarely damaged because it’s rarely changed; it only really gets changed when you change the filesystem on it, afaik.

But the nitty-gritty details of how HDDs really work are a rather complex topic.
I don’t think it matters much which tool one uses for the repair, though some high-end tools are significantly better than others, at least for data recovery.
However, data recovery is a science in itself, and the tools are fairly expensive.

If the lost data is critical and recovery is needed, one should disconnect the drive immediately and take it to a professional for the best odds of recovering the data.

HDDs seem to be much more likely to have errors than SSDs, but that’s not really a big surprise, since one is mechanical and the other is solid state.

SGC · January 12, 2022, 3:17pm

You might want to check the SMART data of the HDD to get a sense of whether it has any hardware issues.
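A rough sketch of what that check could look like with smartctl from smartmontools. The device name `/dev/sda` and the helper name `smart_summary` are assumptions for illustration, and the attribute names are the ones smartmontools commonly reports for ATA drives; some drives (and NVMe devices) report different names or a different layout, so treat an empty result as "not reported" rather than "healthy":

```shell
#!/bin/sh
# Hypothetical helper: pick out a few SMART attributes that often hint at a failing disk.
# Reads `smartctl -A` output on stdin; in the ATA attribute table,
# column 2 is the attribute name and column 10 is RAW_VALUE.
smart_summary() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2, $10 }'
}

# Usage (needs root; /dev/sda is an assumed device name):
#   sudo smartctl -H /dev/sda                   # overall health self-assessment
#   sudo smartctl -A /dev/sda | smart_summary   # raw values of the attributes above
```

Non-zero raw values for these attributes are usually worth a closer look, although how to interpret them varies by manufacturer.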

davidgoderre · January 13, 2022, 4:40pm

This all looks to have checked out, so I will let the node run for a bit and see what happens.

If my audit scores fluctuate, how bad is that?

SGC · January 13, 2022, 5:52pm

DQ is at a 60% audit score; anything higher than that is generally fine. Of course you want to see it as high as possible, but there’s not much one can do to change it. It’s just a gauge of how well the node is doing.

By generally fine I mean I’m unaware of any real detrimental effects.

davidgoderre · January 13, 2022, 6:03pm

Got it, I appreciate the info. I have audits around 90-94% now, which seems OK, but there is a yellow exclamation point next to a couple of satellites. I will monitor and see if it drops toward 60%. Maybe I just had a power surge that messed some stuff up.