One of my nodes is starting to fail

I really don’t know why, but suddenly this month one of my nodes has started to lose points in the suspension and online parameters on almost all the satellites.

so I ran this script (Script: Calculate Success Rates for Audit, Download, Upload, Repair) and here are the results:

as you can see, almost 26% of downloads are failing; add about 3% canceled, and that means roughly 70% of downloads are being successful.

also, this node runs on the same Raspberry Pi as another node that has had 0 issues, and this only started happening this month, so I’m confused and have no clue what could be making that many downloads fail

Search your log for "download failed" and post the results. You can find failed audits by looking for 2 keywords in the same log line: GET_AUDIT and failed.
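If you run the default Docker logging, the search is a couple of greps; the container name below is just an example, and the demo filters a small saved log file so you can see what each command matches:

```shell
#!/bin/sh
# With default Docker logs you would pipe straight from the container, e.g.:
#   docker logs storagenode 2>&1 | grep 'download failed'
# Demo against a small saved log file:
cat > node.log <<'EOF'
2020-11-26T05:29:33.704Z ERROR piecestore download failed {"Action": "GET"}
2020-11-26T05:30:01.000Z ERROR piecestore download failed {"Action": "GET_AUDIT"}
2020-11-26T05:30:05.000Z INFO piecestore download started {"Action": "GET"}
EOF

# all failed downloads
grep 'download failed' node.log

# failed audits: both keywords must appear on the same line
grep 'GET_AUDIT' node.log | grep 'failed'
```

The second pipeline is the "2 keywords in the same log line" trick: the first grep keeps only GET_AUDIT lines, the second keeps only those that also failed.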

You might want to check your hard drive - it may be failing:

sudo badblocks -v /dev/(whatever-your-device-is)

This will take quite a long time, probably several hours.


run some diagnostics and don’t forget to check your cables… i would start by shutting the node down, so you don’t create more damage while you figure out what’s wrong…

i would start by checking the SMART stats of the HDD; if those say it’s going bad, then try to copy the data to a good drive and hope for the best.
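A minimal sketch of that SMART check, assuming smartmontools is installed and the drive shows up as /dev/sda (adjust the device name to yours):

```shell
# overall SMART verdict (PASSED / FAILED)
sudo smartctl -H /dev/sda

# attributes worth watching on a failing drive
sudo smartctl -A /dev/sda | grep -Ei 'reallocated|pending|uncorrect'
```

Non-zero reallocated or pending sector counts are the usual early warning that it’s time to copy the data off.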

@camulodunum: If you shut down your node, be careful not to remove your docker container while it’s stopped if you’re using the default docker logs; otherwise they will be lost and you will not be able to analyse them.

Just shut down the node with:

docker stop -t 300 YOUR_NODE_NAME
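If you do end up needing to recreate the container later, the logs can be saved to a file first; a sketch, substituting your container name:

```shell
# keep a copy of the container logs before any `docker rm`
docker logs YOUR_NODE_NAME > storagenode-logs.txt 2>&1
docker stop -t 300 YOUR_NODE_NAME
```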

here I have the logs.

seems to be some kind of problem that is making the node restart all the time

2020-11-26T05:29:24.085Z INFO Configuration loaded {"Location": "/app/config/config.yaml"}
2020-11-26T05:29:24.088Z INFO Operator email {"Address": ""}
2020-11-26T05:29:24.089Z INFO Operator wallet {"Address": "***********"}
2020-11-26T05:29:24.971Z INFO Telemetry enabled
2020-11-26T05:29:25.033Z INFO db.migration Database Version {"version": 46}
2020-11-26T05:29:26.543Z INFO preflight:localtime start checking local system clock with trusted satellites' system clock.
2020-11-26T05:29:27.532Z INFO preflight:localtime local system clock is in sync with trusted satellites' system clock.
2020-11-26T05:29:27.534Z INFO bandwidth Performing bandwidth usage rollups
2020-11-26T05:29:27.537Z INFO Node ********************************** started
2020-11-26T05:29:27.537Z INFO Public server started on [::]:****
2020-11-26T05:29:27.537Z INFO Private server started on ******
2020-11-26T05:29:27.539Z INFO trust Scheduling next refresh {"after": "7h35m35.583594248s"}
2020-11-26T05:29:29.909Z INFO piecestore upload started {"Piece ID": "RU555EFBWDQX2PB5HTDUFNEFSTZD6V4FG6BPXAQVUZKSFY6GEKBA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT_REPAIR", "Available Space": 1165370009856}
2020-11-26T05:29:33.515Z INFO piecestore download started {"Piece ID": "4NSU2V4R6KAJNV77PM4DPYFOFKRUVHSCQKUFWMWDQV7BUKE4LDKQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET"}
2020-11-26T05:29:33.704Z ERROR piecestore download failed {"Piece ID": "4NSU2V4R6KAJNV77PM4DPYFOFKRUVHSCQKUFWMWDQV7BUKE4LDKQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "error": "untrusted: unable to get signee: trust: rpc: dial tcp 35.194.133.253:7777: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: dial tcp 35.194.133.253:7777: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:462\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2020-11-26T05:29:40.912Z WARN ordersfilestore Corrupted order detected in orders file {"error": "ordersfile corrupt entry: proto: pb.Order: illegal tag 0 (wire type 0)", "errorVerbose": "ordersfile corrupt entry: proto: pb.Order: illegal tag 0 (wire type 0)\n\tstorj.io/storj/storagenode/orders/ordersfile.(*fileV0).ReadOne:115\n\tstorj.io/storj/storagenode/orders.(*FileStore).ListUnsentBySatellite.func1:239\n\tpath/filepath.walk:360\n\tpath/filepath.walk:384\n\tpath/filepath.Walk:406\n\tstorj.io/storj/storagenode/orders.(*FileStore).ListUnsentBySatellite:193\n\tstorj.io/storj/storagenode/orders.(*Service).sendOrdersFromFileStore:389\n\tstorj.io/storj/storagenode/orders.(*Service).SendOrders:183\n\tstorj.io/storj/storagenode/orders.(*Service).Run.func1:134\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range


mmmmm.

weird, since my node is running on a 64-bit ARM system

Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-1022-raspi aarch64)

:thinking:

Following the source files reported in the log that you attached, I don’t find the same path that I found in my response to the other thread, nor can I see why it’s panicking, given that it’s returning an error.

I assumed that you are running v1.16.1; is that a correct assumption?

I think he does run v1.16.1, unless I am misreading the relation between database version and node version, as v1.15.3 had version 45.

yeah, 1.16.1 with watchtower working.

at least for now, moving the files seems to have solved the problem (the node has now run 2 h 36 min without restarts), but I can’t be sure until some days have passed.

the weird thing about this is that I have 2 nodes running on the same Raspberry Pi, and only 1 of them has had these issues

yeah, 1.16.1 with watchtower working

My apologies, I was wrong: the source path is the same. I was confused because the panic doesn’t come from the make function, as I thought it did when I investigated the case in the other post, and I didn’t compare both log files line by line.

On the other hand, I’ve found that this could happen even on a 64-bit architecture, but the cause is different, although the crash comes from the same part.

While on a 32-bit architecture it could be caused by a negative length being passed to the make function, on 64 bits that cannot happen; what can happen on both architectures is that if the length is a very big number, the make function panics with panic: runtime error: makeslice: len out of range

This panic is different from the one you get with a number which isn’t that big but still exceeds the maximum memory available in the system, which is fatal error: runtime: out of memory

I can say this now because I’ve found https://github.com/golang/go/issues/38673 and I tried a minimal main file that, when executed, produced both of the different panics mentioned.

The good thing is that v1.17.4 has a commit to fix the problem that could only happen on 32-bit architectures, and later on there was another commit that limits the number to a reasonable size (https://github.com/storj/storj/commit/41d86c098576922afaf8002b060ba83f2b5fd802), which also landed in v1.17.4. Hence, v1.17.4 should fix this issue too.

at least for now, moving the files seems to have solved the problem (the node has now run 2 h 36 min without restarts), but I can’t be sure until some days have passed.

You are safe as long as no other corrupted order file is newly created; if that happens, you have to proceed as before until your node gets updated to v1.17.4.
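For reference, the workaround of moving the order files aside could look like the sketch below. The paths here are demo placeholders: on a real node, ORDERS would be the unsent-orders folder under your node’s storage/config mount, and you would pick a backup location on the same disk.

```shell
#!/bin/sh
# Sketch: move unsent order files out of the way so a corrupted one can no
# longer crash the node on startup. ORDERS points at a demo directory here;
# on a real node it would be the orders/unsent folder of your node.
ORDERS=${ORDERS:-./demo-orders/unsent}
BACKUP=${BACKUP:-./demo-orders-backup}
mkdir -p "$ORDERS" "$BACKUP"

# demo file standing in for a real orders file
touch "$ORDERS/0000000000000001.v0"

# move everything aside, then restart the node
mv "$ORDERS"/* "$BACKUP"/
ls "$BACKUP"
```

Moving rather than deleting keeps the files around in case any of them turn out to be intact and worth inspecting later.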

Note we are rolling out v1.17.4, and Docker images should be published in less than a week from now.

the weird thing about this is that I have 2 nodes running on the same Raspberry Pi, and only 1 of them has had these issues

It isn’t weird at all. Basically, one of your nodes has created a corrupted order file while the other hasn’t.
I have 2 nodes running and I’ve been lucky that they have never had this problem.
