One of my nodes is starting to fail

I really don’t know why, but suddenly this month one of my nodes has started to lose points in the suspension and online parameters on almost all the satellites.

so I ran this script (Script: Calculate Success Rates for Audit, Download, Upload, Repair) and here are the results:

as you can see, almost 26% of downloads are failing; add about 3% canceled, and that means roughly 70% of downloads are being successful.

also, this node runs on the same Raspberry Pi as another node that has had 0 issues, and this only started happening this month, so I’m confused and have no clue what could be making that many downloads fail

Search your log for "download failed" and post the results. You can find failed audits by looking for 2 keywords in the same log line: GET_AUDIT and failed.
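If you run the default Docker logging, the search is a couple of greps; the container name below is just an example, and the demo filters a small saved log file so you can see what each command matches:

```shell
#!/bin/sh
# With default Docker logs you would pipe straight from the container, e.g.:
#   docker logs storagenode 2>&1 | grep 'download failed'
# Demo against a small saved log file:
cat > node.log <<'EOF'
2020-11-26T05:29:33.704Z ERROR piecestore download failed {"Action": "GET"}
2020-11-26T05:30:01.000Z ERROR piecestore download failed {"Action": "GET_AUDIT"}
2020-11-26T05:30:05.000Z INFO piecestore download started {"Action": "GET"}
EOF

# all failed downloads
grep 'download failed' node.log

# failed audits: both keywords must appear on the same line
grep 'GET_AUDIT' node.log | grep 'failed'
```

The second pipeline is the "2 keywords in the same log line" trick: the first grep keeps only GET_AUDIT lines, the second keeps only those that also failed.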

You might want to check your hard drive - it may be failing:

sudo badblocks -v /dev/(whatever-your-device-is)

This will take quite a long time, probably several hours.


run some diagnostics and don’t forget to check your cables… i would start by shutting the node down, so you don’t create more damage while you figure out what’s wrong…

i would start by checking the SMART stats of the HDD; if those say it’s going bad, then try to copy the data to a good drive and hope for the best.
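A minimal sketch of that SMART check, assuming smartmontools is installed and the drive shows up as /dev/sda (adjust the device name to yours):

```shell
# overall SMART verdict (PASSED / FAILED)
sudo smartctl -H /dev/sda

# attributes worth watching on a failing drive
sudo smartctl -A /dev/sda | grep -Ei 'reallocated|pending|uncorrect'
```

Non-zero reallocated or pending sector counts are the usual early warning that it’s time to copy the data off.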

@camulodunum: If you shut down your node, be careful not to remove your docker container while it’s stopped if you’re using the default docker logs; otherwise they will be lost and you will not be able to analyse them.

Just shut down the node with:

docker stop -t 300 YOUR_NODE_NAME
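If you do end up needing to recreate the container later, the logs can be saved to a file first; a sketch, substituting your container name:

```shell
# keep a copy of the container logs before any `docker rm`
docker logs YOUR_NODE_NAME > storagenode-logs.txt 2>&1
docker stop -t 300 YOUR_NODE_NAME
```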

here I have the logs.

seems to be some kind of problem that is making the node restart all the time

2020-11-26T05:29:24.085Z INFO Configuration loaded {"Location": "/app/config/config.yaml"}
2020-11-26T05:29:24.088Z INFO Operator email {"Address": ""}
2020-11-26T05:29:24.089Z INFO Operator wallet {"Address": "***********"}
2020-11-26T05:29:24.971Z INFO Telemetry enabled
2020-11-26T05:29:25.033Z INFO db.migration Database Version {"version": 46}
2020-11-26T05:29:26.543Z INFO preflight:localtime start checking local system clock with trusted satellites' system clock.
2020-11-26T05:29:27.532Z INFO preflight:localtime local system clock is in sync with trusted satellites' system clock.
2020-11-26T05:29:27.534Z INFO bandwidth Performing bandwidth usage rollups
2020-11-26T05:29:27.537Z INFO Node ********************************** started
2020-11-26T05:29:27.537Z INFO Public server started on [::]:****
2020-11-26T05:29:27.537Z INFO Private server started on ******
2020-11-26T05:29:27.539Z INFO trust Scheduling next refresh {"after": "7h35m35.583594248s"}
2020-11-26T05:29:29.909Z INFO piecestore upload started {"Piece ID": "RU555EFBWDQX2PB5HTDUFNEFSTZD6V4FG6BPXAQVUZKSFY6GEKBA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT_REPAIR", "Available Space": 1165370009856}
2020-11-26T05:29:33.515Z INFO piecestore download started {"Piece ID": "4NSU2V4R6KAJNV77PM4DPYFOFKRUVHSCQKUFWMWDQV7BUKE4LDKQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET"}
2020-11-26T05:29:33.704Z ERROR piecestore download failed {"Piece ID": "4NSU2V4R6KAJNV77PM4DPYFOFKRUVHSCQKUFWMWDQV7BUKE4LDKQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET", "error": "untrusted: unable to get signee: trust: rpc: dial tcp 35.194.133.253:7777: operation was canceled", "errorVerbose": "untrusted: unable to get signee: trust: rpc: dial tcp 35.194.133.253:7777: operation was canceled\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).VerifyOrderLimitSignature:140\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).verifyOrderLimit:62\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:462\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2020-11-26T05:29:40.912Z WARN ordersfilestore Corrupted order detected in orders file {"error": "ordersfile corrupt entry: proto: pb.Order: illegal tag 0 (wire type 0)", "errorVerbose": "ordersfile corrupt entry: proto: pb.Order: illegal tag 0 (wire type 0)\n\tstorj.io/storj/storagenode/orders/ordersfile.(*fileV0).ReadOne:115\n\tstorj.io/storj/storagenode/orders.(*FileStore).ListUnsentBySatellite.func1:239\n\tpath/filepath.walk:360\n\tpath/filepath.walk:384\n\tpath/filepath.Walk:406\n\tstorj.io/storj/storagenode/orders.(*FileStore).ListUnsentBySatellite:193\n\tstorj.io/storj/storagenode/orders.(*Service).sendOrdersFromFileStore:389\n\tstorj.io/storj/storagenode/orders.(*Service).SendOrders:183\n\tstorj.io/storj/storagenode/orders.(*Service).Run.func1:134\n\tstorj.io/common/sync2.(*Cycle).Run:92\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range [recovered]
panic: runtime error: makeslice: len out of range


mmmmm.

weird, since my node is running on a 64-bit ARM system

Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-1022-raspi aarch64)

:thinking:

Following the source files reported in the log that you attached, I don’t find the same path that I found in my response to the other thread, nor can I see why it’s panicking, given that it’s returning an error.

I assumed that you are running v1.16.1; is that a correct assumption?

I think he does run v1.16.1, unless I am misreading the relation between database version and node version, as v1.15.3 had version 45.

yeah, 1.16.1 with watchtower working.

at least for now, moving the files seems to have solved the problem (the node has now run 2 h 36 min without restarts), but I can’t be sure until some days have passed.

the weird thing about this is that I have 2 nodes running on the same Raspberry Pi, and only 1 of them has had these issues

yeah, 1.16.1 with watchtower working

My apologies, I was wrong: the source path is the same. I was confused because the panic doesn’t come from the make function, as I thought it did when I investigated the case in the other post, and I didn’t compare both log files line by line.

On the other hand, I’ve found that this could happen even on a 64-bit architecture, but the cause is different, although the crash comes from the same part.

While on a 32-bit architecture it could be caused by a negative length being passed to the make function, on 64 bits that cannot happen; what can happen on both architectures is that if the length is a very big number, the make function panics with panic: runtime error: makeslice: len out of range

This panic is different from the one you get with a number which isn’t that big but still exceeds the maximum memory available in the system, which is fatal error: runtime: out of memory

I can say this now because I’ve found https://github.com/golang/go/issues/38673 and I tried a minimal main file that, when executed, produced both of the different panics mentioned.

The good thing is that v1.17.4 has a commit to fix the problem that could only happen on 32-bit architectures, and later on there was another commit that limits the number to a reasonable size (https://github.com/storj/storj/commit/41d86c098576922afaf8002b060ba83f2b5fd802), which also landed in v1.17.4. Hence, v1.17.4 should fix this issue too.

at least for now, moving the files seems to have solved the problem (the node has now run 2 h 36 min without restarts), but I can’t be sure until some days have passed.

You are safe as long as no other corrupted order file is newly created; if that happens, you have to proceed as before until your node gets updated to v1.17.4.
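For reference, the workaround of moving the order files aside could look like the sketch below. The paths here are demo placeholders: on a real node, ORDERS would be the unsent-orders folder under your node’s storage/config mount, and you would pick a backup location on the same disk.

```shell
#!/bin/sh
# Sketch: move unsent order files out of the way so a corrupted one can no
# longer crash the node on startup. ORDERS points at a demo directory here;
# on a real node it would be the orders/unsent folder of your node.
ORDERS=${ORDERS:-./demo-orders/unsent}
BACKUP=${BACKUP:-./demo-orders-backup}
mkdir -p "$ORDERS" "$BACKUP"

# demo file standing in for a real orders file
touch "$ORDERS/0000000000000001.v0"

# move everything aside, then restart the node
mv "$ORDERS"/* "$BACKUP"/
ls "$BACKUP"
```

Moving rather than deleting keeps the files around in case any of them turn out to be intact and worth inspecting later.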

Note we are rolling out v1.17.4, and Docker images should be published in less than a week from now.

the weird thing about this is that I have 2 nodes running on the same Raspberry Pi, and only 1 of them has had these issues

It isn’t weird at all. Basically, one of your nodes has created a corrupted order file while the other hasn’t.
I have 2 nodes running and I’ve been lucky that they have never had this problem.
