Node disqualified from Saltlake satellite

Hi.

I was running a node (1 TB) for a while, but I had to stop it for a few months.

A few weeks ago I restarted it, but then I received emails telling me that my node had been disqualified from a satellite.

I thought it was because of my offline period.

So yesterday I started from scratch: formatted the HDD, generated a new identity, and ran a new node.

The only thing I kept from the old node is my email address, which I used to generate the token.

My node looks good, it is storing 1.32 GB right now, but this morning I received an email alerting me that my node is disqualified from the Saltlake satellite.

I ran this script: Script for Audits stat by satellites - only overall audits

The result:

"1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"
{
  "totalCount": 0,
  "successCount": 0,
  "alpha": 1,
  "beta": 0,
  "unknownAlpha": 1,
  "unknownBeta": 0,
  "score": 1,
  "unknownScore": 1
}
"121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"
{
  "totalCount": 0,
  "successCount": 0,
  "alpha": 1,
  "beta": 0,
  "unknownAlpha": 1,
  "unknownBeta": 0,
  "score": 1,
  "unknownScore": 1
}
"12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"
{
  "totalCount": 0,
  "successCount": 0,
  "alpha": 1,
  "beta": 0,
  "unknownAlpha": 1,
  "unknownBeta": 0,
  "score": 1,
  "unknownScore": 1
}
"12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"
{
  "totalCount": 0,
  "successCount": 0,
  "alpha": 1,
  "beta": 0,
  "unknownAlpha": 1,
  "unknownBeta": 0,
  "score": 1,
  "unknownScore": 1
}
"12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"
{
  "totalCount": 0,
  "successCount": 0,
  "alpha": 1,
  "beta": 0,
  "unknownAlpha": 1,
  "unknownBeta": 0,
  "score": 1,
  "unknownScore": 1
}

What's wrong with my node?

Thanks.

Can you double-check that you are not using the old identity?

I am not using an old identity, because I started from scratch on a new PC and I didn't keep the old identity files.

But when I look at the logs, I guess I have a network issue:

2020-09-20T07:24:59.671Z ERROR orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB failed to settle orders for satellite {"satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "order: unable to connect to the satellite: rpccompat: dial tcp: lookup europe-north-1.tardigrade.io on 1.0.0.1:53: read udp 172.17.0.2:55305->1.0.0.1:53: i/o timeout", "errorVerbose": "order: unable to connect to the satellite: rpccompat: dial tcp: lookup europe-north-1.tardigrade.io on 1.0.0.1:53: read udp 172.17.0.2:55305->1.0.0.1:53: i/o timeout\n\tstorj.io/storj/storagenode/orders.(*Service).settleWindow:464\n\tstorj.io/storj/storagenode/orders.(*Service).sendOrdersFromFileStore.func1:422\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

But the same satellite is OK a few minutes later:

Are you sure that the node ID in the email and on the current dashboard is the same? Because your audit scores look OK.


It looks like DNS is failing to resolve the hostname (UDP port 53 = DNS). Is DNS working?
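
A quick sanity check is to query the resolver from the error message (1.0.0.1) for the same hostname directly from the host. A minimal sketch, assuming dig is available (the dnsutils package on Debian):

dig @1.0.0.1 europe-north-1.tardigrade.io +short

If that also times out, the problem is between your machine and the resolver rather than inside the node.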

I think that DNS is OK, but I have changed it to use OpenDNS instead of my ISP's resolver.
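
One thing worth verifying (an assumption here is that the node runs on Docker's default bridge network, which the 172.17.0.2 address in the log suggests): the container gets its own copy of the host's /etc/resolv.conf, so it may still be pointing at the old resolvers. A quick way to see what a fresh container on that bridge resolves with:

docker run --rm busybox cat /etc/resolv.conf

Resolvers can also be pinned for all containers via the "dns" key in /etc/docker/daemon.json (followed by a restart of the Docker daemon), or per container with the --dns flag on docker run.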

As far as I know, only failed audits can disqualify your node.
Any entries in your logs containing both 'GET_AUDIT' and 'failed' on the same line?
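
For a Docker node, something like this should list them (a sketch, assuming the container is named storagenode and logging goes to the container output rather than a file):

docker logs storagenode 2>&1 | grep GET_AUDIT | grep failed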

Yes:

2020-09-20T02:08:01.605Z INFO piecestore download started {"Piece ID": "U2DUA6PFZNEIR4XQ3E5JBTPPAPD6WF4MXGYBNP7GEJ3XOUONSPXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT"}
2020-09-20T02:08:01.621Z ERROR piecestore download failed {"Piece ID": "U2DUA6PFZNEIR4XQ3E5JBTPPAPD6WF4MXGYBNP7GEJ3XOUONSPXA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:505\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2020-09-20T06:38:03.752Z INFO piecestore download started {"Piece ID": "Z7CNIPMPLALHNIPIGIUUTFIB4XYAC2C7WXZLQISCO6G3KDZG2WWA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT"}
2020-09-20T06:38:03.769Z ERROR piecestore download failed {"Piece ID": "Z7CNIPMPLALHNIPIGIUUTFIB4XYAC2C7WXZLQISCO6G3KDZG2WWA", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:505\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
2020-09-20T10:42:46.460Z INFO piecestore download started {"Piece ID": "FY6RN52C63V5X7R2EIKZJ6CM6XW6DBIP5BVR34GYPZUJP53OCT5A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
2020-09-20T10:42:46.462Z ERROR piecestore download failed {"Piece ID": "FY6RN52C63V5X7R2EIKZJ6CM6XW6DBIP5BVR34GYPZUJP53OCT5A", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:505\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}

Looks like it cannot get access to the data, which could be caused by a wide range of reasons: USB power-saving modes on an external USB HDD, unmounting of the drive, a reboot of the computer while the drive isn't online, bad cables, super high latency (maybe), an external HDD overheating because it isn't built for 24/7 operation, the HDD going into power-save mode, the OS spinning down the disk to save power... I could go on.

Saltlake is most likely just the most active satellite and thus the first to DQ you; the rest will follow if you don't figure out what causes your host to lose access to the data.

Thanks for all the ideas.

The data is stored on an internal HDD and the PC runs Debian, so I don't think the HDD is in power-save mode.

As the PC is also used to mine ETH, it never reboots.

I will check the cables.

Then it could be that the disk isn't mounted, or the /dev/sdX device name (not sure what it's actually called) may drift between reboots, making the mount unable to give access to the disk. For my ZFS setup I use /dev/disk/by-id,
but you can also use things like the GPT partition name; I think it goes something like /dev/gpt/…
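
One way to make the mount survive device-name drift is to reference the filesystem UUID (or a /dev/disk/by-id path) in /etc/fstab instead of /dev/sdX. A minimal sketch, assuming an ext4 data partition on /dev/sda1 and a hypothetical mount point /mnt/storj:

blkid /dev/sda1                    # prints the filesystem UUID
ls -l /dev/disk/by-id/             # stable names that don't drift between reboots

# /etc/fstab entry (the UUID below is a placeholder; use the value blkid prints)
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/storj  ext4  defaults  0  2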

You could of course just let it run for a while, wait for it to fail, and then troubleshoot it when you can actually see what's wrong; that is assuming it is an intermittent issue.

The disk could also be bad, or an SMR drive that runs into high-latency issues at times after running for extended periods.
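
If a failing disk is a suspicion, SMART data is a cheap check; a sketch, assuming the data disk is /dev/sda and smartmontools is installed (apt install smartmontools on Debian):

smartctl -a /dev/sda         # full SMART report; watch Reallocated_Sector_Ct and Current_Pending_Sector
smartctl -t short /dev/sda   # start a short self-test; the result appears in the -a output once it finishes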

The HDD was set up to spin down after 1270 seconds of idle time (about 21 minutes).

I have disabled spin-down:

hdparm -B255 /dev/sda

/dev/sda:
setting Advanced Power Management level to disabled
APM_level = off
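
Note that an hdparm setting applied on the command line does not survive a reboot by default; on Debian it can be made persistent in /etc/hdparm.conf. A sketch, assuming the data disk stays /dev/sda (the option names mirror the flags: apm corresponds to -B, spindown_time to -S):

/dev/sda {
    apm = 255
    spindown_time = 0
}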

If it doesn't fix my issue, I will shut down the node and make a new one on my NAS with a fresh HDD.

Well, I think this would trigger 'database locked' errors, leading to suspension at worst, not disqualification.

This is very unlikely to cause any issues other than failing to send a file in time when competing with other nodes that were selected for a user download (you're simply going to lose that race).
Besides, it must be really rare to see no activity at all for 20+ minutes; I'm not sure that happens even on an unvetted node, but I could be wrong.

AFAIK, when a satellite audits a file, it waits long enough for the disk to spin up, unless your disk is very special and takes several minutes to start up :wink:


I think it's 5 minutes of waiting time.
