Disqualified question

UofM_Matt · October 1, 2020, 5:24am

Greetings,

I’ve got two nodes, and saltlake is about to disqualify my second node (it has already disqualified my first) because of audits. See my logs below. I’m confused as to why it can’t communicate with saltlake?

I have the error logs, but can’t post because I’m a new user limited to 2 links? Not sure what links are in the error log. I’ve tried to parse it a bit. The gist is that it says it can’t ping the satellite server and then gives what looks like errors in code. I can ping saltlake no problem from my server. How can I post my logs?

Ping stats:

user@host:~$ ping 78.94.240.189
PING 78.94.240.189 (78.94.240.189) 56(84) bytes of data.
64 bytes from 78.94.240.189: icmp_seq=1 ttl=43 time=220 ms
64 bytes from 78.94.240.189: icmp_seq=2 ttl=43 time=219 ms
64 bytes from 78.94.240.189: icmp_seq=3 ttl=43 time=213 ms
64 bytes from 78.94.240.189: icmp_seq=4 ttl=43 time=206 ms
64 bytes from 78.94.240.189: icmp_seq=5 ttl=43 time=221 ms
^C
--- 78.94.240.189 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 206.381/216.176/221.194/5.514 ms

Docker command, running on ubuntu 18.04 LTS:

docker run -d --restart unless-stopped --stop-timeout 300 -p 28967:28967 -p 127.0.0.1:14002:14002 -e WALLET="0x0000000000000000000000000" -e EMAIL="validemail" -e ADDRESS="resolveableaddress:28967" -e STORAGE="4TB" --mount type=bind,source="/var/data/identity",destination=/app/identity --mount type=bind,source="/var/data/storj",destination=/app/config --name storagenode storjlabs/storagenode:latest

Pentium100 · October 1, 2020, 5:49am

Audit fail means you lost data, not that the node cannot communicate with the satellite.

Check your logs for lines with both “AUDIT” and “fail” in them.
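
A minimal sketch of that search, assuming the node log has been saved to a text file or list of lines. The function name `audit_failures` and the sample lines are hypothetical (the samples are abbreviated, made-up entries shaped like the ones later in this thread):

```python
def audit_failures(lines, keywords=("AUDIT", "fail")):
    """Return the log lines that contain every keyword (case-sensitive)."""
    return [line for line in lines if all(k in line for k in keywords)]


# Made-up, abbreviated sample lines for illustration only.
sample = [
    '2020-09-28T10:30:25.319Z INFO piecestore uploaded {"Action": "PUT"}',
    '2020-10-08T07:00:00.000Z ERROR piecestore download failed '
    '{"Action": "GET_AUDIT", "error": "file does not exist"}',
    '2020-10-08T07:00:01.000Z INFO piecestore downloaded {"Action": "GET_AUDIT"}',
]

matches = audit_failures(sample)
for line in matches:
    print(line)
```

For a Docker node, the input could be a file written with something like `docker logs storagenode > node.log 2>&1` and then read with `open("node.log")`. Successful audits also contain “AUDIT”, which is why both keywords are required.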

nerdatwork · October 1, 2020, 5:53am

Welcome to the forum @UofM_Matt!

If you can’t post the log, you can take a screenshot of it and post that here.

Can you show the contents of your identity folder (/var/data/identity)?

baker · October 1, 2020, 1:54pm

Some of the log entries contain URLs or what looks to the forum software as URLs. You should be able to get around this if you post the logs and place three backticks with “text” above the block of logs (```text) and three backticks below the logs (```)

For example, log lines wrapped that way are displayed like this:

2020-09-28T10:30:25.319Z	INFO	piecestore	uploaded	{"Piece ID": "LE3CJVJDIBAY5LGOWQ4YOV6ZHCS6CALSY5XKAQU6QIXWQTPG5LRA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "PUT"}
2020-09-28T10:30:28.031Z	INFO	piecestore	download started	{"Piece ID": "LE3CJVJDIBAY5LGOWQ4YOV6ZHCS6CALSY5XKAQU6QIXWQTPG5LRA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET"}
2020-09-28T10:30:28.371Z	INFO	piecestore	downloaded	{"Piece ID": "LE3CJVJDIBAY5LGOWQ4YOV6ZHCS6CALSY5XKAQU6QIXWQTPG5LRA", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET"}
2020-09-28T10:30:30.332Z	INFO	piecestore	uploaded	{"Piece ID": "HSNCLW6YLQZDLGGY3PRXTGTYQE6JL2DR3MD7W5JRNHHMURP7OB2Q", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT_REPAIR"}

UofM_Matt · October 1, 2020, 5:46pm

Thanks for the tips. Here is the directory contents of /var/data/identity:

4 drwxr--r-- 3 mgar mgar 4096 Sep 27 07:52 .
4 drwxrwxr-x+ 8 root root 4096 Sep 27 07:52 ..
24 -rw------- 1 mgar mgar 32768 Sep 27 07:52 revocations.db
4 drwxr--r-- 2 mgar mgar 4096 Sep 27 07:52 storagenode

and the error from earlier:

2020-10-01T16:39:16.653Z        ERROR   contact:service ping satellite failed   {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "attempts": 6, "error": "ping satellite error: rpccompat: dial tcp 78.94.240.189:7777: connect: connection refused", "errorVerbose": "ping satellite error: rpccompat: dial tcp 78.94.240.189:7777: connect: connection refused\n\tstorj.io/common/rpc.Dialer.dialTransport:211\n\tstorj.io/common/rpc.Dialer.dial:188\n\tstorj.io/common/rpc.Dialer.DialNodeURL:148\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

I didn't find any audit failures, and the mounted directory has been available the whole time. Thoughts?
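
Note that the ICMP ping shown earlier only proves the host answers; the error above is a refused TCP connection on port 7777. A rough sketch of a TCP-level check follows. The helper name `tcp_port_open` and the 5-second default timeout are assumptions for illustration:

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, unlike ping.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False
```

Calling `tcp_port_open("78.94.240.189", 7777)` with the address and port from the error message would show whether the refusal can be reproduced from the same machine. A `False` result here, alongside a working ping, points at the remote port rather than at general connectivity.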

nerdatwork · October 1, 2020, 5:58pm

The identity mount source in your docker command should be /var/data/identity/storagenode

Stop your node and edit the command.

leHans · October 6, 2020, 8:56pm

Hi @UofM_Matt, did you manage to find out the cause? I got disqualified on 2 satellites yesterday after my node had been up for over a year. It was quite unexpected, and the only “error” I saw was similar to yours: ping failed. That seems like just a temporary communication issue that could be caused by any number of things.

Alexey · October 6, 2020, 9:01pm

You can be disqualified only for lost or inaccessible data (the node gets 4 attempts, each with a 5-minute timeout, to return a few KB of a piece for an audit).

leHans · October 6, 2020, 9:17pm

I am quite confident I did not lose any data. That leaves only the second option, although I am baffled as to how. (Suddenly, 2 satellites disqualify my node for timeouts on the same day??)

Can I manually force a re-audit or otherwise prove that the data is indeed there? I would be happy to do this for the satellites that disqualified my node.

Alexey · October 6, 2020, 9:25pm

If your audit score is below 0.6 (60% on the dashboard), the disqualification is permanent and not reversible.
I can ask to have the satellite logs checked for your NodeID (it could take some time to get an answer, though).
If you have logs from that time, please search them for lines that contain both GET_AUDIT and failed.

leHans · October 8, 2020, 7:58am

On two satellites my audit score is now 0.59xxx, so that checks out as a reason. I checked my logs and can’t seem to find any lines with GET_AUDIT and failed. What I do see are entries like this:

ERROR orders.12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB failed to settle orders for satellite {"satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "error": "order: CloseAndRecv settlement agreements returned an error: context canceled", "errorVerbose": "order: CloseAndRecv settlement agreements returned an error: context canceled\n\tstorj.io/storj/storagenode/orders.(*Service).settleWindow:491\n\tstorj.io/storj/storagenode/orders.(*Service).sendOrdersFromFileStore.func1:422\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

And some ping failures:

ERROR contact:service ping satellite failed {"Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "attempts": 11, "error": "ping satellite error: rpc: context deadline exceeded", "errorVerbose": "ping satellite error: rpc: context deadline exceeded\n\tstorj.io/common/rpc.Dialer.dialTransport:211\n\tstorj.io/common/rpc.Dialer.dial:188\n\tstorj.io/common/rpc.Dialer.DialNodeURL:148\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:124\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:95\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/common/sync2.(*Cycle).Start.func1:71\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Or perhaps this, although here it looks rather like the download itself failed?

ERROR piecestore download failed {"Piece ID": "2BQBKZKJOKM4PLHMAQZQSGF7AV3XQQEIZJQD7A2WTRZKZT37DDHA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:74\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:505\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:1004\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:107\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:56\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:111\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:62\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:99\n\tstorj.io/drpc/drpcctx.(*Tracker).track:51"}
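
Entries like the last one can also be tallied rather than read one by one. The sketch below assumes each log line ends in a JSON object, as the lines in this thread do; the function names are hypothetical, and the sample line is the entry above with its `errorVerbose` field left out for brevity:

```python
import json
from collections import Counter


def parse_fields(line):
    """Return the JSON payload at the end of a log line as a dict, or None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def count_audit_errors(lines):
    """Count failed GET_AUDIT downloads per (satellite ID, error message)."""
    counts = Counter()
    for line in lines:
        if "download failed" not in line:
            continue
        fields = parse_fields(line)
        if fields and fields.get("Action") == "GET_AUDIT":
            counts[(fields.get("Satellite ID"), fields.get("error"))] += 1
    return counts


# The GET_AUDIT entry from this post, shortened (errorVerbose omitted).
line = ('ERROR piecestore download failed {"Piece ID": '
        '"2BQBKZKJOKM4PLHMAQZQSGF7AV3XQQEIZJQD7A2WTRZKZT37DDHA", '
        '"Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", '
        '"Action": "GET_AUDIT", "error": "file does not exist"}')

for (satellite, error), n in count_audit_errors([line]).items():
    print(satellite, error, n)
```

Grouping by satellite and error message makes it easier to see whether the failures are all “file does not exist” and whether they cluster on the satellites that disqualified the node.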

littleskunk · October 8, 2020, 9:34am

The GET_AUDIT download failure (“file does not exist”) is what gets you disqualified.

nerdatwork · October 8, 2020, 9:44am

The order settlement error is harmless: settlement will be retried every 5 minutes until it succeeds.

Ignore the ping error, as that satellite has shut down.

The GET_AUDIT error means you are failing audits. Check your disk for errors.

leHans · October 8, 2020, 10:50am

I just checked the disk. SMART data does not report anything, and running a Windows file system check did not return any errors either. (I presume that a single bad sector on the drive, for example, would cause the node to fail the audit of a single piece. If an audit of a new, random piece is then requested, I would find it extremely unlikely for a bad sector to sit exactly where that piece is stored as well.)

Is there anything else that can be checked?

Edit: Can I check the history of my audit score? (I would like to see whether it dropped slowly or suddenly on the satellites in question.)
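
The thread itself only states the 0.6 threshold. To get a feel for how fast a score could fall, here is a rough sketch under an assumed beta-reputation-style update with a forgetting factor. The update rule and the values `lam=0.95` and `weight=1.0` are assumptions for illustration, not something confirmed in this thread:

```python
def consecutive_failures_to_disqualify(lam=0.95, weight=1.0, threshold=0.6):
    """Count consecutive failed audits needed to push a perfect score below threshold.

    Assumed (not from the thread) update rule, applied once per audit:
        alpha = lam * alpha + weight * (1 + v) / 2
        beta  = lam * beta  + weight * (1 - v) / 2
        score = alpha / (alpha + beta)
    with v = +1 for a passed audit and v = -1 for a failed one.
    """
    alpha = weight / (1.0 - lam)  # steady state after a long run of passes
    beta = 0.0
    failures = 0
    while alpha / (alpha + beta) >= threshold:
        alpha = lam * alpha          # a failure adds nothing to alpha
        beta = lam * beta + weight   # a failure adds the full weight to beta
        failures += 1
    return failures


print(consecutive_failures_to_disqualify())
```

Under these assumed values the function returns 10, so a short burst of consecutive failed audits would be enough, while passed audits in between would slow the decline. The real parameters may differ.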

nerdatwork · October 8, 2020, 11:06am

You should check using chkdsk and not sfc.

leHans · October 8, 2020, 12:04pm

Running chkdsk in read-only mode (not fixing errors) also did not find anything.

Windows has scanned the file system and found no problems.
No further action is required.

   7630882 MB total disk space.
   4065268 MB in 3974805 files.
   1605164 KB in 12327 indexes.
         0 KB in bad sectors.
   4292355 KB in use by the system.
     65536 KB occupied by the log file.
   3559854 MB available on disk.

I can run a complete disk surface check, but I doubt that it would give back anything else.

nerdatwork · October 8, 2020, 12:12pm

The intent of running the command is to fix something that may be broken so please run this command as administrator.

leHans · October 9, 2020, 11:53am

The aim here was to see whether there was any error with the disk that could be causing the issue, not to fix it. But for the sake of completeness, I ran an 18h full chkdsk with repair. No surprises: no errors means nothing to repair, and I got back the same result. (Ran as administrator at all times.)

Windows has scanned the file system and found no problems.
No further action is required.

   7630882 MB total disk space.
   4014782 MB in 3979698 files.
   1605164 KB in 12327 indexes.
         0 KB in bad sectors.
   4297219 KB in use by the system.
     65536 KB occupied by the log file.
   3610336 MB available on disk.

In summary:

The drive itself is 100% fine, healthy, and has no errors.
No data was deleted, accidentally or otherwise, on my part.
The node is running fine.

However, it was still disqualified on 2 satellites.

littleskunk · October 9, 2020, 12:33pm

You are seeing audit failures in your logfile. How do you explain that? It doesn’t match any of the other points you have mentioned.

leHans · October 9, 2020, 2:25pm

I know. This is why I am here: I do not understand how this can happen.

I am going through all logs at the same time to see if the node is indeed running “fine”.