Suspension score dropped with no errors, then reverted

Storgeez · February 17, 2022, 8:36pm

What would be the reason for suspension score dropping with no log entries shown with “grep GET_AUDIT | grep failed”?

My lowest online score on any satellite is 99.3%, and all audit scores are at 100%. The suspension scores that are below 100% are 94.26%, 99.99%, and 99.47%.

Storgeez · February 17, 2022, 9:37pm

Well now they’re all back to 100%, what the heck…

Alexey · February 18, 2022, 2:15am

Try searching for ERROR; that should catch it.
Also, if you re-created the container, the previous logs are lost and you now have only fresh logs without errors.

Storgeez · February 18, 2022, 8:50am

I also searched with “grep GET_AUDIT | grep error”, with no results. This is on logs that go back months; the last error was in October…

Alexey · February 18, 2022, 8:54am

It’s case sensitive: you need to search for ERROR or use the -i option to ignore case.
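
A minimal sketch of the case-insensitive search suggested above. The function name `audit_errors` and the log path in the usage comment are hypothetical, and the sketch assumes the node's log has been redirected to a file. `-i` makes grep ignore case, and `-a` makes it treat the file as text even if it contains stray binary bytes.

```shell
#!/bin/sh
# audit_errors is a hypothetical helper name, not part of storagenode.
# Prints GET_AUDIT lines that mention "error" or "failed" in any letter case.
audit_errors() {
    # $1: path to a storagenode log file (assumed to be redirected to disk)
    grep -a 'GET_AUDIT' "$1" | grep -i -E 'error|failed'
}

# Usage (the path is an assumption; adjust it to your setup):
#   audit_errors /path/to/storagenode.log
```

With `-i`, the pattern matches both the uppercase `ERROR` level and a lowercase `"error"` field, so the case of the search term no longer matters.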

Storgeez · February 18, 2022, 9:31am

Right, right, that was my bad on the case!

“grep GET_AUDIT | grep ERROR” returns the same results.

jammerdan · February 18, 2022, 9:41am

Something similar happened to me: Now it hit me: 'Your node has been suspended' - #4 by jammerdan

It is very possible that your node was so busy or whatever that it could not even log anything.

Storgeez · February 18, 2022, 10:18am

It is possible I was doing something on the computer that bogged down its resources, but I fail to see how that could prevent a log entry from being generated. If my understanding is right, a failed audit would require one of the following: the audit request packets are dropped on the network side; the node receives the request but fails to respond because the disk is bogged down, which would generate a log entry; or the node crashes, which would be obvious. It looks like something in between happened.

Stubbsey · February 18, 2022, 10:39am

I had this happen on 2 separate Windows nodes (on different machines) yesterday. Both machines are used solely for Storj, and when I checked the logs there were no errors at all. When I checked this morning, both nodes had recovered from around 54% back to 100%.

jammerdan · February 18, 2022, 10:44am

Lol. Same here yesterday. 2 nodes suspended due to heavy disk usage. But I did not bother to find a cause, as the suspension score recovered almost immediately when the disk relaxed again.

Storgeez · February 18, 2022, 11:03am

Yeah, looks like a peak yesterday, though I was shuffling some data around in the background as well. But there should be a log entry. I’m curious to know what went wrong, unless the satellite got overwhelmed or something.

Alexey · February 18, 2022, 5:31pm

One of my nodes has these errors:

github.com/storj/storj

[storagenode] GET_REPAIR v0pieceinfodb: sql: no rows in result set

opened 05:31PM - 18 Feb 22 UTC

AlexeyALeonov

Bug

**Description**

The suspension score is dropped on one of my nodes. There are… several errors such as:

```
2022-02-17T00:34:41.175Z ERROR piecestore download failed {"Piece ID": "UUDW6P4WRQWCR7S6DR2DSI4DSRZCZCWQGA6LS33H4T366K3UNOQA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "v0pieceinfodb: sql: no rows in result set", "errorVerbose": "v0pieceinfodb: sql: no rows in result set\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).Get:131\n\tstorj.io/storj/storagenode/pieces.(*Store).GetV0PieceInfo:688\n\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:468\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:563\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-02-17T17:42:03.236Z ERROR piecestore download failed {"Piece ID": "NBJ4NADIEEBHSHNSL22QCQWB2KWGP3KYYSGS25GFVLV4TQA4HL5A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "v0pieceinfodb: sql: no rows in result set", "errorVerbose": "v0pieceinfodb: sql: no rows in result set\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).Get:131\n\tstorj.io/storj/storagenode/pieces.(*Store).GetV0PieceInfo:688\n\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:468\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:563\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
```

**Steps to reproduce the issue:**

1. Run the latest storagenode
2. Search for ERROR in GET_AUDIT or GET_REPAIR
3. Catch errors like in the **Description**

**Describe the results you expected:**

The bug/error should be fixed so that the suspension and audit scores are not affected.

**Describe the results you received:**

The suspension score is affected.

**Logs:**

```
grep -a ERROR /mnt/x/storagenode2/storagenode.log | grep -E "GET_AUDIT|GET_REPAIR" | tail
```

```
2022-01-24T11:18:29.856Z ERROR piecestore download failed {"Piece ID": "25QLBFEEJQNWKXYSWEOJXX6JVHLPAWMI3URC7622L5M657UUUTSQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "v0pieceinfodb: sql: no rows in result set", "errorVerbose": "v0pieceinfodb: sql: no rows in result set\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).Get:131\n\tstorj.io/storj/storagenode/pieces.(*Store).GetV0PieceInfo:688\n\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:468\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:563\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-01-29T22:59:52.308Z ERROR piecestore download failed {"Piece ID": "KHGTEZZM3JQK2D7OCVCBQKAUVTS4PZRBUK2NTA6W5W6WTEMMZLYQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "v0pieceinfodb: sql: no rows in result set", "errorVerbose": "v0pieceinfodb: sql: no rows in result set\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).Get:131\n\tstorj.io/storj/storagenode/pieces.(*Store).GetV0PieceInfo:688\n\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:468\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:563\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:104\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:60\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:97\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-02-17T00:34:41.175Z ERROR piecestore download failed {"Piece ID": "UUDW6P4WRQWCR7S6DR2DSI4DSRZCZCWQGA6LS33H4T366K3UNOQA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "v0pieceinfodb: sql: no rows in result set", "errorVerbose": "v0pieceinfodb: sql: no rows in result set\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).Get:131\n\tstorj.io/storj/storagenode/pieces.(*Store).GetV0PieceInfo:688\n\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:468\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:563\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
2022-02-17T17:42:03.236Z ERROR piecestore download failed {"Piece ID": "NBJ4NADIEEBHSHNSL22QCQWB2KWGP3KYYSGS25GFVLV4TQA4HL5A", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET_REPAIR", "error": "v0pieceinfodb: sql: no rows in result set", "errorVerbose": "v0pieceinfodb: sql: no rows in result set\n\tstorj.io/storj/storagenode/storagenodedb.(*v0PieceInfoDB).Get:131\n\tstorj.io/storj/storagenode/pieces.(*Store).GetV0PieceInfo:688\n\tstorj.io/storj/storagenode/pieces.(*Store).GetHashAndLimit:468\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:563\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:58\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}
```

**Your environment**

- Operating system and version: Windows 10 Pro 10.0.19043 N/A Build 19043
- Additional environment details (Raspberry PI, Docker, VMWare, etc.): Windows PC,
- Docker desktop

```
docker version
Client:
 Cloud integration: v1.0.20
 Version:           20.10.10
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        b485636
 Built:             Mon Oct 25 07:47:53 2021
 OS/Arch:           windows/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.10
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.9
  Git commit:       e2f740d
  Built:            Mon Oct 25 07:41:30 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```
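
To line errors like these up against the day a score dropped, it can help to count them per day. This is a rough sketch built on the same grep filter as in the issue; the function name `errors_per_day` is hypothetical, and it assumes every log line starts with an ISO 8601 timestamp, as in the excerpts above.

```shell
#!/bin/sh
# errors_per_day is a hypothetical helper name, not part of storagenode.
# Counts ERROR lines for GET_AUDIT or GET_REPAIR per calendar day,
# assuming each log line begins with a timestamp like 2022-02-17T00:34:41.175Z.
errors_per_day() {
    # $1: path to a storagenode log file
    grep -a 'ERROR' "$1" \
        | grep -E 'GET_AUDIT|GET_REPAIR' \
        | cut -c1-10 \
        | sort \
        | uniq -c
}

# Usage (the path is an assumption; adjust it to your setup):
#   errors_per_day /path/to/storagenode.log
```

Each output line is a count followed by a date, so a cluster of errors on the day of the drop stands out.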

Storgeez · February 18, 2022, 7:35pm

So the GET_REPAIR failures are dropping the suspension score?

I don’t seem to have any though. I tried looking for entries with ERROR && GET_REPAIR, excluding (“use of closed” && “broken pipe”), and didn’t get any such GET_REPAIR errors in the last 4 weeks. I didn’t get any GET_AUDIT errors at all, so I don’t know what could be wrong.
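
A sketch of one way to run the search described above with grep. The function name `repair_errors` is hypothetical, and the sketch assumes the exclusion means dropping any line that contains either “use of closed” or “broken pipe”.

```shell
#!/bin/sh
# repair_errors is a hypothetical helper name used only for illustration.
# Prints ERROR lines for GET_REPAIR, skipping lines that contain either
# "use of closed" or "broken pipe" (assumed reading of the exclusion).
repair_errors() {
    # $1: path to a storagenode log file
    grep -a 'ERROR' "$1" \
        | grep 'GET_REPAIR' \
        | grep -v -e 'use of closed' -e 'broken pipe'
}

# Usage (the path is an assumption; adjust it to your setup):
#   repair_errors /path/to/storagenode.log
```

If this prints nothing, the log holds no GET_REPAIR errors other than the two excluded kinds.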

Also I didn’t find any of Alexey’s errors.

Alexey · February 19, 2022, 2:32am

They also affect the audit score, if the request fails with a known error.

YourHelper1 · February 19, 2022, 9:56am

Hello, is it possible that the whole network is having problems? I don’t think my connection went down at any time, and I even got suspended on one satellite with these percentages:

Suspension: 59.19%
Audit: 100%
Online: 99.94%

I also have 0 GET_AUDIT errors and just one GET_REPAIR …
What is happening?

Alexey · February 19, 2022, 1:51pm

Unlikely. But there could be a bug somewhere.
Please try searching for ERROR events. If you use the Docker version and did not redirect the logs, then your logs were likely lost when the container was re-created and no traces remain.

YourHelper1 · February 19, 2022, 2:07pm

No, I checked even the previous logs and, as I said, there are no more errors in them than the ones I mentioned above. I ran the command on all the logs that were in use while I was offline, and I know that because the logs before that don’t have any errors.

From a statistical point of view though, watching all those guys getting suspended lately, yeah, it’s pretty obvious that there are some bugs going on.

Stob · February 19, 2022, 3:46pm

Yup, definitely more likely there are some system bugs affecting less than 0.1% of all nodes, rather than those 0.1% of nodes having some kind of internet or node issue.

Alexey · February 20, 2022, 2:14am

This is weird. You should have some errors anyway. The suspension score should not be affected silently.
It may be database locks or something else, but there must be something.

Storgeez · February 20, 2022, 3:53pm

That’s what I thought, there must be some log entries, but based on what I searched I could not find any either. Although mine did not drop nearly as drastically.