Error Codes: What they mean and Severity Level [READ FIRST]

vedalken254 · July 14, 2019, 8:09pm

Hey all! I wanted to take the opportunity to compile a list of error codes, what they mean, and if you need to worry about the error. I will be updating this as I am able to get information, but wanted to create a central place that people can check. Find the error you are looking for in the list and click the Summary section below it. This will open up with a description of what the error means and its severity.

11/30/2020: A quick note, since I have found threads that reference errors from this post but with newer error text than when the post was written: look for the common terms. Even if the message isn’t exactly the same, it will likely contain the same information as an error listed here. For example, the Context Canceled error is now reported as trust:rpc:Context Canceled rather than any of the earlier Context Canceled variants in this post. I will try to take time this upcoming weekend to go through the different troubleshooting posts and update this post with the newer error text.
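As a rough illustration of "look for the common terms", here is a minimal Python sketch that matches a log line against key phrases instead of the full message text. The phrases and severity notes are taken from the summaries in this post; the names `KNOWN_ERRORS` and `classify` are made up for the example and are not part of any Storj tool.

```python
# Key phrases and severity notes come from the summaries in this post.
# KNOWN_ERRORS and classify are hypothetical names used only for this sketch.
KNOWN_ERRORS = [
    ("database disk image is malformed", "critical: recover the database"),
    ("unique constraint failed", "normal: no action needed"),
    ("context canceled", "normal unless every transfer fails"),
    ("upload rejected, too many requests", "informational: tunable setting"),
    ("unexpected eof", "informational: no action needed"),
]


def classify(line: str) -> str:
    """Return the severity note for the first known phrase found in the line."""
    lowered = line.lower()  # compare case-insensitively, since wording varies
    for phrase, note in KNOWN_ERRORS:
        if phrase in lowered:
            return note
    return "unknown: not covered by this list"


if __name__ == "__main__":
    old_style = ("2019-07-14T19:54:19.570Z ERROR piecestore protocol: "
                 "rpc error: code = Canceled desc = context canceled")
    new_style = "ERROR trust:rpc:Context Canceled"
    # Both variants share the phrase "context canceled", so they match the same entry.
    print(classify(old_style))
    print(classify(new_style))
```

Matching on a lowercase substring is deliberately loose: it keeps working when a release changes the prefix or capitalization of a message, which is exactly the situation described in the note above.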

UNIQUE Constraint failed:
2019-07-14T17:31:24.969Z        ERROR   piecestore internal: infodb: UNIQUE constraint failed: pieceinfo.satellite_id, pieceinfo.piece_id

Summary

You will only see this error when a satellite attempts to upload a piece to your node as part of a repair job, but your node already has the piece so it errors out as piece IDs are required to be unique. This is a normal error and requires no action from the SNO.

2019-08-31T23:35:57.036Z ERROR   server  piecestore protocol: rpc error: code = Internal desc = transport: transport: the stream is done or WriteHeader was already called

Summary

This error is much like Context Canceled except that it is almost guaranteed that other nodes were faster in receiving their pieces than yours. It is an informational error and should not be concerning for an SNO.

Upload Rejected:
2019-07-14T00:51:47.842Z        ERROR   piecestore      upload rejected, too many requests      {"live requests": 7}

Summary

This error results from a setting implemented in v0.14.9 that the SNO can modify. It does not impact your reputation, so seeing it is not indicative of a problem on the SNO side. Two notes: 1) ARM-based nodes should probably not go above 20 or 30 with the setting detailed below, as higher values can overload them, and 2) Windows users need to use Notepad++ to modify the required file, as Notepad does not respect formatting. To modify this setting, stop your node and open config.yaml, which is in the storage directory you specified in your run command.
Add this line to the bottom of the file (the number should be tweaked to your node’s performance):

storage2.max-concurrent-requests: 50

Save the file then start the node again to correct this informational error.

Context Canceled:
2019-07-14T19:54:19.570Z        ERROR   piecestore protocol: rpc error: code = Canceled desc = context canceled
2019-07-16T12:31:06.716Z        ERROR   piecestore internal: infodb: context canceled
2019-07-14T10:21:47.244Z        ERROR   piecestore internal: infodb: interrupted

Summary

Example: Upload failed in log files
Context Canceled indicates that your node was too slow in the transfer and the required number of nodes completed their transfer before you. This is a normal error to encounter and should only be concerning if you’re only getting these and errors regarding uploads/downloads starting and always failing. At that point, you will want to investigate your internal network for issues then check that your speeds meet the project requirements.

The interrupted error only shows up on uploads and occurs when a message that would trigger a context canceled error is received while the Storage Node is writing to the infodb and the connection to the db is immediately cut as a result.

Unexpected EOF:
2019-07-14T19:55:10.668Z        ERROR   piecestore protocol: unexpected EOF

Summary

This error is typically only encountered on uploads to the storagenode and is similar to the Context Canceled error in that it typically indicates a slower node but can also mean the uplink or satellite cancelled the request before the upload completed rather than other nodes being faster. This is an informational error and should not require any action from SNOs.

infodb: database disk image is malformed:
{"error": "infodb: database disk image is malformed", "errorVerbose": "infodb: database disk image is malformed\n\tstorj.io/storj/storagenode/storagenodedb.(*ordersdb).ListUnsentBySatellite:151\n\tstorj.io/storj/storagenode/orders.(*Sender).runOnce:112\n\tstorj.io/storj/internal/sync2.(*Cycle).Run:87\n\tstorj.io/storj/storagenode/orders.(*Sender).Run:105\n\tstorj.io/storj/storagenode.(*Peer).Run.func5:336\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Summary

This is a critical error that requires immediate attention. The SNO must perform the steps included in the link in order to recover the infodb database.

Voucher Errors:
ERROR	vouchers	Error requesting voucher{"satellite": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "voucher: unable to find satellite on the network: node not found", "errorVerbose": "voucher: unable to find satellite on the network: node not found\n\tstorj.io/storj/storagenode/vouchers.(*Service).request:127\n\tstorj.io/storj/storagenode/vouchers.(*Service).Request:116\n\tstorj.io/storj/storagenode/vouchers.(*Service).RunOnce.func1:103\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Summary

NOTE: As of v0.21.1, this message should no longer appear as kademlia was removed in v0.21.1

If your node is new, you may encounter this error for any of the four satellites, and that is normal. If your node has been up for a while (7d+), however, it is likely indicative of DNS or other networking issues. If you cannot find issues with your DNS, you may try stopping the node, renaming kademlia to kademlia.bak, and then starting the node again. You will get this message upon start, but that is due to your node now having an empty kademlia routing table. It may or may not go away.

ERROR server rpc error: code = PermissionDenied desc = info requested from untrusted peer

Summary

BrightSilence

With the latest update getting node info is restricted to only trusted nodes. When someone else tries to retrieve this information, you’ll see this in your log.

Nodestats:cache messages:
ERROR nodestats:cache Get disk space usage query failed {"error": "node stats service error: rpc error: code = PermissionDenied desc = node not found"}
ERROR nodestats:cache Get stats query failed {"error": "node stats service error: unable to connect to the satellite"}

Summary

If the node ID that was not found matches your node ID, this is a temporary message caused by the fact that your node is new enough that not all of the satellites know of your node yet. It should go away with time.

The second error appears if the satellite is down or otherwise unavailable.

Download Errors:
2019-08-29T15:54:15.647Z INFO piecestore download failed {"Piece ID": "AXNYNZLQSU6FTH55AJPWK34BQCFDWG5EWFBPTNOVLOHA2KVXUT4Q", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET", "error": "piecestore: piecestore protocol: rpc error: code = Unavailable desc = transport is closing", "errorVerbose": "piecestore: piecestore protocol: rpc error: code = Unavailable desc = transport is closing\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download.func3:504\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}
2019-12-15T19:56:00.530Z        INFO    piecestore      download failed {"Piece ID": "25JGFFHGSZBEHEQMTYZ5QUWIAVCN5CCHDGX5O2VGJ7BXFQAK66IQ", "Satellite ID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET", "error": "piecestore: piecestore protocol: write tcp 172.17.0.2:28967->[redacted]:36716: use of closed network connection", "errorVerbose": "piecestore: piecestore protocol: write tcp 172.17.0.2:28967->[redacted]:36716: use of closed network connection\n\tstorj.io/drpc/drpcstream.(*Stream).pollWrite:189\n\tstorj.io/drpc/drpcwire.SplitN:25\n\tstorj.io/drpc/drpcstream.(*Stream).RawWrite:233\n\tstorj.io/drpc/drpcstream.(*Stream).MsgSend:266\n\tstorj.io/storj/pkg/pb.(*drpcPiecestoreDownloadStream).Send:1078\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).doDownload.func3:598\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Summary

When either of these info messages appears, either the uplink cancelled the download request or 29 other storagenodes completed their transfers before yours could. Functionally, this is identical behavior to the Upload failed Context Canceled errors. No action is required from the SNO, as these messages are normal to see.
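The long log lines above are easier to read once the JSON part is pulled out. The following is a minimal Python sketch, assuming (as in the examples in this post) that the structured fields are a single JSON object at the end of the line. The helper names `parse_fields` and `short_reason` are made up for the example, and the sample line is the first download error above with the errorVerbose field left out for brevity.

```python
import json


def parse_fields(line: str) -> dict:
    """Return the JSON object that ends a log line, or {} if there is none."""
    start = line.find("{")  # assumption: the first "{" opens the trailing JSON
    if start == -1:
        return {}
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return {}


def short_reason(line: str) -> str:
    """Return the last ': '-separated part of the "error" field."""
    error = parse_fields(line).get("error", "")
    return error.rsplit(": ", 1)[-1]


if __name__ == "__main__":
    # Shortened copy of the first example above (errorVerbose omitted).
    sample = ('2019-08-29T15:54:15.647Z INFO piecestore download failed '
              '{"Piece ID": "AXNYNZLQSU6FTH55AJPWK34BQCFDWG5EWFBPTNOVLOHA2KVXUT4Q", '
              '"SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", '
              '"Action": "GET", '
              '"error": "piecestore: piecestore protocol: rpc error: '
              'code = Unavailable desc = transport is closing"}')
    print(parse_fields(sample)["Action"])
    print(short_reason(sample))
```

Splitting on the last ": " keeps only the innermost reason, which is usually the part worth comparing against the list in this post.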

DRPC errors:
2019-11-02T02:56:42.455Z        INFO    piecestore      download failed {"Piece ID": "5STYD4QTAXWG7VFYV5Y2DSQE4E7F4D22H5IVFGYDE46NZ6IBNWKQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET", "error": "piecestore: piecestore protocol: drpc: stream terminated by sending error", "errorVerbose": "piecestore: piecestore protocol: drpc: stream terminated by sending error\n\tstorj.io/drpc/drpcstream.(*Stream).SendError:261\n\tstorj.io/drpc/drpcmanager.(*Manager).manageStream:224"}

Summary

This info message will appear whenever either the Storage Node or the Uplink unexpectedly terminates the connection on a download. This is a generic error message and does not inherently mean that there is a problem that requires SNO intervention.

2021-11-03T21:15:01.649-0400 WARN contact:service Your node is still considered to be online but encountered an error. {"Satellite ID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Error": "contact: failed to dial storage node (ID: ***) at address ***:28967 using QUIC: rpc: quic: timeout: no recent network activity"}

Summary

This means that you did not set up UDP in your docker run command and did not forward the UDP port.

Some Errors(Warns) at log - #4 by BrightSilence

I know I’ve probably missed a few errors, but if you’d like to post them, I will find out and update my list here.

Dylan · July 15, 2019, 8:07pm

Awesome post, very helpful!

vedalken254 · July 16, 2019, 7:25am

Glad to be of some help at the very least. I do have a recommendation for when RocketChat is in the sunsetting phase: could we get this post to be a global pin, or at least stickied in the Storage Node Operators category? I think that would be ideal for people wanting to know about error codes, so they don’t make repeat posts.

SRS · July 16, 2019, 12:58pm

What is the difference between these two:

“ERROR piecestore internal: infodb: context canceled” and “ERROR piecestore protocol: rpc error: code = Canceled desc = context canceled”

And what does this mean?
ERROR piecestore internal: infodb: interrupted

vedalken254 · July 16, 2019, 1:31pm

That is a very good question on both counts. I would assume that context canceled regardless of infodb or protocol would mean too slow, but I will confirm. I will also check on the interrupted message.

Alexey · July 16, 2019, 1:50pm

Context canceled means the same thing in both cases: your node is too slow. It can be thrown from different lines of code. In the database case, your node was too slow to receive information from the database. This could be because your storage is slow (for example, a network-connected drive) or because the overall performance of the system is not enough to process all requests.

Your node received “context canceled” exactly in the middle of a request to the database, and the connection was interrupted immediately.

vedalken254 · July 16, 2019, 1:58pm

The infodb error for context canceled actually only shows up on uploads, not downloads. The protocol one shows up on both uploads and downloads, and interrupted shows up on uploads only as well. I will update the wiki notes though.

EDIT: Updated

ifraixedes · July 16, 2019, 2:35pm

This may happen because an upload has been cancelled by the uplink before it has completed.

BrightSilence · July 16, 2019, 7:51pm

The infodb needs to do more for an upload than for a download. So the likelihood of a context cancelled happening during an infodb operation is much higher on uploads.

vedalken254 · July 16, 2019, 10:18pm

Thanks for the explanation on that. That’s what I needed.

KernelPanick · July 17, 2019, 1:30am

My data confirms this. I was likely to have a 30-35% higher failure rate for repair uploads. However, it did not correlate the same with a standard upload.

Alexey · July 17, 2019, 9:46pm

3 posts were split to a new topic: ERROR untrusted: trust:: context canceled after update to v0.15.2

Alexey · July 17, 2019, 10:34pm

2 posts were split to a new topic: Authentication handshake failed: tls peer certificate verification error: tlsopts error: peer ID did not match requested ID

Chris21788 · July 17, 2019, 9:45pm

Is this error of any concern, or is it because/causing a failed upload?

2019-07-17T21:44:56.889Z ERROR piecestore protocol: rpc error: code = Canceled desc = context canceled
storj.io/storj/storagenode/piecestore.(*Endpoint).Upload:238
storj.io/storj/pkg/pb._Piecestore_Upload_Handler:701
storj.io/storj/pkg/server.logOnErrorStreamInterceptor:23
google.golang.org/grpc.(*Server).processStreamingRPC:1209
google.golang.org/grpc.(*Server).handleStream:1282
google.golang.org/grpc.(*Server).serveStreams.func1.1:717

Alexey · July 17, 2019, 10:32pm

This is the standard “context canceled” error, meaning your node was too slow.

Chris21788 · July 17, 2019, 10:51pm

Too slow in what way? I have 1GBps down/up internet, and manage well over 475MB/s write speeds on these disks.

Alexey · July 17, 2019, 11:20pm

Against competitors. In the case of uploads, 80 other nodes were faster than your node to receive a piece.
What matters is the speed between your node and the customer’s uplink.
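To make the "race" idea concrete, here is a toy Python sketch of it. It is not Storj code. The figure of 80 needed transfers comes from this post; the total of 100 nodes and the random transfer times are made-up values for illustration, and `race` is a hypothetical helper.

```python
import random


def race(node_times, needed):
    """Split node indices into the `needed` fastest and the rest (canceled)."""
    order = sorted(range(len(node_times)), key=lambda i: node_times[i])
    return order[:needed], order[needed:]


if __name__ == "__main__":
    rng = random.Random(1)
    # Assumption: 100 nodes start the transfer (made-up number).
    # Each time stands for the whole path between uplink and node,
    # not just the node's own bandwidth or disk speed.
    times = [rng.uniform(0.05, 0.5) for _ in range(100)]
    finished, canceled = race(times, needed=80)
    print(len(finished), "finished,", len(canceled), "canceled")
```

In this model the slowest nodes get canceled no matter how fast they are in absolute terms, which is why a node with a fast connection and fast disks can still log context canceled.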

Alexey · July 17, 2019, 11:50pm

A post was merged into an existing topic: Why did you choose to use a sqlite db?

BlackDuck · July 18, 2019, 4:47am

1Gbps down/up is only one variable; your throughput depends on latency to the node too.