Several "file does not exist" errors on different nodes

In the last few days I have seen several log entries from different nodes where a GET failed with a "file does not exist" error.

Example of one log message:

2022-08-26T19:51:33.562Z INFO piecestore download started {"Process": "storagenode", "Piece ID": "2DABFN3WXSAMMR44YHRAUNECZVN4JR2OZRYS2RRQHLRWHN6ZM7NQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2022-08-26T19:51:33.567Z ERROR piecestore download failed {"Process": "storagenode", "Piece ID": "2DABFN3WXSAMMR44YHRAUNECZVN4JR2OZRYS2RRQHLRWHN6ZM7NQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:546\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:228\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Since this happened on 5 different nodes (separate physical hardware), I cannot imagine that this is an error on my side. The satellite ID was always 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S.

Crawling the logs for the lifecycle of the piece, I could not find any useful information. I only keep logs for 10 days.
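For reference, this is roughly how I crawl the logs for a single piece. It's only a minimal sketch, assuming plain-text storagenode log lines; the log path and piece ID below are placeholders for my own setup:

```python
# Print every log line that mentions a given piece ID, in order.
# LOG_FILE and PIECE_ID are placeholders -- adjust for your node.
LOG_FILE = "/var/log/storagenode/node.log"
PIECE_ID = "2DABFN3WXSAMMR44YHRAUNECZVN4JR2OZRYS2RRQHLRWHN6ZM7NQ"

def piece_history(log_file: str, piece_id: str) -> list[str]:
    """Return every log line that mentions the given piece ID."""
    matches = []
    with open(log_file, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if piece_id in line:
                matches.append(line.rstrip("\n"))
    return matches

if __name__ == "__main__":
    for entry in piece_history(LOG_FILE, PIECE_ID):
        print(entry)
```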

Any ideas? Are other people seeing this issue, especially with satellite 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S?

With regard to your logs though, I'd suggest keeping at least a month of logs, because some Storj metrics are based on a 30-day window (namely the online score).

Personally, I keep 4 months of logs, and so far I've never needed more than 2 months of history to investigate an issue, but you never know… And compared to the massive amount of storage Storj uses, keeping a few months of logs really isn't a big deal in my opinion :slight_smile:


About your missing-file errors:
During August I have no "piecestore download failed" errors whatsoever. I do have "unable to delete piece" errors caused by missing files, and they are not limited to a single satellite.

So I’m not sure, TBH.

Indeed. I'm keeping the full history for my nodes: 5 GB of xz-compressed logs. Not much to store, yet useful for investigating issues or running experiments. Let's not forget that the storage node is still pretty young software, with some teething issues.

I do observe these errors as well. Most are on this satellite, but I do have instances on other satellites as well.

I've tracked the lifetime of one such piece. It was uploaded 25 days ago and downloaded successfully exactly 30 times (seems like a great piece to have!). The first failed download was observed 4 days ago, with 15 subsequent attempts also failing. I see no explicit deletion event, which is normal if the piece was either deleted while the node was offline (unlikely, as I had maybe a few minutes of cumulative downtime during this month) or uploaded with an expiration time and deleted automatically by the node. I don't know why a client would still see an expired file then, though.
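This is roughly the tally I ran over that piece's log lines. A quick sketch only: it assumes the matching lines were already extracted into a file, and the event phrases below are my assumptions about the usual storagenode log wording, so adjust them for your format:

```python
# Count lifecycle events for one piece and note when the failures started.
# "piece_history.log" is a placeholder for pre-extracted log lines.
from collections import Counter

EVENTS = ("upload started", "uploaded", "download started",
          "downloaded", "download failed", "deleted")

counts = Counter()
failure_times = []
with open("piece_history.log", encoding="utf-8") as fh:
    for line in fh:
        parts = line.split()
        timestamp = parts[0] if parts else "?"   # timestamp is the first token
        for event in EVENTS:
            if event in line:
                counts[event] += 1
                if event == "download failed":
                    failure_times.append(timestamp)
                break

print(dict(counts))
if failure_times:
    print("first failure:", failure_times[0])
    print("last failure: ", failure_times[-1])
```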

Something interesting is happening here indeed. I'd understand one or two failed downloads right after the expiration date, maybe with the client using some old download orders (or something like that; I'm not very familiar with this part of the Storj code). Not sixteen, though.

OK, I checked more carefully. Actually, all of these failures happened right next to each other, within a time span of 2 minutes. So if those 2 minutes were just after the piece expired, it would indeed make sense.

A post was merged into an existing topic: ERROR collector unable to delete piece

It's happening again. Should I be worried about it?

Same satellite, 12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S, but now it is GET_AUDIT failing with "file does not exist" errors. Will this impact my audit score?

I can confirm.
I’m getting this on nodes on different hardware, so this is NOT a disk/filesystem problem.

So far the audit score hasn't been impacted, but I don't know if there is a lag.
According to the logs, the first cases started to appear 2 hours ago.

You should be worried about it. "file does not exist" on a GET_AUDIT will immediately affect the audit score. As soon as it falls below 95%, the node will be disqualified. So please monitor your audit score.
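One way to keep an eye on it is to poll the node's local web dashboard API. This is only a sketch under assumptions: the default port 14002, the /api/sno/satellites endpoint and the "audits"/"auditScore"/"satelliteName" field names are what I believe recent node versions expose, so verify them against your version before relying on it:

```python
# Poll the local storagenode dashboard API and warn when the audit score
# drops. Endpoint and field names are assumptions about the dashboard API.
import json
import urllib.request

DASHBOARD = "http://localhost:14002"   # default web dashboard address
THRESHOLD = 0.98                       # warn well before disqualification

def check_audit_scores(base_url: str = DASHBOARD) -> None:
    with urllib.request.urlopen(f"{base_url}/api/sno/satellites") as resp:
        data = json.load(resp)
    for sat in data.get("audits", []):
        name = sat.get("satelliteName", "unknown satellite")
        score = sat.get("auditScore", 1.0)
        status = "WARNING" if score < THRESHOLD else "OK"
        print(f"{status}: audit score {score:.2%} on {name}")

if __name__ == "__main__":
    check_audit_scores()
```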

Normally, Alexey would be right!

We are running some broad network tests right now and are reusing the GET_AUDIT request for the purposes of our test. The test runs off some slightly stale central metadata, so it may request data that doesn't exist. I now see that by reusing GET_AUDIT, we have given SNOs no way to tell whether a given GET_AUDIT request is one that matters, as Alexey points out, or one of these test requests that doesn't and won't affect your audit score.

We should be done soon, but in the meantime, just keep an eye on your audit score. If your node is fine, it should be fine!


An hour of full FAILs - @jtolio, so glad you answered this!!

I was doing all sorts of testing!

I have an automation setup that checks the logs, and I started getting emails at about the same time on all 3 nodes I have, all running independently… I was quite confused :smiley:
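For the curious, my automation is essentially along these lines. Just a sketch: the log path, the error patterns and the mail addresses are placeholders for my actual setup:

```python
# Scan the log for suspicious error lines and email an alert if any are found.
# Paths, patterns and addresses are placeholders -- this is only a sketch.
import smtplib
from email.message import EmailMessage

LOG_FILE = "/var/log/storagenode/node.log"
PATTERNS = ("download failed", "GET_AUDIT", "file does not exist")

def find_errors(path: str) -> list[str]:
    """Return ERROR lines that match any of the watched patterns."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.rstrip() for line in fh
                if "ERROR" in line and any(p in line for p in PATTERNS)]

def send_alert(lines: list[str]) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"storagenode: {len(lines)} suspicious log entries"
    msg["From"] = "node@example.com"          # placeholder sender
    msg["To"] = "operator@example.com"        # placeholder recipient
    msg.set_content("\n".join(lines[:50]))    # cap the mail size
    with smtplib.SMTP("localhost") as smtp:   # placeholder SMTP relay
        smtp.send_message(msg)

if __name__ == "__main__":
    errors = find_errors(LOG_FILE)
    if errors:
        send_alert(errors)
```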

To be honest, it worries me quite a bit to discover that such tests are carried out on real nodes, causing scary log entries that are normally extremely important and require close monitoring, if not immediate action, to avoid the worst (disqualification)…

Shouldn't such tests be run on the test network instead?

How are we going to differentiate between test error logs and legitimate error logs now?


Please reconsider this approach for future tests. It is giving your best and most attentive node operators a big scare. Many of us monitor audit and repair traffic carefully, and seeing failures there makes us fear losing our nodes. And since the impact on scores only becomes visible with a delay, the fact that the scores still look fine right now isn't very reassuring.


Thank you @jtolio for clearing this up so promptly.

I can only agree with the previous posters: please reconsider running such tests in production. There are usually dedicated test stages (QSA) for this. Not only do your customers (the node operators) get nervous, since many of them really do watch their logs and even have active alerting so they can react in the event of an error; with such actions there is always the possibility that something goes really, really wrong. Worst-case scenarios would be disqualification of node operators' nodes, or maybe (?) even data loss if those disqualifications affected a lot of nodes. I am not a mathematician, but would that be possible, and what would the probability be?

I work at a larger IT company; if we ran such tests without telling the OPS department, all hell would break loose :wink: We have a saying, "The devil is a squirrel" (the devil comes in many forms), so please: never ever do this in production.


I agree with the other operators. Operators are expected to monitor their nodes on their own. The audit score is not updated frequently enough, so alerting on log entries is the only way to get feedback quickly. I've already pointed out several times that false alarms are a bad thing; please don't create more of them.

I will also add my voice to those above. I was getting ready to sleep, and seeing these statistics on my 8TB node, which has been running since the beginning of v3, really gave me an awful ~40 minutes. The numbers are low because the logs get rotated when they exceed 100 MB.
[screenshot: error statistics from the node's logs]

OK, the audit score is still 100%. Now, after finding this post, I know everything is fine and the node is not in danger. Please don't ever do this again.