Audits stopped. Why?

What could be the reason that audits have stopped? The last one was received 2 days ago.

[root@localhost ~]# grep AUDIT /storj/node.log | tail -1
2023-01-10T18:53:36.690Z        INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "7B7LBJRFHS3FYX7AKXMG3NP35SDT3Q5U7NLMV5EU7I47MFFPX6DQ", "Satellite ID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Action": "GET_AUDIT"}
[root@localhost ~]#
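
For anyone who wants to check the same thing on their node, something like this should work (same log path as above, GNU grep assumed; adjust to your setup):

# audits per day
grep GET_AUDIT /storj/node.log | cut -c1-10 | sort | uniq -c

# audits per satellite
grep GET_AUDIT /storj/node.log | grep -Po '"Satellite ID": "\K[^"]+' | sort | uniq -c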

Hello @binary

Can you please provide some more information to help the team understand your situation?

The last 20 lines of your log?
What does your dashboard show?
Is the node suspended or disqualified on any satellites?
Is the node offline?
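
If it's easier than reading the logs, the suspension/disqualification status can usually be pulled from the node's local dashboard API. A rough sketch, assuming the default dashboard port 14002 and that jq is installed (field names can vary slightly between versions):

# list per-satellite disqualified/suspended status from the local dashboard API
curl -s http://localhost:14002/api/sno | jq '.satellites[] | {id, disqualified, suspended}'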

Thank you.

1 Like

Node is not disqualified or suspended. Node is not offline.

The only thing worth mentioning is that a month ago I had issues with the disk and had to replace it. Copying the data to another disk took a while, so there was noticeable downtime (see graphs). That resulted in a low “online” score, but this score should have started to recover right about now…

These are the failed audit windows from a month ago:

[root@localhost ~]# ./storj_audit.sh | grep windowStart | grep -Po "(?<=\s\")\d{4}-\d{2}-\d{2}(?=T)" | sort | uniq -c
     10 2022-12-13
      4 2022-12-14
      3 2022-12-15
      5 2022-12-16
      2 2022-12-17
      1 2022-12-18
     10 2022-12-19
      9 2022-12-20
      9 2022-12-21
      7 2022-12-22
      1 2023-01-08
[root@localhost ~]#

Dashboard:

Last 20 lines of log:

[root@localhost ~]# tail -20 /storj/node.log
2023-01-12T20:20:11.378Z        INFO    piecestore      uploaded        {"Process": "storagenode", "Piece ID": "JKSZDV523IN2H2J2FEGUCTKAJPPJDHCHKI3GYOQMIXCMC5IU4ZDA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 13824}
2023-01-12T20:20:11.978Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 2326740932240}
2023-01-12T20:20:12.118Z        INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "XI4CBCSVQWZDXBPWZZXC6GJQC2TWXD3NMNTEPYMSI4CG35SOTMXA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:12.162Z        INFO    piecestore      uploaded        {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 2048}
2023-01-12T20:20:12.895Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "6FWTGIIBQXSO4VON27SCMWDX46FAOAHL6ORMJMJ6NMNYO4IDRWDA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 2326740929680}
2023-01-12T20:20:12.987Z        INFO    piecestore      uploaded        {"Process": "storagenode", "Piece ID": "6FWTGIIBQXSO4VON27SCMWDX46FAOAHL6ORMJMJ6NMNYO4IDRWDA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Size": 2304}
2023-01-12T20:20:15.739Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "RUQPVABHYKYTRZC3DT37RDQ4RUTH6YHIJBHAVNQQP24ZCJRTKTSQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Available Space": 2326740926864}
2023-01-12T20:20:17.324Z        INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.339Z        INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.368Z        INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.426Z        INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.474Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "BHPE4CZU25GUN3OTDSHP5CNZ45I6XTUTRL25LQM3XKULTZICHWRQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Available Space": 2326740926864}
2023-01-12T20:20:17.498Z        INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.522Z        INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.558Z        INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.595Z        INFO    piecestore      downloaded      {"Process": "storagenode", "Piece ID": "5JLQC5WW7JFAWKMSRW5TDHHEVKROY5UFTHZRHTVROK4UFSFG6RMQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "GET"}
2023-01-12T20:20:17.696Z        INFO    piecestore      uploaded        {"Process": "storagenode", "Piece ID": "BHPE4CZU25GUN3OTDSHP5CNZ45I6XTUTRL25LQM3XKULTZICHWRQ", "Satellite ID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "Action": "PUT", "Size": 108288}
2023-01-12T20:20:18.385Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "BSOCAIHZSRR7JE7YZUDZN7A4R2JOTE2NMWYXBR434XRF2PLBQWNA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Available Space": 2326740818064}
2023-01-12T20:20:19.142Z        INFO    piecestore      uploaded        {"Process": "storagenode", "Piece ID": "BSOCAIHZSRR7JE7YZUDZN7A4R2JOTE2NMWYXBR434XRF2PLBQWNA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 2319360}
2023-01-12T20:20:24.685Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "E2NID6NJZAWWKFNLY2ZVHORCREDQ4BYKWESVXZUVCC5OZC6TQXNA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Available Space": 2326738498192}
[root@localhost ~]#

@bre

Also, this might be useful: my node’s sliding window for the online score calculation is not sliding anymore; it got stuck.
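
If anyone wants to check the same thing, I believe the data behind that comes from the per-satellite endpoint of the local dashboard API. Something like this should print the most recent audit window (assuming the default port 14002, jq installed, and <satellite-id> replaced with the full satellite ID; field names may differ between versions):

# show the latest audit window recorded for one satellite
curl -s http://localhost:14002/api/sno/satellite/<satellite-id> | jq '.auditHistory.windows | last'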

Interesting. The last audit on my node is:

2023-01-12T16:27:44.286Z        INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-12T16:27:44.392Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:621\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Last log entry:

2023-01-12T21:43:09.703Z        INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "OL4J3OCQ2WIK2OE7O6TGOOTUGTPSVMAAGXFALJHPTWG3C2NMRAZQ", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET"}

So, a few hours without audits.

And the last audit was a failure?

2023-01-09T13:11:08.758Z        INFO    piecestore      upload started  {"Process": "storagenode", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Available Space": 2887228797102}
2023-01-09T13:11:29.757Z        INFO    piecestore      uploaded        {"Process": "storagenode", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 2319360}
2023-01-12T16:25:13.237Z        INFO    piecedeleter    delete piece sent to trash      {"Process": "storagenode", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA"}
2023-01-12T16:25:49.677Z        DEBUG   piecestore      failed to stat piece    {"Process": "storagenode", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "filestore error: file does not exist", "errorVerbose": "filestore error: file does not exist\n\tstorj.io/storj/storage/filestore.(*blobStore).Stat:102\n\tstorj.io/storj/storagenode/pieces.(*Store).Stat:733\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Exists.func2:235\n\tstorj.io/common/sync2.(*Limiter).Go.func1:49"}
2023-01-12T16:27:44.286Z        INFO    piecestore      download started        {"Process": "storagenode", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT"}
2023-01-12T16:27:44.392Z        ERROR   piecestore      download failed {"Process": "storagenode", "Piece ID": "HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "GET_AUDIT", "error": "file does not exist", "errorVerbose": "file does not exist\n\tstorj.io/common/rpc/rpcstatus.Wrap:73\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Download:621\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func2:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:122\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:112\n\tstorj.io/drpc/drpcctx.(*Tracker).track:52"}

Did the satellite try to audit a deleted piece and then crash?
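
One way to confirm locally whether the piece really left blobs/ and landed in trash/ is a quick find. This is only a rough sketch: the storage path is a placeholder for your own data directory, and it relies on my understanding that on-disk names are a lowercased form of the piece ID with the leading characters split off into a subdirectory:

PIECE=HGDM2MKQVD577S2HV4DNES6FG5KQZW6HUE3QEAGITN5URQMP7NJA
# search blobs/ and trash/ case-insensitively, skipping the first two characters
# of the ID in case they became the directory prefix
find /path/to/storage/blobs /path/to/storage/trash -iname "*${PIECE:2:20}*" 2>/dev/null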

Hey Everyone,

We have temporarily paused audits as we prepare for our new scaled audit system to roll out early next week. This new system will significantly increase the number of audits we can perform. As a result, node vetting is also paused and will resume early next week. With the new system, node vetting will be significantly faster.

Thank you for your patience as we roll these changes out.

12 Likes

Cool. Can you tell us how it is going to be different? The link to the design docs requires a password.

1 Like

Sorry about the link; it’s just pointing to an older version of the blueprint, which can be found here: storj/audit-scaling.md at 0c3dd444909a3d9a1fb767ae03b8cf5dd7ae1435 · storj/storj · GitHub. It was previously discussed here back in July/August.

The only change that should affect storage node operators is that we should be able to do more audits, which means nodes should get vetted faster and reputation changes may happen more quickly.

2 Likes

Does that mean now is a good time to do maintenance? I have a 2-3 hour downtime session I’ve been saving for when my last maintenance scores resolve. Sounds like now may be a better time, so as not to miss audits.

This worries me a little. I had a recent SSD cache failure that left my system unresponsive, but apparently responsive enough not to be considered offline. So instead it started failing audits. Due to unfortunate timing and a pesky thing called sleep, I only noticed this 9 hours later. By that time my lowest audit score was 99.6, so not too bad. But if there had been 10x more audits during that time, my 20+ TB node wouldn’t have survived, even though it was just an intermittent issue that I quickly resolved after noticing it.

What makes this worse is that the number of audits scales with the amount of data stored, so the bigger the node, the more quickly such a situation leads to its demise. Not exactly ideal.

Luckily since then my scores have been recovering and they’re all over 99.9 now. Guess it will be a while until they recover further if audits are now paused.

I didn’t suffer any data loss, as the SSD cache was redundantly implemented and the write cache could be recovered from the second SSD, so there was no permanent damage. But intermittent stalls are still dangerous and will become a lot more dangerous with this change.

5 Likes

That might have been while the repair system was still feeding back into reputation. We turned that off a short while ago because it was causing reputation to change too quickly in situations such as the one you mention.

If we start having reputation change happen too quickly with the new rate of audits, I expect we will re-tune the reputation parameters (or rework the system yet again) in order to slow the reputation change back down. I don’t think we will ever need to be disqualifying nodes for ~10 hours of degraded service.

1 Like

I have my doubts that repair was involved, as that would have definitely led to disqualification. My node was certainly failing all audits during the time the system was stalled, and there would have been way too many to survive if repair had impacted the score as well. I think I remember someone mentioning that repair timing out wasn’t counted as an audit failure. My guess is that the failures I had were audit timeouts and consistent timeouts in containment mode.

Now I also think that slowing down the audit system to make this survivable is a workaround at best. There should be something else to detect stalls like this and just mark the node as offline if it doesn’t respond to anything. I was just mentioning it because speeding up audits while this problem still exists will make it a lot worse.

There is an argument to be made for running audits by selecting a random node and then selecting a random piece on that node to audit, meaning each node gets audited the same number of times. This would eliminate the scaling with data and make the risk at least equal across nodes of different sizes. Alternatively, if you want to get fancy about it, you could calculate a representative sample size based on the number of pieces on the node and use that as the weight in a weighted random selection of nodes for audit. That way there is still some scaling, but it would be logarithmic instead of linear with node size, which would remove the largest part of the increased risk for larger nodes. It would also be a more efficient way to do audits while maintaining reliable reputation scores.
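
Just to make the log-weighting idea concrete, here is a toy sketch. It is purely illustrative, nothing like the actual satellite code, and nodes.txt is a made-up input with one "node_id piece_count" pair per line (piece counts > 1 assumed). With natural-log weights, ln(1e6) ≈ 13.8 vs ln(1e8) ≈ 18.4, so a node with 100x the pieces gets only about 1.3x the audit weight:

# pick one node to audit, with probability proportional to log(piece_count)
awk 'BEGIN { srand() }
     { w = log($2); ids[NR] = $1; weights[NR] = w; total += w }
     END {
       r = rand() * total
       for (i = 1; i <= NR; i++) { r -= weights[i]; if (r <= 0) { print ids[i]; break } }
     }' nodes.txt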

4 Likes

Can I ask a quick question…

Is there an advantage to running on SSD drives rather than enterprise or NAS drives?

Definitely. Because we are humans and sometimes we need to sleep. So we cannot monitor our nodes 24/7.

5 Likes

There’s always some advantage, but it’s minuscule and not worth the cost. 90% of those advantages could also be achieved with an SSD cache, but even that is in many cases not worth the cost. My situation is a little special, as I have multiple nodes as well as other things (not Storj related) running on a single array. This array was SSD-accelerated before I even started running Storj nodes, so it was just a free bonus. But I’ve scaled the usage to the performance with the SSD cache, and now I can’t (or really shouldn’t) go without it anymore. If you’re running just Storj, it would be much better to run individual nodes on individual disks, and as long as they aren’t SMR disks, you would probably see little benefit from SSDs.

I should also add that the bigger advantage when using an SSD cache is the write cache, which should always be redundantly implemented, since SSDs wear out and they don’t die slowly. So you’d be looking at the cost of adding 2 SSDs. The only reason I didn’t have permanent data loss is because that’s how it was set up on my system. (Though I’m a little pissed that Synology doesn’t deal with a single SSD failure gracefully and just freezes and refuses to boot until you remove the dead SSD, leaving you to figure out what actually went wrong. It took me a while to find out it was the SSD cache and then to find which SSD was broken.)

1 Like

I have two nodes on two CMR HDDs, and when the trash/garbage-collection runs start, the nodes start losing races. I went from 98/99% to 70%, and QUIC switches to “Misconfigured”. It happened both in early December and early January. I think the minimum requirements should include either more RAM or some additional cache.
Among other things, since the beginning of January they are down by 70 GB.

Those GC runs were a lot bigger than normal due to GC having been switched off for a while. But even if you drop to 70% ingress only during GC runs, how much income do you really lose? I doubt it’s worth the cost of an SSD or additional RAM.

I went from $23 to an estimated $14/$15 this month. I started with 5.66 TB and now I’m at 5.59 TB, on the 13th of the month…

It’s difficult to determine because the dashboards work in fits and starts.

QNAP TS-230
Realtek RTD1296 4-core 1.4 GHz processor
2 GB DDR4
2x WD Red 8 TB

It’s not your slow system that’s to blame. It’s happening to many of us, I think, depending on geolocation, because the number of new nodes has been huge these last 2 months. I don’t know what made them come onboard in such big numbers, but it affects the ingress and egress of all of us.
All my 8 nodes are decreasing in occupied space because, I guess, more data is being deleted than comes in. Last month I hit 8 TB on my oldest node, and now it’s down to 7.5 TB. The hardware is not the problem: they are all single enterprise HDDs with 6 GB of RAM, and the filewalker takes about 20 minutes. As I discovered recently with the help of other SNOs, the more RAM the better if you want to improve your nodes, but it doesn’t provide huge benefits in earnings; and slow hardware doesn’t explain the halved earnings this month. Only the influx of newcomers does.

That said, Storj needs to increase efforts to onboard new customers…