Vetting / audit process duration

I’ve got a node running for about a week with ~400MB of data stored (which I understand is about average at this point for a node still being vetted). I read that a node is still considered “undergoing vetting” until it has passed 100 audits.

Since the node was launched, I’ve received a grand total of one audit from a single satellite. If this progresses linearly, it will take about 1.9 years to become vetted.

Will the frequency of audits increase at some point? About how long on average is the vetting process supposed to last? I wasn’t able to find this information in any of the documentation or other forum posts.
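For reference, the 1.9-year figure is just a straight-line extrapolation of 1 audit per week against the 100-audit threshold:

# Back-of-envelope: 1 audit in ~7 days, 100 audits needed per satellite.
echo $((100 * 7))                    # ≈ 700 days
echo "scale=1; 100 * 7 / 365" | bc   # ≈ 1.9 years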

Your node could be paused if it has lost data or has been offline.
You can check it yourself:

You can either post your NodeID here or open a support ticket at https://support.storj.io
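If you want to check from the command line first, here is a minimal sketch against the node's local web-dashboard API (port 14002 and the satellite ID shown are examples, adjust to your setup):

# List the satellites the node knows about, then pull audit stats for one of them.
curl -s 127.0.0.1:14002/api/dashboard | jq '.data.satellites'
SAT=12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S   # example satellite ID
curl -s "127.0.0.1:14002/api/satellite/$SAT" | jq '.data.audit'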

It’s also been quiet on the network for the past week, and while being vetted you get much less data. The upside is that the number of audits scales with the amount of data stored. So as long as that keeps growing you’ll likely see more audits soon.

Some changes are also in the works to send more audits to unvetted nodes, but I don’t think that has been implemented yet. So use the links @Alexey posted, but if that all looks good, just keep it online and it’ll get better soon.
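If you want to keep an eye on that growth, a quick check of how much data the node holds (paths are examples, substitute your own storage mount; the blobs/ directory is normally split per satellite):

# Total stored data, and a rough per-satellite breakdown.
du -sh /mnt/storagenode/storage
du -sh /mnt/storagenode/storage/blobs/*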

In fact, in that thread you’ll see a script I wrote while troubleshooting this. :slight_smile: Here is the output:

{
  "id": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs",
  "audit": {
    "totalCount": 0,
    "successCount": 0,
    "alpha": 1,
    "beta": 0,
    "score": 1
  },
  "uptime": {
    "totalCount": 1374,
    "successCount": 1370,
    "alpha": 99.99971608108261,
    "beta": 0.00018529472652453104,
    "score": 0.9999981470509073
  }
}
{
  "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S",
  "audit": {
    "totalCount": 1,
    "successCount": 1,
    "alpha": 1.95,
    "beta": 0,
    "score": 1
  },
  "uptime": {
    "totalCount": 1251,
    "successCount": 1241,
    "alpha": 98.38701907627873,
    "beta": 1.6126414121994246,
    "score": 0.9838735311267859
  }
}
{
  "id": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW",
  "audit": {
    "totalCount": 0,
    "successCount": 0,
    "alpha": 1,
    "beta": 0,
    "score": 1
  },
  "uptime": {
    "totalCount": 1387,
    "successCount": 1373,
    "alpha": 99.28534492966931,
    "beta": 0.7145685255298282,
    "score": 0.9928543085604773
  }
}
{
  "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6",
  "audit": {
    "totalCount": 0,
    "successCount": 0,
    "alpha": 1,
    "beta": 0,
    "score": 1
  },
  "uptime": {
    "totalCount": 1176,
    "successCount": 1167,
    "alpha": 99.21963414297048,
    "beta": 0.7796443926023782,
    "score": 0.9922034998250009
  }
}

Those numbers look fine.

Did you also check upload success rates to make sure you’re not failing too many uploads?
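If you don't want to run the full script, a rough manual check straight from the container log works too (assuming the default container name storagenode; the exact log phrases may differ between versions):

# Count upload outcomes in the current container log (rough check only).
docker logs storagenode 2>&1 | grep -c "uploaded"
docker logs storagenode 2>&1 | grep -c "upload failed"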

I’m seeing similar audit results for a node that’s been up just under a week:

{ "id": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "audit": { "totalCount": 0, "successCount": 0, "alpha": 1, "beta": 0, "score": 1 }, "uptime": { "totalCount": 1624, "successCount": 1621, "alpha": 99.74912849151943, "beta": 0.25086351414993935, "score": 0.997491364657952 } } { "id": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "audit": { "totalCount": 1, "successCount": 1, "alpha": 1.95, "beta": 0, "score": 1 }, "uptime": { "totalCount": 1138, "successCount": 1136, "alpha": 99.62463857558707, "beta": 0.37430441822989813, "score": 0.9962569162530743 } } { "id": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "audit": { "totalCount": 0, "successCount": 0, "alpha": 1, "beta": 0, "score": 1 }, "uptime": { "totalCount": 1408, "successCount": 1405, "alpha": 99.73676180530727, "beta": 0.2631681169555951, "score": 0.9973683169862201 } } { "id": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "audit": { "totalCount": 0, "successCount": 0, "alpha": 1, "beta": 0, "score": 1 }, "uptime": { "totalCount": 1757, "successCount": 1756, "alpha": 99.99617935571987, "beta": 0.003818544068783942, "score": 0.9999618145585102 } }

I don’t have logs redirected to a file yet, so this only covers the last 17 hours since the container was restarted to update Docker, but it seems like I’m getting a significant number of upload failures for some reason:

========== AUDIT =============
Successful:           0
Recoverable failed:   0
Unrecoverable failed: 0
Success Rate Min:     0.000%
Success Rate Max:     0.000%
========== DOWNLOAD ==========
Successful:           72
Failed:               10
Success Rate:         87.805%
========== UPLOAD ============
Successful:           31
Rejected:             0
Failed:               28
Acceptance Rate:      100.000%
Success Rate:         52.542%
========== REPAIR DOWNLOAD ===
Successful:           0
Failed:               0
Success Rate:         0.000%
========== REPAIR UPLOAD =====
Successful:           0
Failed:               0
Success Rate:         0.000%
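A simple stopgap until I set up proper log redirection would be to snapshot the container log to a file before the container gets recreated on an update (container name assumed to be storagenode, path is an example):

# Dump the current container log to a file so the history isn't lost on recreation.
docker logs storagenode > /path/to/storagenode.log 2>&1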

I do have a lot of upload failures. Based on what I’ve read in the forums, I attribute this to the satellite being in a different geographic region, so uploads complete faster to other nodes that are closer to the uploader; it’s also basically the only satellite I seem to get upload requests from.

My network connection is fine and should be plenty fast enough to handle decent volumes of incoming data (400Mbps downstream), and the underlying storage can easily write 64MB in under a second. Latency to the uploader is the only explanation that makes sense.

(My node just auto-restarted for v0.20.1 so the successrate script is only returning data from when it was restarted, showing 50% upload success out of 2 uploads. Not very helpful yet.)

Yeah, usually the most you can do on your end is reduce I/O latency. So use direct-attached storage: no USB drives or network-attached storage over NFS/SMB, etc. Other than that, just wait it out.

The storage is on an LVM LV on an md-raid RAID1 of HDDs. Presumably adding an SSD writeback cache would help, but the drives can burst write 64MB pretty fast…
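For what it's worth, a crude way to time a 64MB write plus flush on the array (run it from a directory that lives on the array; dd with conv=fsync forces the data to disk before reporting):

# Time writing 64 MiB and syncing it, then clean up the test file.
time dd if=/dev/zero of=./dd-test.bin bs=1M count=64 conv=fsync
rm -f ./dd-test.bin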

I’m seeing the same with a node that’s been running for ~10 days.
What I don’t understand is that most of the traffic I see in the logs comes from/to 118UWp, which has issued only a single audit. At the same time, there were 47 audits from 12EayR with barely any traffic.
I’ve been running Watchtower since the start and the node is up to date, with no downtime. I have 1.4 GB on disk so far.

# api() { curl -s 127.0.0.1:14002/api/$1; }; for sat in `api "dashboard" | jq -r '.data.satellites[]'`; do printf '%-8s %-5s %-5s %-20s %-5s %-5s\n' `echo $sat | cut -c1-6 ; api "satellite/$sat"| jq -r '.data.audit[] ' | xargs -L5 ` ; done
121RTS   0     0     1                    0     1
12L9ZF   1     1     1.95                 0     1
12EayR   47    47    18.294848193313815   0     1
118UWp   1     1     1.95                 0     1

Something doesn’t add up here.
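For what it's worth, the columns above are just the fields of the audit object in order (totalCount, successCount, alpha, beta, score), and the alpha values look consistent with a beta-reputation update of roughly this form; λ = 0.95, weight w = 1 and the starting values are assumptions inferred from the numbers, not documented parameters:

# Reproduce alpha after 47 successful audits (compare with 18.294848… for 12EayR above).
awk 'BEGIN {
  lambda = 0.95; w = 1; alpha = 1; beta = 0   # assumed parameters and starting state
  for (i = 1; i <= 47; i++) {                 # one iteration per successful audit (v = +1)
    v = 1
    alpha = lambda * alpha + w * (1 + v) / 2
    beta  = lambda * beta  + w * (1 - v) / 2
  }
  printf "alpha=%.6f beta=%.6f score=%.6f\n", alpha, beta, alpha / (alpha + beta)
}'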

This tracks with my node, as best as it can with only one audit: almost all of the traffic is from 118UWp and the only audit is from 12EayR.

And FWIW, my node has had maybe 10-15 minutes of downtime from tweaking various KVM settings toward the beginning of its lifetime, plus brief moments of downtime whenever the container is upgraded. Otherwise the node has been up and reachable 24/7. Note that my uptime scores are all above or near 0.99.

The vetting process is intended to take at least a month.
Please keep your nodes online!

So presumably the audit frequency will pick up? Because at the current rate it’s 4 audits/mo. :slight_smile:

Your node needs to pass 100 audits on each satellite independently to get vetted there.
Of course, you need to have data from that satellite for it to audit.
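A quick way to see progress toward those 100 audits per satellite, using the same local API as the one-liner above (port 14002 assumed):

# Print passed/total audits for every satellite the node knows about.
for sat in $(curl -s 127.0.0.1:14002/api/dashboard | jq -r '.data.satellites[]'); do
  curl -s "127.0.0.1:14002/api/satellite/$sat" \
    | jq -r --arg sat "$sat" '"\($sat[0:6])  \(.data.audit.successCount)/\(.data.audit.totalCount) audits passed (100 needed)"'
done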

I did some testing and found that the array can write 64MB and fsync() it in about half a second, but in the logs I typically see the failures after about 3 seconds. With 400Mbps downstream and these array write speeds, it seems like I shouldn’t be losing as many uploads as I am, unless the cause is network latency (say, most uploaders are in Europe or something; I’m in the Midwest USA).

========== UPLOAD ============
Successful:           3
Rejected:             0
Failed:               16
Acceptance Rate:      100.000%
Success Rate:         15.790%

That’s because this satellite has MUCH more data on the network than the other satellites. So when it audits a random stripe, it has a much larger pool to pick from. And even though you hold the most data from this satellite, your data makes up a smaller share of its total than your data does for the other satellites. There is really nothing you can do on your end to speed up that process. Just wait and let it happen over time.

it has a much larger pool to pick from

Pool of stripes or nodes?
If you mean stripes, that would imply my audit rate will only slow down over time as traffic grows (it’s currently quite low). My current audit rate for 118UWp is 1 audit in 10 days, which would take 1000 days to get the node vetted.

I meant a pool of stripes. But your math is off.
I believe unvetted nodes get about 5% of the traffic vetted nodes get, but right now you hold much less than 5% of what the average node holds. So more traffic brings you closer to that share, and gets you more audits too. Trust me, when there is more activity it takes about a month.

But there are also changes being worked on that prioritize unvetted nodes for audits, so that may help in the future as well.

Today I got my first uploads from a different satellite, in fact quite a few of them, and they have all succeeded. This strengthens my belief that the 118U satellite is just geographically distant, so I’m losing out on uploads to nodes that have less network latency to the uploader. Is there a list somewhere of the official satellites and their geographic locations?

I don’t know the exact names, but there’s the development/test satellite in Germany and 3 other Tardigrade satellites in Europe, North America, and Asia.