Satellite seeing heavy use, but not getting audits

Hi all! New to Storj here - while I did look around the forums before asking, I didn’t find anything, so apologies if it’s old news and I just missed it. I recently built a NAS for uni work while I do a second degree, and decided to run a Storj node for some passive income while I work my way through it. It’s been up for a week or so now. I’m seeing pretty consistent traffic to it, and after a couple of days I came across the wonderful earnings calculator tool and Grafana dashboards. They confirm what the Storj dashboard itself is telling me about usage, but I noticed something interesting in the audit calculations. According to the Grafana dashboard, eu1 is my second most “active” satellite.

However, according to the earnings calculator, I’m not getting any audits whatsoever from that satellite:

(Images merged into one because of the new user limit, sorry for the untidiness!)

I’m aware that the vetting takes a while, so I’m not too concerned just yet, but I’m also aware that failing audits can lead to disqualification, and it seems interesting that I’m holding so much data from the satellite but it hasn’t decided to check up on it yet. I recall reading something about geofencing, and I’m based in the UK so I’m not entirely sure whether we qualify as the EU for that purpose anymore, but I’d assume that in that scenario I wouldn’t be storing any data anyway. Also, I know that eu-north-1 has been retired, so I’m not worried about that - just eu1.

I’ve checked my firewall and all the appropriate ports are open and routed properly, and the other satellites are auditing/vetting just fine.

So - is this something I need to worry about?

Hello @Mircoxi,
Welcome to the forum!

The name of the satellite and its location doesn’t make much difference: data comes to your node directly from the customers, and they may even be in Australia.
However, the frequency of audit requests depends on how much of that satellite’s customer data is on your node.
6 days is too little to make any prognosis - even to cross the first held amount threshold you need at least 3 months, so… Keep your node online and it should pass the vetting process on each satellite (100 successful audits). For that your node needs enough customer data, and an unvetted node can get only 5% of the total ingress from the customers until it is vetted (100 successful audits), which takes roughly a month or more. So, please, check again after a month :slight_smile:
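
If you want to watch that count yourself, a rough sketch is to grep the log for completed audit downloads per satellite. It assumes info-level logging, a Docker container named storagenode, and that audits show up with the GET_AUDIT action - adjust for your setup:

bash

# Count completed audit downloads per satellite ID from the node log
# (assumes info-level logs and a container named "storagenode")
docker logs storagenode 2>&1 | grep GET_AUDIT | grep downloaded \
  | grep -o '"Satellite ID": "[^"]*"' | sort | uniq -c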

1 Like

Thank you! I figured it might just be a time issue, but seeing everything else around half-way done while that one hadn’t even started got me a little concerned. I’ll keep an eye on it and hopefully see the percentage start creeping up soon. Thanks again!

Although I agree with Alexey, the lack of any audits on that sat is very strange. You should check your logs to see if anything pops up. I’ve passed the vetting process a few times, and the number of audits, from high to low, was: US1, EU1, AP1 on each node. I understand they’ve sped up the vetting process now, so there should be more than 0 audits on EU1.

1 Like

Thanks for the tip with the logs! I’m running this as an app in TrueNAS Scale using the default image, and, annoyingly, it looks like their installation is set to INFO-level logging (and there doesn’t seem to be an option to change it either), so it’s a little spammy. I tailed the last 5000 lines of the log, and the only Error-level entries look like this:

2023-07-22T20:01:05.450901627Z 2023-07-22T20:01:05.450Z ERROR piecestore upload failed {"process": "storagenode", "Piece ID": "[piece ID removed]", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "error": "context canceled", "errorVerbose": "context canceled\n\tstorj.io/common/rpc/rpcstatus.Wrap:75\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload.func6:500\n\tstorj.io/storj/storagenode/piecestore.(*Endpoint).Upload:506\n\tstorj.io/common/pb.DRPCPiecestoreDescription.Method.func1:243\n\tstorj.io/drpc/drpcmux.(*Mux).HandleRPC:33\n\tstorj.io/common/rpc/rpctracing.(*Handler).HandleRPC:61\n\tstorj.io/common/experiment.(*Handler).HandleRPC:42\n\tstorj.io/drpc/drpcserver.(*Server).handleRPC:124\n\tstorj.io/drpc/drpcserver.(*Server).ServeOne:66\n\tstorj.io/drpc/drpcserver.(*Server).Serve.func2:114\n\tstorj.io/drpc/drpcctx.(*Tracker).track:35", "Size": 18688, "Remote Address": "[IP removed]:56660"}

There’s a handful of these, but my understanding is these are somewhat typical messages that get thrown when you lose the race to upload something, and that satellite ID is showing plenty of successful uploads too.
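
If anyone wants to quantify that, a rough log grep can compare completed uploads against lost races - this is only a sketch and assumes a Docker container named storagenode with info-level logging:

bash

# Uploads the node completed
docker logs storagenode 2>&1 | grep -c "uploaded"
# Uploads lost to faster nodes (the "context canceled" ones)
docker logs storagenode 2>&1 | grep "upload failed" | grep -c "context canceled"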

Yes, these are normal lost races. You can set the log level in config.yaml or in the run command if you use Docker. If you want minimal logging, use log level error: edit config.yaml, then stop, remove, and start the node. What does the standard official dashboard tell you about that sat?
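
For example - the paths and container name below are assumptions, so adjust them for the TrueNAS app:

bash

# 1) Persistent: in config.yaml inside the node's config directory, set
#      log.level: error
#    then restart the node.
# 2) Plain Docker: append the flag to the end of your usual run command, e.g.
#      docker run -d <your usual options> storjlabs/storagenode:latest --log.level=error
# Verify which level the node picked up (path is an example, adjust to yours):
grep -i "log.level" /mnt/pool/storagenode/config.yaml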

Edited the config, and started seeing some actual errors!

2023-07-22 21:59:43.276521+00:002023-07-22T21:59:43.276Z	ERROR	orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE	rpc client error when closing sender 	{"process": "storagenode", "error": "order: CloseAndRecv settlement agreements returned an error: manager closed: closed: read tcp 172.16.0.158:52964->34.94.153.46:7777: read: connection reset by peer", "errorVerbose": "order: CloseAndRecv settlement agreements returned an error: manager closed: closed: read tcp 172.16.0.158:52964->34.94.153.46:7777: read: connection reset by peer\n\tstorj.io/storj/storagenode/orders.(*Service).settleWindow:275\n\tstorj.io/storj/storagenode/orders.(*Service).SendOrders.func1:205\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-07-22 21:59:43.276610+00:002023-07-22T21:59:43.276Z	ERROR	orders.1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE	failed to settle orders for satellite	{"process": "storagenode", "satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "error": "order: CloseAndRecv settlement agreements returned an error: manager closed: closed: read tcp 172.16.0.158:52964->34.94.153.46:7777: read: connection reset by peer", "errorVerbose": "order: CloseAndRecv settlement agreements returned an error: manager closed: closed: read tcp 172.16.0.158:52964->34.94.153.46:7777: read: connection reset by peer\n\tstorj.io/storj/storagenode/orders.(*Service).settleWindow:275\n\tstorj.io/storj/storagenode/orders.(*Service).SendOrders.func1:205\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-07-22 21:59:43.301713+00:002023-07-22T21:59:43.286Z	ERROR	contact:service	ping satellite failed 	{"process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 1, "error": "ping satellite: context canceled", "errorVerbose": "ping satellite: context canceled\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatelliteOnce:147\n\tstorj.io/storj/storagenode/contact.(*Service).pingSatellite:102\n\tstorj.io/storj/storagenode/contact.(*Chore).updateCycles.func1:87\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/common/sync2.(*Cycle).Start.func1:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}
2023-07-22 21:59:44.769361+00:002023-07-22T21:59:44.530Z	ERROR	contact:service	ping satellite failed 	{"process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "attempts": 2, "error": "ping satellite: rpc: tcp connector failed: rpc: dial tcp 34.94.153.46:7777: connect: connection refused", "errorVerbose": "ping satellite: rpc: tcp connector failed: rpc: dial tcp 34.94.153.46:7777: connect: connection refused\n\tstorj.io/common/rpc.HybridConnector.DialContext.func1:190"}
2023-07-22 21:59:49.753920+00:002023-07-22T21:59:49.753Z	ERROR	nodestats:cache	Get pricing-model/join date failed	{"process": "storagenode", "error": "payouts service: context canceled", "errorVerbose": "payouts service: context canceled\n\tstorj.io/storj/storagenode/payouts.(*Endpoint).GetAllPaystubs:114\n\tstorj.io/storj/storagenode/nodestats.(*Cache).Run.func1:79\n\tstorj.io/storj/storagenode/nodestats.(*Cache).satelliteLoop:261\n\tstorj.io/storj/storagenode/nodestats.(*Cache).Run:78\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:75"}

The ping failures seem to be continuing while I’m typing this out, and the ID does seem to be the eu1 satellite - curious, since it’s definitely been uploading and downloading from it (the dashboard shows 76GB of ingress and 1.5GB of egress for it since I started running the node), and it’s still showing 100% on all three metrics… Pinging through cmd seems to work, however.

Edit: I’m seeing the saltlake node starting to fail pings too. Time to start getting concerned, perhaps? Everything is still showing 100% health in the official dashboard though… I’ve checked the firewall already, the port is forwarded correctly on both TCP and UDP.

Edit 2: Got caught in the spam filter, oops. Since then, the ping errors have stopped, so that might have just been some heavy startup IO or something?

The “ping satellite failed” error usually means that the satellite was unable to contact your node at the provided contact address and port.
The reason is usually a network misconfiguration (“connection refused”, “timeout”, “no route to host”, etc.), but in the case of “context canceled” or “deadline exceeded” it could be that your node is too busy to respond to the handshake request from the satellite. This check is called a “ping”, but it is not an ICMP ping - it’s a DRPC ping: the request is sent to the node’s address and port and requires a response from the node, not a simple ICMP reply from the network interface.
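
If you want to test reachability yourself, a plain TCP check from a machine outside your network is a reasonable approximation (the hostname and port below are placeholders - use your own DDNS name and the port your node advertises; 28967 is the default):

bash

# Run this from OUTSIDE your network against the node's advertised address/port
nc -vz your-node.example.com 28967
# Note: this only exercises TCP; the QUIC/UDP path is not covered by nc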

Check whether you have any DDoS protection active in your router or on your machine, and disable it.
It may be called DDoS protection, prevention, or something like that.
If the dashboard looks OK, maybe there is some bug in @BrightSilence’s script (I doubt it).

1 Like

You may also see the audits count requesting the node’s API:

PowerShell

foreach ($item in ((Invoke-WebRequest -UseBasicParsing http://localhost:14002/api/sno).Content | ConvertFrom-Json).satellites.id) {
    Write-Host $item, (((Invoke-WebRequest -UseBasicParsing http://localhost:14002/api/sno/satellite/$item).Content | ConvertFrom-Json).auditHistory.windows | Measure-Object -Property totalCount -Sum | Select-Object -Property Property,Sum)
}

bash

for item in `curl -sL http://localhost:14002/api/sno | jq '.satellites[].id' -r`; do
    curl -s http://localhost:14002/api/sno/satellite/$item | \
    jq '{id: .id, totalAudits: (.auditHistory.windows | reduce .[] as $item (0; . + $item.totalCount))}'
done

You can also check a single satellite directly in the browser; just change the machine IP to yours:

http://192.168.1.10:14002/api/sno/satellite/12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs

1 Like

Definitely a possibility on those - I have the node running Folding@Home in the background. I’ve set it to back off a little and reserved a couple of threads for Storj, and I haven’t seen those errors since.

This seems fine - I have OPNsense as a firewall to portion off my segment of the LAN, which sits in the DMZ of my router. I’ve checked both and don’t see any logs that look like Storj traffic being blocked. It’s possible it’s happening on the ISP end, but I’m on Virgin Media, one of the largest ISPs in the country, so I’d have assumed it would have been mentioned on here before now (and if Storj traffic were being blocked, it would affect all nodes). I know they do filter some sites, but that’s usually only pirate sites with a court order.

Great tip, I didn’t know about this! Here’s the output from Powershell:

12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo @{Property=totalCount; Sum=11}
1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE @{Property=totalCount; Sum=26}
121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6 @{Property=totalCount; Sum=23}
12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S @{Property=totalCount; Sum=46}
12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs
12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB

Looks like eu-north-1 (expected) and eu1 are at zero.

And the output for this, in a paste: https://pastebin.com/yUpjuKJy

@Mircoxi I have no explanation for why you’d be seeing zero audits from eu1. It’s still doing audits, at least, and it’s seeing successes and failures at about the same ratio as all the others. eu1 is even doing audits at a higher rate than the others. Could you paste your node ID, and I’ll see if there’s anything funny going on on our side?

Sorry for the delay! I followed some instructions on using the identity binary to do that, and got this:

base58-check node ID:   11Qdygib3CQMf1SkuGrsjwR8SwVP6SGdNfCwhVSusZYDdChHJX

Hopefully that’s the correct one. I ran the earnings script again to check if there was any movement and, unfortunately, there isn’t - us2 isn’t moving, but I understand that satellite has been / is being shut down; the others are progressing at a nice and steady rate. The dashboard still shows 100% health, and plenty of incoming data for it.

The node ID is displayed in the web dashboard, in the top field.
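
If the dashboard is awkward to reach, the local API should report it too - the field name here is from memory, so treat it as a sketch:

bash

# Read the node ID from the local dashboard API (field name from memory)
curl -s http://localhost:14002/api/sno | jq -r .nodeID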

1 Like

Oops! Somehow missed that, my bad. Either way, the one there is correct, it shows as 11Qdygib3CQMf1SkuGrsjwR8SwVP6SGdNfCwhVSusZYDdChHJX.

My mistake there!

You’ve got it right, that’s a valid and existing node ID. And I do see something odd going on with your node on eu1. I’ll be investigating but it might take me a bit. Check back in tomorrow?

2 Likes

An update: the weird thing I saw was a result of never having had an audit from eu1, not the cause of it. The eu1 satellite really has never gotten any audit results from your node. I had forgotten that was even a state that was possible.

So I’m investigating why that is the case. The only theory I have that fits all the evidence is that the eu1 auditors can’t contact your node for some reason. This seems unlikely, especially since the auditor api server can contact your node, but it’s all I’ve got so far.

Update 2: It may actually be plain bad luck. I initially discounted this because you have even fewer pieces on ap1 and you have 32 audits there already (if I’m reading that output right). But ap1 has significantly less data than eu1, and it’s going through audits at the same rate, so in effect you should expect lots more audits from ap1 than from eu1 as things stand. Even so, I wouldn’t expect this much difference.
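
To make the intuition concrete: the expected audit count is roughly the satellite’s overall audit rate multiplied by your node’s share of that satellite’s data. A toy calculation with entirely made-up numbers:

bash

# Every figure below is made up, purely for illustration
audit_rate=1000                       # audits per hour the satellite performs overall
my_ap1=50000;  total_ap1=500000000    # node holds a larger share of ap1's (smaller) data set
my_eu1=80000;  total_eu1=4000000000   # ...and a much smaller share of eu1's
echo "expected ap1 audits/hour: $(echo "$audit_rate * $my_ap1 / $total_ap1" | bc -l)"
echo "expected eu1 audits/hour: $(echo "$audit_rate * $my_eu1 / $total_eu1" | bc -l)"

With those numbers ap1 would be audited about five times as often, even though the node holds fewer pieces there.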

3 Likes

Oh, interesting! The ones that make no sense are always my faves to debug.

I checked the output of the earnings script again, and yeah, ap1 is getting plenty of audits - sitting pretty at 42 right now, with 0 still for EU. Saltlake has 60, and us1 has 77.

The only noticeable change I can see right now is that the node updated to 1.83.2 at some point while I was asleep (it had been on 1.82.1 before), but I’m still seeing steady ingress and egress to/from eu1 on my end in the node dashboard, and the Grafana dashboard and earnings script output are showing a corresponding increase in expected payout.

It’s a long shot (and I don’t see how it could be this, since other satellites are working), but could there be some weird issue with the image TrueNAS Scale uses to spin up the container? Looking at the apps dashboard, the image has a listed version of v1.68.2_1.0.15 - the 1.0.15 seems to be something internal to TrueNAS, since many of the apps have it as part of the name (Helm chart version, perhaps?), but the v1.68.2 seems to imply that whatever base image they use is significantly out of date. Obviously the container updates the binary itself, and I’m new enough to the project that I don’t know whether there have been any major changes that might cause issues for something this old, but running under TrueNAS does seem to be less popular than running raw Docker.

And of course, thanks for all your help so far!

3 Likes

The storagenode image contains only the supervisor and the downloader on top of a base OS layer, so the only updates it could need are OS updates. The storagenode binary and the updater binary are downloaded on start, so your node should be up to date.
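
A quick way to confirm which version is actually running is the local dashboard API - the field name here is my recollection, so verify it:

bash

# Report the storagenode version the node itself advertises
curl -s http://localhost:14002/api/sno | jq -r .version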

I would check for possible filtering in your network (or even on the ISP side) regarding access from that satellite’s IPs.
The data flows directly between your node and the customers, so you will still see uploads and downloads - that is not related to your node’s availability to the satellite.
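
For an outbound check from the node’s host, a plain TCP test toward the satellite can help rule out filtering - the address below is my assumption of eu1’s public endpoint, so please verify it independently:

bash

# Outbound TCP check toward the assumed eu1 endpoint
nc -vz eu1.storj.io 7777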

The interesting part is that if your node did not answer audit requests, it would be considered offline and would still get records in the audit history; if it did answer them, it would also have these records.
So I cannot imagine what could be wrong with your configuration, and thus I suspect a bug somewhere.