Satellite seeing heavy use, but not getting audits

I wonder if he gets data from the unvetted pool or the vetted one on that sat. If he gets data from the vetted pool, it’s like he skipped the vetting process, and that would be a very big bug.

I think it’s unvetted. And it will stay there indefinitely until it starts to get audits.

@Mircoxi you could start a second node on that machine, whitelisting only the eu1 sat. If it gets audits, then your network is fine and there is a problem with the first installation.
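
If you do try that, here is a rough sketch of the relevant bits of the second node’s config.yaml (assuming the storage2.trust.sources option can be pointed at a single satellite URL; the satellite ID, domain and ports below are placeholders, and the test node would also need its own identity and storage directories):

```
# config.yaml of the test node (example values only)
# trust only the eu1 satellite; replace <eu1-satellite-id> with the real ID
storage2.trust.sources: "<eu1-satellite-id>@eu1.storj.io:7777"

# use different ports than the existing node so the two don't clash
server.address: ":28968"
contact.external-address: "your.ddns.example:28968"
```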

It might not behave the same, because the port will be different.
I would like to ask you to change the external port and all related settings in the router, firewall and docker run command (or config.yaml in the case of Windows/Linux GUI nodes), and then check whether it starts to receive audit requests.
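
For the config.yaml case, a minimal sketch of the two settings involved (the domain and the port 28970 are placeholder examples; the router would forward the new external port, TCP and UDP, to the unchanged internal one):

```
# config.yaml (example values only)
# the address advertised to the satellites; forward this port (TCP and UDP) in the router/firewall
contact.external-address: "your.ddns.example:28970"

# the port the node actually listens on; leave it at the default
server.address: ":28967"
```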

Just bumped the port number - I’ll let it run for a few hours and give it a look in after work. Thanks!


So, this seems to have broken things even more… With the port number changed, I see a lot of connection refused errors from the satellites in the logs. However, now that I’ve changed it back to the default port, it can’t get any pings any more either, giving timeout errors. The original firewall rule stayed intact, and the new ones were created using the duplicate feature in OPNsense, so I’m very confused right now. It’ll probably be a few hours before I can investigate properly, but it’s very weird that known good settings are suddenly broken…

Actually, checking again, it looks like TrueNAS suddenly doesn’t want to give the container a UDP port. There’s an update out anyway, so I’ll install that and reboot the NAS to see if it starts working on the default port again, then try a different one. To double check while that goes through - am I right in assuming the main setting I need to change is the external address in the config, to e.g. storage.domain:28970 or some such? My understanding is that the listen port setting in the config is the internal container port, but it never hurts to double check!

Edit 2: Confirmed the issue by duplicating my Plex firewall rule and changing the port number. Port checkers online can see Plex, but not Storj. Weird.

Edit 3: TrueNAS rebooted and the container has UDP again, but still timing out on satellite pings on the default port. Dashboard shows as offline. A different port checker is showing the port as open.

Edit 4: Got some traffic on the default port, trying a different port again. It took about 5 minutes for the dashboard to show everything was okay though, so I’m wondering if something is weird with the container after all…

Edit 5: It may be TrueNAS; it seems to be very inconsistent about actually giving the container a UDP port. I’ve managed to get traffic by forwarding external port 28970 to the NAS and having the container run on the default ports (editing the external address in the config); I just need to wrangle TrueNAS into actually opening the ports properly now. I could also set up another machine to run Storj itself and use an NFS mount for the files, but I feel like that’d impact performance a bit.

Satellites work with a node selection cache that caches the last known IP:port combination, so it can take a while for traffic to start flowing on a new port. Keep it online for at least 5 minutes. Though if your port doesn’t show as open on port checkers, you have a different issue with that port. For a docker setup you only want to change the first port in the port mapping of the docker run command, so it looks like 28968:28967. Don’t change the listening port in the config.yaml; only change the port in the external address of your docker run command and in the port mappings mentioned above (for both TCP and UDP). If the port checkers then see the port as open, just wait until you see traffic.
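
To make that concrete, here is a sketch of such a docker run command, based on the standard setup (wallet, email, paths and the domain are placeholders, and 28968 is just the example external port):

```
docker run -d --restart unless-stopped --stop-timeout 300 \
  -p 28968:28967/tcp \
  -p 28968:28967/udp \
  -p 127.0.0.1:14002:14002 \
  -e ADDRESS="your.ddns.example:28968" \
  -e WALLET="0x..." \
  -e EMAIL="you@example.com" \
  -e STORAGE="2TB" \
  --mount type=bind,source=/path/to/identity,destination=/app/identity \
  --mount type=bind,source=/path/to/storage,destination=/app/config \
  --name storagenode storjlabs/storagenode:latest
```

Only the left-hand (external) side of the two 28968:28967 mappings and the port in ADDRESS change; the container keeps listening on 28967 internally.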


I’m more sure about the “bad luck” theory now.

Eu1 is sending significantly fewer audits to new nodes than us1 or ap1, even at comparable piece counts. It’s doing more audits overall, but eu1 has the highest total number of pieces, so new nodes have a comparatively smaller share of the overall audit pie.

There are other nodes that have gone without audits for a similar amount of time and have a similar piece count. If I graph node lifetime versus piece count with color being the number of audits, it’s clear that your node is taking longer than average to get an audit, but it’s not very far away from the norm.

From this I infer that (a) you should get an audit in the next few days, and (b) we need to spin up more auditors on eu1 so that it doesn’t take so long.


Why don’t you make the audit process work like the storage nodes, and give node operators the opportunity to run auditors for some revenue? It would be much cheaper for Storj and much more efficient, and there are plenty of possibilities to make it work in a zero-trust way. Node operator traffic is much cheaper. For example, you could give the same task to several auditors and compare the results.


Yes, we’d like to do that! It will take a lot of careful engineering to make sure it isn’t abused immediately, though. And there are other features and performance improvements that need to come first to keep the customers and SNOs happy.


There could be some agreement with SNOs for that, which could resolve some of this stuff.

No need for agreements. Just choose a random node that has piece A, replicate it to a node that is vetted, and audit its existence and integrity from time to time with a hash of that piece. The auditor node should be at least 1 year old, with no Graceful Exit active on the sat the piece came from, and it should not replicate more than one piece for a particular audited node.
The process of choosing auditors should, in the long run, randomly pick every node older than 1 year, and should try to keep an even balance of pieces replicated and audited between auditors.
The audited node should receive pieces from all over the world. All this auditing should last for a month for every new node, for every satellite.
The number of audits should not be a fixed value.
Every auditor, over the course of a month, should try to audit a piece at a variable interval within a window of roughly 1 audit per 12 hours. After 30 days, the satellites draw the line and check the number of audits sent by auditors and the number of audits passed by nodes.


Aah, interesting and fair enough! The port change doesn’t seem to have done anything so this seems right - the other nodes are getting audits at about the same rate as before (us1 only has a couple more until it’s done, assuming 100 is the benchmark), but eu1 is still sitting at zero. Looks like I’ll just have to wait it out after all! I’ll go ahead and change the port back to the default (purely so I don’t spend hours trying to figure out the issue in six months’ time when I inevitably get the urge to tweak my network and don’t look at the right firewall columns) and check it every so often. If it’s still at zero in a couple of weeks I’ll report back.

Thanks for all your help!


Please do not do that; it will stop working sooner or later. See Topics tagged nfs

Would not work for several reasons:

  • we do not use replication, each piece is unique;
  • if this piece is not registered on the satellite as stored on the checked node, it will be removed by the garbage collector;
  • a malicious user could detect that the same pieces are checked every time and make sure that those pieces stay around forever while removing the others;
  • the storagenode doesn’t store a piece’s hash, it calculates it on the fly, so if the checker node doesn’t have this piece, it cannot validate it on the other node, and since all pieces are unique (you cannot register the same piece for two nodes), one of them will inevitably be removed.

But you probably mean not to clone the piece, but to move it. This could work, but then the checker node must store its hash somewhere to be able to validate, and this also means that the checker node would rely only on the hash (the auditor also downloads and validates some random part of the piece to make sure that the piece is retrievable).
It also should not move a piece to a node that already stores a piece for the same segment, so it needs coordination with the satellite to select the right node. I do not think it would put less load on the satellite in this case; more likely it would be much more than with trusted auditors.

But the idea is nice, and if we still used Kademlia, it could work without involving the satellite.

In the current implementation it would probably be done in a different way: the satellite gives the node a job on its check-in, with a list of nodes and pieces, and then the node submits a signed order with the results when the job is completed.

I imagined it like this: during the auditing month, the new nodes should not hold pieces from the set of 80, just copies that can go to waste and don’t need repair on other nodes if the new node stops working, because the presumption is that a new node can be faulty and crash, and doing repair (recreating the lost piece on another node) consumes resources and Storj money.
But you are right, this would require too much change to the whole ecosystem, and doesn’t work with the current network rules.
But leaving this part aside, the rest of the plan could work.

We do not have copies anymore. They existed on v2 though, alongside erasure coding, but v2 did not have a repair worker, since all orders expired after 90 days and the customer had to track this and re-upload files to not let them disappear. Almost like what you get now if you use Filecoin :rofl:. Cheap (regarding code, but not used space!) and never reliable.

Now we use only erasure codes, reliable repair workers and auditors, so there is no need to have copies anymore. And orders do not expire unless the customer explicitly specifies that during upload.

Checking back in with some good news - almost exactly as us1 finished auditing yesterday, eu1 started! Currently sitting at 10 audits in 24 or so hours. Turns out I really was just unlucky. Thanks to everyone who helped!


This doesn’t solve the general problem. Storj must step in and make some modifications.

An interesting, possibly unrelated (but I’m not sure) issue - eu1 suddenly shows an online score of only 75%. This seems to have happened over just a couple of hours, since it was at 100% when I made my last post. Could it be that it was trying to send audits the whole time but they were getting blocked somewhere, and are now counting as the node being offline?