I wouldn’t put exact technical solutions in the ToS, otherwise we would need to update it with every software release. The legal process is far slower than our delivery schedule, so the ToS would become outdated even faster.
The main technical document is the Storj Whitepaper V3, and one of the legal documents is the Node Operator Terms & Conditions. They should be aligned, but it’s better that neither one spells out the exact solutions of the other.
We can spend hours together formulating terms and conditions, but they will likely look different in the resulting ToS. I am not a lawyer, and I think you aren’t either (on the other hand, who would know better how to evade the rules? Just thinking out loud, no offense intended!)
I would prefer that we, as a Community, find a technical solution instead. Even just ideas would be better than trying to invent rules without a technical solution behind them.
For example, how can we avoid forbidding the use of different IPs in the same location while still making sure that pieces of the same segment are not placed on such nodes?
I read at least one possible idea here: use latency. But how do we measure it if the node is contacted by different clients around the globe?
One of the technical possibilities is to track uploads to nodes by their ETH address, for example.
It is technically possible to create a separate address for each node, but in practice it is tedious, and then again the payout threshold says “you shall not pass”, so operators will pool nodes for payouts (which is what is really happening right now).
This is technically not true. Any company can ban or allow any form of speech they want on their own platforms. This is luckily not the stance Storj has taken, but they don’t need ToS to cover for that. It would be useful to have ToS to back them up in case they want to ban nodes from the network for “abuse” though. I think that part is a little less clear-cut.
That said, I agree about the tone for the most part. Though I can see why people trying to gain an advantage over other SNOs with node selection can be considered cheaters, I think Storj Labs using those terms isn’t currently warranted, considering their so far hands-off stance and the lack of specific terms limiting that behavior.
Yeah, agreed. In the end any term covering this will be very boring and basically just come down to “Node operators will not attempt to circumvent data distribution in order to get more data or bandwidth”. I’m not even going to bother putting that in better legal language because it’s not interesting, and in the end it isn’t a solution to begin with unless there is a technical implementation of a filtering system to either prevent it or ban nodes that do it anyway.
If the benefit of multiple IPs is removed there is no incentive to use them anyway. Might as well just use a single IP then.
I don’t think that there is a very good technical solution. At least I can’t think of any. Latency is too unreliable and you run the risk of affecting honest node operators. Filtering by email or ETH address is too easy to work around. Requiring KYC and providing unique operator keys is too intrusive and creates a barrier to entry that many will simply never accept. Blacklisting VPN/VPS IP ranges is an endless cat-and-mouse game that even giants like Netflix are losing, and it would kick out legitimate node operators trying to work around CGNAT as well. Using some of the above (and other signals, such as coordinated downtimes, shared or similar domain names, etc.) to manually find and ban abusers is too cumbersome and time-consuming, and it won’t get rid of all of them either. People will just get smarter at hiding it.
Maybe there is something I’m overlooking, but none of these sound like acceptable solutions to me.
In any setup and arrangement there will always be a small number of participants abusing the system. They will always be there and their number will always be small.
Maybe the technical solution here is not to try to enforce 100% honesty, but instead to tolerate a small percentage of cheaters? For example, increase redundancy slightly.
I’m sure Storj can ballpark the number of suspiciously correlated nodes. I don’t think it will be high enough to justify spending engineering time on squeezing out those exponentially harder last drops of compliance.
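To put rough numbers on the "tolerate some cheaters, add a bit of redundancy" idea, here is a minimal sketch. Everything in it is an assumption for illustration: the k-of-n values are example numbers, `shrinkage` is a made-up parameter for the fraction of pieces we simply write off as sitting in one hidden failure domain, and the remaining pieces are assumed to fail independently, which real nodes do not.

```python
from math import comb, floor


def segment_loss_probability(n: int, k: int, p_fail: float,
                             shrinkage: float = 0.0) -> float:
    """Probability that fewer than k of n pieces survive.

    Worst-case model (an assumption, not Storj's actual durability math):
    a `shrinkage` fraction of the pieces is written off entirely, and the
    remaining pieces each fail independently with probability p_fail.
    """
    independent = floor(n * (1.0 - shrinkage))
    if independent < k:
        return 1.0  # not enough independent pieces left to rebuild
    p_ok = 1.0 - p_fail
    # Sum P(exactly i pieces survive) for i = 0 .. k-1.
    return sum(
        comb(independent, i) * p_ok**i * p_fail**(independent - i)
        for i in range(k)
    )


if __name__ == "__main__":
    # Illustrative numbers only: 29 pieces needed, 80 or 84 stored.
    for n in (80, 84):
        for s in (0.0, 0.05):
            loss = segment_loss_probability(n, 29, p_fail=0.5, shrinkage=s)
            print(f"n={n} shrinkage={s:.2f} loss={loss:.3e}")
```

The point is only the comparison: writing off a few percent of pieces raises the loss probability, and a few extra pieces per segment can pull it back down, without having to identify a single cheater.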
The Operator can run different nodes in different places, so we cannot assume that all these nodes are in one physical location or on the same hardware.
It is not our goal to pay less to multi-node operators; we want customer data to be secure.
I run 8 machines, each in its own subnet and in a different physical location, but all within a small geographical area. Internet outages and blackouts are pretty common, and they take out my nodes in groups or all at once. Only in rare cases do they take out just one.
So to distinguish me from an abuser, there should be a history of maybe 3-6 months that tracks the outages. If over 6 months some nodes of the same operator always go dark at the same time and never go dark independently, then you can assume that they are in the same location, or in an area small enough to be equivalent to the same physical location.
In the end, the physical location doesn’t matter at all, just the times a node is online and offline. If some nodes spread across the globe ALWAYS go dark at the same time, then from Storj’s point of view, in terms of data availability for customers, those nodes can be treated as one node.
Another way of identifying a location would be the MAC address of the router and the machine, but I don’t know if that is practical.
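The outage-history idea above could be sketched roughly like this. It is only an illustration under assumptions: each node's downtime is recorded as a set of hour indexes over the observation window, and the names `offline_hours`, `threshold` and `min_hours` are hypothetical, not anything the satellite actually exposes.

```python
from itertools import combinations


def downtime_jaccard(a: set[int], b: set[int]) -> float:
    """Share of offline hours the two nodes have in common (0.0 to 1.0)."""
    union = a | b
    if not union:
        return 0.0  # neither node was ever offline: no evidence either way
    return len(a & b) / len(union)


def correlated_pairs(offline_hours: dict[str, set[int]],
                     threshold: float = 0.9,
                     min_hours: int = 5) -> list[tuple[str, str, float]]:
    """Return node pairs that almost always go dark together.

    Pairs with fewer than `min_hours` combined offline hours are skipped,
    because one shared outage is not enough history to judge.
    """
    pairs = []
    for n1, n2 in combinations(sorted(offline_hours), 2):
        h1, h2 = offline_hours[n1], offline_hours[n2]
        if len(h1 | h2) < min_hours:
            continue
        score = downtime_jaccard(h1, h2)
        if score >= threshold:
            pairs.append((n1, n2, score))
    return pairs


if __name__ == "__main__":
    history = {
        "node-a": {1, 2, 3, 10, 11},
        "node-b": {1, 2, 3, 10, 11},   # always dark together with node-a
        "node-c": {1, 2, 50, 51, 52},  # shares one outage, also fails alone
    }
    print(correlated_pairs(history))
```

A score of 1.0 matches the "never go dark independently" case. A node like node-c that shares a regional outage but also fails on its own scores much lower, which is the distinction between a neighbour and the same machine.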
I mean, that’s why I said to use multiple signals. Coordinated downtime might be a strong predictor, but ISP maintenance can cause this as well. As can regional power outages. It might be worth using data that is quite obviously pointing to people running multiple nodes to train statistical models to find other such signals.
Unfortunately no single signal is a black and white indicator.
Well, writing a script to randomly shut down nodes for random intervals is not that hard.
This is likely the path forward, but it will require a massive number of nodes, a lot of data collection, expertise in machine learning, and a lot of training effort that is really a distraction, all to weed out 1% of cheaters. Which, ultimately, will be impossible.
An example I think of in this context is actually me: I have one node in CA and another node in OR, at my brother’s house. But he is behind CGNAT, so I’ve set up a VPS on Oracle to route traffic.
On my node I don’t want to expose my IP address or mess with DDNS, so I also route traffic to my node through another VPS on Oracle. This allows me to do some interesting things, like LTE failover when my ISP has an outage, without messing with DNS and routing: the node just reaches out to my VPS over whatever connection is available. It’s super handy.
But in this case these nodes look indistinguishable: same number of hops, same (very close) latency, same datacenter, even IP addresses in different /24 segments. I could have been a jerk and run a second node in my house using a different VPS from the same network, and there would be no way to tell. No amount of meta-analysis would have detected this because it’s an absolutely symmetric situation.
Point being, you can’t eliminate 100% of cheaters, so I vote for incorporating some “shrinkage” into the node diversity assumptions and spending the effort elsewhere.
Yes. The /24 subnet limit serves the same purpose: one big ISP that can shut down part of the network should not get more than one piece of the same segment.
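For readers who have not seen the /24 rule spelled out, here is a minimal sketch of the idea of "at most one piece of a segment per /24 subnet". It is not the satellite's actual selection code; `pick_one_per_subnet` and the node IDs are made up for the example, and it only handles IPv4.

```python
import ipaddress
import random


def subnet_24(ip: str) -> ipaddress.IPv4Network:
    """Return the /24 network that an IPv4 address belongs to."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def pick_one_per_subnet(node_ips: dict[str, str], count: int,
                        rng: random.Random) -> list[str]:
    """Pick up to `count` nodes, never two from the same /24 subnet."""
    by_subnet: dict[ipaddress.IPv4Network, list[str]] = {}
    for node_id, ip in node_ips.items():
        by_subnet.setdefault(subnet_24(ip), []).append(node_id)
    # Choose the subnets first, then one node inside each chosen subnet,
    # so nodes sharing a subnet also share that subnet's chance of selection.
    subnets = rng.sample(list(by_subnet), min(count, len(by_subnet)))
    return [rng.choice(by_subnet[s]) for s in subnets]


if __name__ == "__main__":
    nodes = {
        "n1": "10.0.0.5",
        "n2": "10.0.0.77",   # same /24 as n1
        "n3": "10.0.1.5",
        "n4": "192.168.1.1",
    }
    print(pick_one_per_subnet(nodes, 3, random.Random()))
```

Picking the subnet first is also why extra nodes behind one /24 do not bring extra ingress, and why routing each node through a different VPS subnet sidesteps the limit.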
So, I do not see an issue here. Subnet grouping may have a learning curve, as with audit or suspension scores, i.e. it can have a weight between 0 and 1, which can be used as a percentage of the strictness. But I guess it would work more like the online score calculation, i.e. it would depend on some time window, for example.
So, in general I suggest measuring the correlation between nodes. If we connect AI to the process, it would be even more precise, and spikes of downtime wouldn’t affect the “correlation score” too much.
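One possible reading of "a weight between 0 and 1 used as a percentage of the strictness" is sketched below. This is purely an assumption about how such a score could be applied: the `correlation` table, how its values are computed, and the function name are all hypothetical.

```python
import random


def select_nodes(candidates: list[str],
                 correlation: dict[frozenset, float],
                 count: int,
                 rng: random.Random) -> list[str]:
    """Pick nodes for one segment, treating correlation as strictness.

    A candidate is skipped with probability equal to its highest
    correlation score against any node already chosen: a score of 1.0
    behaves like a hard "same group" rule, 0.0 like no rule at all.
    """
    chosen: list[str] = []
    for node in candidates:
        if len(chosen) == count:
            break
        worst = max(
            (correlation.get(frozenset((node, c)), 0.0) for c in chosen),
            default=0.0,
        )
        # rng.random() is in [0, 1), so worst == 1.0 always rejects
        # and worst == 0.0 always accepts.
        if rng.random() >= worst:
            chosen.append(node)
    return chosen


if __name__ == "__main__":
    scores = {frozenset(("a", "b")): 1.0}  # a and b always fail together
    print(select_nodes(["a", "b", "c"], scores, 2, random.Random()))
```

With intermediate scores the rule degrades gracefully: two nodes that merely share a flaky regional ISP are only sometimes kept apart, instead of being banned from ever holding pieces of the same segment.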
To solve the problem that two pieces should not end up on the same machine, I would just filter by email account or payment address, as was proposed here. It makes such a negligible difference (if any) in ingress, even to a large operator, that there won’t be an incentive to work around it. The advantage in preventing file loss is huge.
IMO the same location is not the only problem: multiple pieces should also not end up under the same operator. I remember somebody stopping all 30 of his nodes a few months back, since he didn’t have time for them anymore…
We don’t have accounts for SNOs, and the email address can be omitted from the node’s config. This email is used only for notifications.
Perhaps you are right about the heavy correlation between nodes of the same Operator, but I do not like the idea of limiting by wallet address: it will force them to use a different wallet for each node. That has a couple of disadvantages. They would need to manage multiple addresses to collect the total payout, and it would also increase the time before the first payout, unless they use zkSync.
However, this could also be used as an exploit: spin up as many nodes as possible with different wallet addresses, because each node will be treated as uncorrelated when filtering by wallet.
My thoughts are that the redundancy reasoning around multiple nodes on the one segment doesn’t stack up on a global scale. If Storj were really serious about redundancy and spreading data pieces, then I’m sure they would pay more for nodes in locations that actually offer redundancy than for someone stacking up nodes in one of the highly node-populated EU countries. The location map of nodes gives a glaringly false picture of node redundancy and risk reduction. I can see Putin lobbing a bomb into the power grid of one of those highly node-populated cities, and Storj customers being so impacted that they have no alternative but to leave and go elsewhere, knowing it could happen again. Ukraine, 9th in the node lists, that’s got to be risky. I doubt putting nodes on different segments is really the issue there or thereabouts.
Even if they manage to combat the VPS abuse, I think this doesn’t solve the centralisation problem. The race condition built into the Storj network, storing pieces on the fastest responding nodes, is the main cause. Maybe I’m wrong, I don’t really know how the nodes are chosen by the sats.