So I thought I would post an interesting observation of my nodes usage over the past 3 months, and reach out to Storj on this behaviour to get their view.
In all of this post we are ignoring repair traffic and Dev load data, as it skews everything.
The issue we are focusing on is "The successful storage nodes to store the segments from the list generated by the satellite, will be selected by the customer who has the best connection, and performing node - you will get cancelled early / late drop of client connection if your node doesn’t respond quick enough "
TL;DR - 1 ) My node it no longer see the above as true, with Gateway-MT - it is strongly biased towards nodes being geo-located closer to these Gateway-MT servers from the list of nodes returned by the satellite. The connection is no longer from client to SNO, but instead client from anywhere in world, to a few Gateway-MT servers located in highly performant, but centralised datacentres, and then from the Gateway-MT you will get the connection to the SNO. This will bias the SNO’s who are geographically, and ISP connection wise closer to the Gateway-MT servers impacting the geographical distribution of nodes, that are on more latent links are can’t respond as quickly.
-
It’s now impossible to geo-block countries on my firewall for uploading data to my node, as it now mostly all appears from the pool of Gateway-MT servers - How will Storj deal with local country compliance of laws when nodes can’t filter themselves ?
-
Gateway-MT is clearly really popular (good), but for me I love Storj for the decentralized nature, but we now have a highly utilised, centralized single points of failure introduced, that can impact on node selection, and piece distribution. S3 compatibility for me is definitely more of a backup / long term storage option, as apposed to low latency transactional access - how long before we have a Beta of Gateways that can be hosted by SNO’s ?
/end TL;DR
More info
The information I’m quoting if from my own node experience, which I accept might not be typical for the network, and I accept there are mistakes in assumptions and my English so sorry for that - my setup is very basic, but it manages to work and I can see from forum there are far more advanced setup’s out there, definitely at enterprise scale.
background (not complete)
All my node traffic is pre-processed and logged by pfSense firewall cluster, and sent to my dedicated logstash cluster to process (not just for storj )
I’ve got some logstash rules on there to enhance the metadata around IP address, and state with advanced GeoIP among other bits.
I’ve also got my Storj node docker logs integrated into Logstash using a docker shim and filebeats.
The output is sent into an Elastic cluster, and visualized on my Kibana server.
The purpose of all the log processing was to allow for some work I am doing on IDS, to allow for dynamic firewall control around block/allow and also around packet shaping for traffic based on a more intelligent rules engine (not relevant, but info on why I’ve bothered )
One of the side effects of the above project, is I’m able to with a high degree of accuracy map a client connection to a Storj nodes action, be it put / get / delete and further more map it to the satellite and segment ID, and ultimately the file location.
#cut
So the purpose of post was to say 3 months ago, my cancel / fail rate was 0.1-0.9% over a 1 week period
My traffic was predominantly from;
- uplink clients, or self hosted S3 gateway from all over the world ~ 70% traffic
- transfer.sh ~ 10% traffic
- gateway-MT ~ 20% traffic
Now, Aug 21 this traffic has changed really excitingly, but also with consequences ;
My traffic is now ;
- uplink clients, or self hosted S3 gateway from all over the world ~ 30% traffic
- transfer.sh ~ 5% traffic
- gateway-MT ~ 65% traffic
It is great to see Gateway-MT being so popular, but the usage of a node for S3 data which seem to be very much cold storage isn’t great
- failed / cancelled uploads
- big segment uploads from a small pool of IP’s that we can’t control who is client
- Huge deletes of data after 7/14/30 days so assuming backup jobs
- No / limited Egress traffic as there is no need to read the data as it’s probably backups
- Very full Trash cans of GB’s of data, that we aren’t compensated for when the jobs deleted.
- Excessive disk Iops from the move to trash code >< I get it, but when we are talking GB’s into the trash, that is a write / read that wears on the disk not to mention the fragmentation.
Discuss
CP