Fast upload bandwidth has the potential to greatly impact earnings for storage node operators.
The numbers behind suspension, audits and the “online” calculation are well documented, transparent and clear.
But what about internet speed? How exactly does Storj evaluate my node’s speed, and what impact does it have when choosing whether to store a piece here or on some other node? Are any speed checks performed to determine it? Or does it purely rely on the “race”, where my node is either fast enough to store/provide the piece first or falls into “download/upload failed” once someone else is faster?
The answer to your last question answers all the others as well. Yes, it relies purely on the race. But you don’t just lose the cancelled and failed uploads, you also lose many uploads that show up as finished in the logs. From my experience, the impact of those less visible losses is probably higher than that of the visible ones. This happens when your node finishes the transfer before it receives the cancellation message, but still wasn’t among the 80 fastest nodes. And it’s very common.
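To make the race a bit more concrete, here is a rough simulation sketch. Only the "80 fastest nodes" figure comes from the post above; the number of nodes that start each upload (`n_started=110`) and the log-normal spread of competitor finish times are assumptions picked for illustration, not measured network behaviour.

```python
import random

def simulate_race(my_time, n_started=110, n_kept=80, trials=10_000, seed=0):
    """Estimate how often a node finishing in `my_time` seconds keeps its piece.

    n_started and the log-normal distribution are illustrative assumptions;
    n_kept=80 is the figure quoted in the thread.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Hypothetical finish times (seconds) of the competing nodes.
        others = sorted(rng.lognormvariate(0.0, 0.5) for _ in range(n_started - 1))
        # Finish time of the 80th fastest competitor: beating it means at most
        # 79 competitors were faster, so this node is among the 80 kept.
        cutoff = others[n_kept - 1]
        if my_time < cutoff:
            wins += 1
    return wins / trials

if __name__ == "__main__":
    for t in (0.5, 1.0, 1.5, 2.0):
        print(f"finish time {t:.1f}s -> kept in {simulate_race(t):.0%} of races")
```

Note that the node never sees `cutoff` in this model: it only knows its own finish time, which is why a transfer can look finished locally and still end up in trash later.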
Due to an unfortunate SSD failure I’m temporarily running 4 nodes on the same array without an SSD cache (this is a very bad idea, don’t do it) and I can see this effect happening due to lagging write performance on my nodes. Trash is much higher and the nodes barely fill up anymore. I have ordered new SSDs to resolve this asap (hopefully this weekend).
Because it’s a race against other nodes, it’s hard to exactly quantify the impact. There isn’t a set speed I can tell you after which you start to lose races. And as per my case above, this isn’t solely dependent on connection speed. It’s latency + throughput + IO performance + system overhead (CPU IO wait queue for example). But my unfortunate issue is starting to make it very clear that having bad performance in the upload chain has a big impact on how much data you get to keep on your node.
You mean you have an SSD read-write cache for all your nodes?
Do you have some numbers (e.g. the payout per TB stored per month) with and without the cache?
Or do you have only the databases on an SSD?
I used to have 4 nodes running entirely on the SSD cached array + 8 more nodes on external HDDs for which the databases were on that array as well. I have since moved those databases to the individual node disks to alleviate some of the write load on the array now that it doesn’t have an SSD cache anymore.
The most telling number I can share is that the trash on my larger nodes (3TB to 20TB per node, depending on the node) has gone up from always around 60GB to 180-200GB. That is more than 120GB of trash on top of what I used to have, which means I’m losing 120GB MORE of my ingress than I used to. Since trash is deleted after a week, that means losing out on about an additional 500GB of ingress per month, which is about 2/3rds, since last month I had 750GB ingress total on my main node. Needless to say, my setup is not exactly optimal at the moment, which is why I’m trying to fix it as soon as possible.
Considering that about 15% of data ingress each month also gets egress, that amounts to a loss of about 75GB on egress as well. So yes, bad ingress performance can have a big impact on payouts.
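The back-of-the-envelope math above can be written out as a small sketch. The input figures are the ones quoted in the posts; the 30-day month and the function name `ingress_loss` are assumptions made for the example.

```python
def ingress_loss(extra_trash_gb=120, retention_days=7, days_per_month=30,
                 monthly_ingress_gb=750, egress_fraction=0.15):
    """Estimate monthly ingress/egress lost to pieces that end up in trash.

    Trash is emptied after `retention_days`, so the extra trash level turns
    over roughly days_per_month / retention_days times per month.
    """
    lost_ingress_gb = extra_trash_gb * days_per_month / retention_days
    lost_share = lost_ingress_gb / monthly_ingress_gb
    lost_egress_gb = lost_ingress_gb * egress_fraction
    return lost_ingress_gb, lost_share, lost_egress_gb

if __name__ == "__main__":
    lost, share, egress = ingress_loss()
    print(f"lost ingress: ~{lost:.0f} GB/month")      # ~514 GB, rounded to 500 in the post
    print(f"share of kept ingress: ~{share:.0%}")     # ~69%, i.e. about 2/3rds
    print(f"lost egress: ~{egress:.0f} GB/month")     # ~77 GB, rounded to 75 in the post
```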
Do you have any ‘control’ nodes? Your ‘trash’ experience from the last month or so has been very similar to that of my node storing 16TB. Could it just be the current network usage? My node never had an SSD cache.
I have the same trash experience this month as well. Two spikes of 3 times and almost 4 times the amount I normally have across all nodes (individual HDDs, no cache). I’ve also made no changes this month. (I exited most satellites in November, but did not experience the trash increase until this month, so I consider it unrelated.) Plus ingress has been lower.
How do your cached nodes do against your uncached nodes? Success rates and general growth rates? I’m curious how much an SSD cache helps nodes on individual HDDs vs the cost of the SSD and its wear as a cache.
I can say that my external and internal nodes used to perform quite similarly, but the external nodes weren’t entirely uncached since their DBs were on the cached array. I haven’t really checked the numbers because I didn’t split up the log files before and after the cache failure, but I can easily tell from just following the logs that I have a lot more cancelled transfers now than I used to have.
Can the internet downstream speed aka ingress affect the maximum blob file size on the node?
I suspect so: with 100 Mbit/s I see no blob bigger than 2.266 MB on the node. This would be really bad for the number of files on the disk with slower internet, and maybe big disks on a slow connection become unmanageable… resulting in the “timeout error”.
I don’t think I’m getting what you’re saying. That is the maximum size for pieces to begin with (64MB/29).
It’s in theory possible that slower connections have more trouble winning races with larger pieces. But I don’t have any way to collect stats on that. I think latency will remain the biggest factor, as opposed to throughput.
64MiB is the default segment size, so the maximum size of one piece is 64MiB/29 (because any 29 of the 80 pieces are enough to reconstruct the segment).
Of course the customer may upload smaller segments.
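For reference, the piece size arithmetic works out as follows. This is a simplified sketch that only divides the segment size by the number of required pieces, using the 64MiB and 29 figures from the post; it ignores any padding or per-piece overhead, and the function name is made up for the example.

```python
MIB = 1024 * 1024  # bytes in one mebibyte

def max_piece_size_bytes(segment_size_bytes=64 * MIB, required_pieces=29):
    """Upper bound on piece size: the segment is split so that any
    `required_pieces` pieces are enough to rebuild it."""
    return segment_size_bytes / required_pieces

if __name__ == "__main__":
    size = max_piece_size_bytes()
    print(f"{size / MIB:.2f} MiB")   # about 2.21 MiB
    print(f"{size / 1e6:.2f} MB")    # about 2.31 MB
```

A smaller segment simply gives proportionally smaller pieces, e.g. `max_piece_size_bytes(1 * MIB)` for a 1MiB segment.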
In a file storage system where you don’t need databases to track the files you store, and the filesystem alone is enough to do the job, an SSD for databases is useless.
But in Storj we store objects, aka pieces of files, that can be tracked and retrieved only by storing their whereabouts in databases. This adds a lot of I/O on top of the normal I/O for reading/writing the pieces, which can be sped up by moving the databases to an SSD.
I hope I understood this right.
Storj does not use databases to track pieces. It was done that way before but turned out to be horrible and unreliable, because it turns out you can’t trust folks to have a setup that does not corrupt data. Today pieces are only tracked by the filesystem (a form of CAS).
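As a generic illustration of tracking pieces by the filesystem alone, here is a minimal sketch in which the file path is derived directly from an ID, so no lookup database is needed. The directory layout, the base32 encoding and the names `piece_path`, `namespace` and `piece_id` are all hypothetical and are not taken from the storagenode code.

```python
import base64
from pathlib import Path

def piece_path(root: Path, namespace: bytes, piece_id: bytes) -> Path:
    """Map (namespace, piece_id) to a file path. The layout is hypothetical."""
    def b32(raw: bytes) -> str:
        # Lowercase base32 without padding keeps names filesystem friendly.
        return base64.b32encode(raw).decode("ascii").rstrip("=").lower()

    name = b32(piece_id)
    # A two-character prefix folder keeps any single directory from growing huge.
    return root / b32(namespace) / name[:2] / name[2:]

if __name__ == "__main__":
    path = piece_path(Path("blobs"), b"example-namespace", bytes(range(32)))
    print(path)
    # "Do I have this piece?" becomes a plain filesystem check, no database.
    print(path.exists())
```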
The SQLite databases the node manages are only used for local accounting and showing pretty plots. They are useless otherwise. I honestly have no idea why there is no switch to turn them off and stop wasting precious IOPS on data nobody ever looks at.
It’s so obvious, because you can delete the databases and everything keeps working; I don’t know why I didn’t realise it.
You are right; there should be a switch to kill them all if you don’t want that info; you’d get rid of malformed databases and a lot of I/O.
… So how does the satellite address a piece on your HDD? By filename and folder?