Updates on Test Data

I would say testing it in production is secondary. Our primary test is our simulator.

But that is only one part of the equation. We can optimize the code, but we can’t make this work for the dumbest imaginable nodes. We will have to give all of you a chance to verify in advance whether your nodes will be able to keep up with the load. So production it will be at some point, just not now while we are still working on some code changes.

My hope is that we can share the simulator at the end. It has helped me to optimize my filesystem to unlock some extra performance.

The profiling is anonymous. We don’t care about a few slow nodes. For us it gets interesting when the node selection picks too many slow nodes or something else is broken. At that point we start profiling, and it also means the slow node has some neighbors to hide among, which is what keeps it anonymous.

You mean Storj Select? There was already an announcement about that a few months ago.

Perfect :slight_smile: and of course add it to the instructions for setting up a node :wink:

Understand and agree :slight_smile: Maybe node ranking tables, like a leaderboard - everyone likes a bit of competition to be #1 - and they could also be used as part of a fallback. Maybe a historical node selection window too, so if you can’t get the minimum number of nodes in the first selection, you fall back to nodes that performed well in the previous (T-1) window… I’m sure the data science team will be all over this algorithm; unfortunately math is not my strong point. Good node behaviour should be rewarded, but slow, unworkable nodes should be shunned.

Thank you for responding. I know it’s not your role, but I trust your opinions to be honest.

:heart: CP

There are concerns that allowing users to choose the region where their data is stored will likely lead to severe stratification.
Regions such as Ukraine/Russia/Israel/… may stop receiving data.
Perhaps you could consider the option of primary storage wherever the client wants it, with backup distribution throughout the network - the backup data would probably be cold, but it would allow you to balance the load and eliminate stratification.

Showing operator rating data on the monitoring page will help operators without a simulator try to optimize their system.

If you share the simulator, it will show us our success rate - I’m sure the community will try to make the necessary optimizations at home.

No VMs. There’s quite literally no reason for them. All nodes are run in tmux sessions on the base Linux OS with direct access to the drives. No RAID, no NFS, no cache, etc. CPU and RAM are practically idle, but there is high IO wait.

So the fact that a particular application utilizes hardware resources in a way that even enterprise hardware struggles to handle is somehow me doing something wrong? And what is the limitation in my system? How do I remedy it? Please explain…

Again, as I said to someone else already, I’m not talking about 2 or 3 nodes/drives. Why does everyone keep comparing their 2 or 3 nodes to my particular situation? I’m trying to point out an issue with running larger-scale, energy-efficient systems. But yes, I did mention seeing a number of posts recently about varying degrees of IO issues. Everybody runs on different hardware with different specs and limitations, so there’s no one-size-fits-all approach here. One thing everyone does have in common, though, is high IO that becomes much more prevalent when scaling up. That is my point.

And generally speaking, having to run 36 separate node instances for 36 drives spread across 4 separate drive controllers in a single server is not a very efficient way of scaling. I understand the technical reasons for this (1 node, 1 drive, etc.), but there should at least be some communication between node instances running on the same system so they don’t all run the filewalker / garbage collection at the same time.

I think the way forward is to start moving to all-flash storage, with some kind of flash caching as a stepping stone.
U.2 drives are now reasonably cheap, especially second-hand, as almost nobody is buying those. Backplanes for them are dirt cheap as well. This would, once and for all, solve the IOPS issues and upgrade the performance of the network tremendously.
I also think this is a better way to solve the problem than tinkering with the code, as we will eventually hit the limits of the HDDs no matter how good the code gets.
And I think there should be an incentive for SNOs to move to all-flash storage. Other than winning more races, something like earning a few cents more for every TB stored; something that won’t discriminate against the other SNOs too much.

Hmm, it’s still significantly more expensive than HDDs… In my country I can get an 18 TB HDD for roughly 270 €, while a 2 TB SATA SSD costs 120 € and a 4 TB SATA SSD costs 320 €. If you consider that you would need 4-8 times more SATA connections, power cables, etc. to get the same amount of storage, it’s really significantly more expensive and needs more space in your server.
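
Just to put numbers on that gap, here is a quick per-terabyte comparison using the prices quoted above (a back-of-the-envelope check only, with the currency and prices as given in this post):

```python
# Cost per terabyte based on the prices quoted above (EUR).
drives = {
    "18 TB HDD": (270, 18),
    "2 TB SATA SSD": (120, 2),
    "4 TB SATA SSD": (320, 4),
}
for name, (price_eur, capacity_tb) in drives.items():
    print(f"{name}: {price_eur / capacity_tb:.0f} EUR/TB")

# Output:
# 18 TB HDD: 15 EUR/TB
# 2 TB SATA SSD: 60 EUR/TB
# 4 TB SATA SSD: 80 EUR/TB
```

So SATA flash is roughly 4-5x the cost per terabyte before even counting the extra ports, cables, and chassis space.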

Yes, but someone running an enterprise setup can get ~8 TB NVMe SSDs for $500, and there is an assumption that they can afford it.
The connections usually aren’t a problem, as it is a matter of swapping the backplane (if at all, since we are talking about IOPS and not bandwidth, and for such use cases SATA/SAS is good enough) and adding some riser cards; the cables can usually be reused.
I know there are people who would be glad to move to all-flash, but I understand that before spending tens of thousands of dollars there would have to be an incentive and some future outlook.
But time will tell, and if all goes well I believe we will see such setups as well, hopefully as the majority.

It can be used for caching, but putting the entire volume on flash is very expensive and not at all profitable.
If there were a division into hot and cold data, or some kind of caching option from Storj, then we could really talk about it, but a complete transition is an economically stupid decision.

From an ingress perspective, this behaves like one node on an array. Adding more nodes does not increase ingress; if anything, it just spreads the same ingress across more drives. It’s a very bad configuration.

“1 node, 1 drive” is not a literal requirement to avoid arrays.

In your situation the solution is simple: assemble all your drives into an array and add an SSD accelerator (cache or metadata device).

Staggering filewalkers is not necessary because the problem does not exist on a properly configured system. Filewalker IO is never supposed to reach the drives: filewalkers only need metadata, and metadata should be available in RAM and/or on SSD. This also improves node response time under normal operation.
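
To illustrate why (a minimal sketch of the idea, not the actual storagenode filewalker): a used-space-style scan only needs directory listings and stat() results, which a metadata cache in RAM or on SSD can serve without ever touching the platters.

```python
import os

def walk_pieces(blobs_dir: str) -> tuple[int, int]:
    """Hypothetical filewalker-style scan: count pieces and sum their sizes.

    Only directory entries and stat() metadata are read; no file is ever
    opened. With the filesystem metadata held in RAM or on an SSD (e.g. a
    ZFS special vdev or lvmcache), the spinning disks see almost no I/O
    from a scan like this.
    """
    count, total_bytes = 0, 0
    for root, _dirs, files in os.walk(blobs_dir):
        for name in files:
            st = os.stat(os.path.join(root, name))  # metadata lookup only
            count += 1
            total_bytes += st.st_size
    return count, total_bytes

if __name__ == "__main__":
    # Example path; adjust to your node's storage directory.
    n, used = walk_pieces("/srv/storagenode/storage/blobs")
    print(f"{n} pieces, {used / 1e12:.2f} TB used")
```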

Because it should scale. In your case it does not scale. That is why we suspect it is a YOU problem.
If your drive controller can’t handle 9 drives under load, while a Raspberry Pi can handle a single drive, maybe your controller is the problem? (BTW, I doubt the problem is the controller.)

Maybe increasing ingress is not the goal, but rather balancing out the load and losing less in case of an HDD failure. I don’t think this is a bad config per se. On the contrary, there is a reason why this is (or was) the recommended way. But yes, a proper ZFS pool of course rules! It’s just not that easy, and you waste some space.

Hello.
Can you elaborate on your setup?
CPU/RAM don’t matter.
What RAID controller or HBA drives all this, and how many controllers are there?
What model exactly?
What is the stripe size in the configuration?

I missed a paragraph break between the two sentences, which created the implication that they are related or that one causes the other. Spreading the load is good, and the configuration is bad, but those statements were not meant to be related.

As you said, a proper array would be best; it would not only spread the load better (by balancing the actual load from multiple nodes across all available vdevs) but also make it possible to handle high-IOPS load better.

The reason the config is bad is twofold:

  • Lack of load balancing: it just allocates 200 IOPS to every node, regardless of whether one needs 300 while another is not using its allocation at all.
  • Accelerating such a config is not feasible: you would need to add an SSD to every drive. With an array, a single small SSD can take care of the whole workload.

Likely not, but it could be: if a SATA port multiplier similar to the ones flooding Amazon is involved anywhere, it can create a huge mess.

With LVM you can actually use one SSD for many logical volumes, even when each data LV lives on a single PV. You just have to create separate cache LVs, but they can all reside on a single SSD.

Yep, doing exactly this myself. A single SSD covers 5 of my HDDs, with still plenty of IOPS to spare.

I wonder how this works in the case of the GC filewalker. I might be wrong, but doesn’t the bloom filter only tell the node which segments to keep? Is the segment part of the filename?

I’d seriously hope so, to the point that in my mind I did not even consider this being done any other way: it seems they are using some sort of content-addressable storage scheme, so whatever they are sending had better only require metadata access to process. It would be ridiculously inefficient to need to fish identification out of every file.

It’s not content-addressable, as this would mean the address (file name) is derived from the content.

The file name of a piece is just a random string of characters. And the garbage collection process operates on these random file names, not on the content. That is, it’s those file names that are checked against the bloom filter.
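
Conceptually the GC pass then looks something like the sketch below (my own illustration, not Storj’s actual code or its real bloom filter parameters): decode the piece ID from each file name, test it against the bloom filter, and trash whatever the filter does not claim to contain, all without reading any piece contents.

```python
import hashlib
import os
import shutil

class BloomFilter:
    """Minimal bloom filter for illustration: k hash functions over an m-bit array."""

    def __init__(self, m_bits: int = 1 << 23, k: int = 7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def collect_garbage(blobs_dir: str, trash_dir: str, keep: BloomFilter) -> int:
    """Move every piece whose file name is NOT claimed by the bloom filter to trash.

    Piece contents are never read: only the file name (the encoded piece ID)
    is tested, which is why GC is a metadata-bound workload. Bloom filter
    false positives only mean some garbage survives this round; pieces that
    should be kept are never trashed.
    """
    moved = 0
    os.makedirs(trash_dir, exist_ok=True)
    for root, _dirs, files in os.walk(blobs_dir):
        for name in files:
            piece_id = os.path.splitext(name)[0]  # file name encodes the piece ID
            if not keep.might_contain(piece_id):
                shutil.move(os.path.join(root, name), os.path.join(trash_dir, name))
                moved += 1
    return moved
```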

Good to know. I guess there is indeed no need for it to be content-addressable. Deduplication is not applicable, and integrity can be verified cryptographically. But then something still needs to be done to guarantee the names are unique.

There is no need to guarantee they are unique. A piece ID, which is what a file name encodes, is a 256-bit random identifier. Going by the birthday paradox, you’d have to store around 2¹²⁸ unique pieces before you have a reasonable chance of a single accidental collision. It’s a rather large number: for example, you would need around 100 thousand Earth masses of 20 TB hard drives to store that many 250 kB pieces.
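
As a quick sanity check of that claim (the 10^18 piece count below is my own illustrative number, not from this thread), the standard birthday-paradox approximation P ≈ 1 - e^(-n²/2N) with N = 2²⁵⁶ possible IDs gives:

```python
from math import expm1

ID_BITS = 256
N_IDS = 2 ** ID_BITS  # number of possible piece IDs

def collision_probability(n_pieces: float) -> float:
    """Birthday-paradox approximation: P(any collision) ~= 1 - exp(-n^2 / (2N))."""
    # -expm1(-x) == 1 - exp(-x), but stays accurate for very small x.
    return -expm1(-(n_pieces ** 2) / (2 * N_IDS))

# Even an absurdly large network storing 10**18 pieces has an essentially
# zero chance of a single piece-ID collision:
print(collision_probability(1e18))        # ~4.3e-42
# A "reasonable" (~39%) collision chance only appears around n = 2**128 pieces:
print(collision_probability(2.0 ** 128))  # ~0.39
```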

Is the traffic from static.x.x.x.x.clients.your-server.de and unn-x-x-x-x.datapacket.com the test traffic?