Updates on Test Data

I did not force sync on ZFS; I re-enabled sync in the node settings.

2 Likes

29 posts were split to a new topic: ZFS speed and optimizations for VM

Congrats! And thank [$DEITY] … I thought you'd never be satisfied. Enjoy your next payout! :money_mouth_face:

1 Like

He will find something to complain about on the next payout too. The McDonald's burgers are too expensive, or the Lambo is too shiny for the payout. :rofl:

4 Likes

I know you try to be constructive and you have more knowledge and IT skills than me. I'm just messing with you to lighten the atmosphere. :sweat_smile:

1 Like

Yes, and as I wrote above: if a piece upload fails, re-upload that piece to a backup node, so everything is decided at the piece level. Maybe with a single file it would be slower, but since there are hundreds of files in total, it should work faster overall. If an IP does not respond, the upload fails and the same piece is uploaded to the next node. Even at high upload speeds this should give a better result, because it uses the bandwidth more efficiently.

@Pentium100 What I/O scheduler are you using in your VM and on the host? Normally it is advised to use something like NOOP in the VM and let the host handle the I/O scheduling. It is more efficient because the host is aware of the requests from all guests.

I use mq-deadline in the VM; it looked like I could get a bit more performance from it compared to noop, but maybe that was on Debian 10 and Debian 12 does something different with it.

On the host it's "none" for the zvol (no other choice) and mq-deadline for the drives. I seem to remember reading somewhere that ZFS recommended noop on the drives at some point.

Thanks for the idea, I have something new to try.
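
For reference, here is a minimal sketch of how to check and switch schedulers (assuming Linux and the usual sysfs layout; device names are only examples, and writing to sysfs needs root):

```python
from pathlib import Path

def active_scheduler(dev: str) -> str:
    # /sys/block/<dev>/queue/scheduler looks like "mq-deadline kyber bfq [none]",
    # with the active scheduler shown in brackets.
    text = Path(f"/sys/block/{dev}/queue/scheduler").read_text().strip()
    return text[text.index("[") + 1 : text.index("]")] if "[" in text else text

def set_scheduler(dev: str, sched: str) -> None:
    # Equivalent to: echo <sched> > /sys/block/<dev>/queue/scheduler (root only).
    Path(f"/sys/block/{dev}/queue/scheduler").write_text(sched)

# Print the active scheduler for every block device.
for dev in sorted(p.name for p in Path("/sys/block").iterdir()):
    print(dev, active_scheduler(dev))

# set_scheduler("sda", "none")  # uncomment to actually switch a drive
```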

1 Like

The deadline scheduler was designed to run directly on top of old rotational storage, assuming properties like no concurrency and latency that depends directly on the distance to the target sector. Any intermediate block storage layer that breaks these assumptions has a decent chance of working exactly against the assumptions made by this scheduler.

You are running RAID, so you actually can do concurrent reads (btw, even NCQ breaks this assumption). You are running ZFS, which breaks the latency assumption. As such, if the deadline scheduler was faster, I'd assume this was probably a fluke.

2 Likes

So, I changed the scheduler to none (both in the VM and on the drives), restarted the node without sync enabled, and it looks like it's working fine.


[image: graph_image-3]

Traffic looks about the same, maybe a tiny bit more (I really dislike that the new node selection algorithm does not give any information on whether traffic to my node is reduced or not; it would be really nice if I could somehow get the "success rate" of being selected as the faster node).
I guess with a SLOG, the sync operations do not really reduce performance for me.
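
For completeness, a small helper to check the current setting (assuming the zfs CLI is installed; "tank/storj" is a placeholder dataset name):

```python
import subprocess

def zfs_get(prop: str, dataset: str) -> str:
    # zfs get -H -o value <prop> <dataset> prints just the value, e.g. "standard".
    out = subprocess.run(
        ["zfs", "get", "-H", "-o", "value", prop, dataset],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

print("sync =", zfs_get("sync", "tank/storj"))  # standard / always / disabled
```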

3 Likes

So you suggest not to request another set of nodes when the upload of one of the 80 pieces fails, as I initially thought, but to still request, for example, 110 nodes, start uploads to only 80 of them, and if one fails, take the next node from the remaining, unused 30?
How would the uplink select the fastest 80 from the provided 110 in your suggestion?
In the current scheme that happens naturally.

Each fail-and-select cycle would spend time that could be used for uploads instead, so the suggested optimization would likely increase the upload time. I'm not sure it could be faster in general.

It would depend on how the customer is uploading data.
Uploading to 110 nodes and only keeping 80 pieces wastes up to 27% of the bandwidth (in addition to the data expansion, but we'll ignore that). So, if the customer has a 1 Gbps connection, he can only upload useful data at about 730 Mbps while saturating the connection.
Uploading to 80 nodes and initiating another upload only if one fails would reduce the wasted bandwidth.
It would make the upload of a single segment take longer on average. However, if multiple segments were uploaded at the same time, it could possibly speed up the total upload. Uplink could have an option to select the upload mode.
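
A quick back-of-the-envelope for those numbers (the 1 Gbps link and the 110/80 piece counts are just the figures from this discussion):

```python
link_mbps = 1000            # example uplink speed
started, needed = 110, 80   # uploads started vs. pieces actually kept

wasted_fraction = (started - needed) / started   # bytes sent but thrown away
useful_mbps = link_mbps * needed / started       # throughput of pieces that count

print(f"wasted bandwidth: up to {wasted_fraction:.0%}")                          # ~27%
print(f"useful throughput on a {link_mbps} Mbps link: ~{useful_mbps:.0f} Mbps")  # ~727
```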

Yes, parallel uploads of segments are the usual way to upload faster, and the only limit is your upstream.

For a single segment, though, there is a limit on how fast each node can accept its piece. With parallel uploads of 110 pieces, only the 80 fastest ones win the race. So not only are failures compensated, but also slowness, by the remaining 30, at the expense of up to 27% wasted bandwidth (usually the long-tail cut happens much earlier).
So if the uplink selected wrongly, it might not get maximum throughput for the segment, even if all 80 uploads succeeded without a failure but were just slow.

With enough segments being uploaded in parallel, doing 80 initial uploads per segment would likely be faster overall (even if each segment itself took longer) due to increased bandwidth efficiency.
With few segments being uploaded in parallel (maybe it's a small file), doing 110 initial uploads would be faster, because each segment would take less time to upload even if some bandwidth were wasted.
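
To make that trade-off concrete, here is a toy simulation sketch (all parameters - the failure rate, the upload-time spread, even how the 80/110 split behaves - are made-up assumptions for illustration, not Storj's real numbers):

```python
import random

NEEDED, OFFERED = 80, 110
FAIL_RATE = 0.05                      # assumed chance that a piece upload fails

def piece_time() -> float:
    # Assumed spread of per-node upload times, in seconds.
    return random.lognormvariate(0.0, 0.5)

def race() -> tuple[float, int]:
    """Start 110 uploads, keep the 80 fastest (ignores the unlikely case of >30 failures)."""
    times = sorted(piece_time() for _ in range(OFFERED) if random.random() > FAIL_RATE)
    return times[NEEDED - 1], OFFERED - NEEDED    # time of the 80th finisher, wasted uploads

def retry() -> tuple[float, int]:
    """Start 80 uploads; replace each failed upload with a node from the 30-node reserve."""
    segment_time, extra = 0.0, 0
    for _ in range(NEEDED):                       # pieces upload in parallel
        t = 0.0
        while random.random() < FAIL_RATE:        # a failed attempt costs time and bandwidth
            t += piece_time()
            extra += 1
        t += piece_time()                         # the attempt that finally succeeds
        segment_time = max(segment_time, t)
    return segment_time, extra

random.seed(42)
for name, strategy in (("race 110->80", race), ("retry 80+30", retry)):
    t, extra = strategy()
    print(f"{name}: segment time ~{t:.2f}s, extra piece uploads: {extra}")
```

With these made-up numbers, the retry strategy wastes far fewer uploads, but its segment time is set by the slowest of its 80 nodes instead of the 80th-fastest of 110 - exactly the latency-versus-bandwidth trade-off described above.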

2 Likes

With the usage of 80 out of 110, the node selection algorithm may start to work incorrectly - it wouldn't have correct success-rate information anymore.

@littleskunk still looking for speed?

[image: one_node]

The beginning of the graph - two nodes, both have available data space.
Second half of the graph - once the second node is full, the first node could take up all the bandwidth, but it doesn't…
Why?

Because the speed of the pattern has gone down, but the efficiency has gone up - I see it on my nodes.

I don't know whether he's against potato nodes, but in any case he's against bad choices. It's fun to have him around, since he has taught us (or at least me) a lot.

Although he doesn't know it, he even convinced me to give ZFS a try, just because you can separate the metadata from the actual data, which is really convenient for the filewalkers (after finding out about the special devs).

But I'll happily let him speak for himself.

That's the first point to be aware of, and spot on. In general, over time I'm becoming a bit wary of the tone being chosen - on both sides, by the way, and nothing personal towards IsThisOn (because it's not a point aimed at him).

The primary point is that Storj is evolving over time. Evolving means changes. Our potato, or just-not-potato, nodes turn out not to be keeping up with the new demands. We can be angry at Storj about it, but they can't help it either. And in favor of growth and the future, I can't blame them for choosing to accept the fact that many nodes can't keep up - just like we saw before with, for example, SMR drives.

Although they might consider making some changes in the software, like the choice-of-n principles and so on, I think we should warn against thinking this is a set-and-forget project.

But let's respect the fact that some people, like me, are sticking to the "use what you have" principle and keep changing things to keep it working - just because it's a hobby that brings me a lot of pleasure in learning new things, for which I thank a lot of people here. But those with potato hardware - like me - please be polite. Don't blame the team for the choices they make, and keep the tone constructive.

I would say some warnings and requirements should be added to this page, @Alexey. Something along these lines:

  1. Due to the evolution of STORJ as a company, requirements may change over time.
  2. This is not a fully set-and-forget project; due to the changing nature of the project, now and then it will take some time investment.
  3. If future demands stretch further than your hardware can keep up with, you can choose to move to better hardware (if you're willing), decide to quit (preferably using a graceful exit), or just risk losing your node.

Essentially, I think these growing pains are to be expected, should be warned about, and should be greeted with some calmness.

2 Likes

Could you point to the information you are referring to?
Or is it just the practice of having one ZFS dataset for the data and one for the Storj DB files, with the DB dataset configured for optimal use by typical databases?

No, the databases were on SSD already.

The essence has been described here, including some benchmarks: Best filesystem for storj

Splitting data (on HDD) from metadata (on SSD) potentially gives a tremendous performance boost. Although random I/O could be worse, the tests shown in that topic are completely synthetic: 512 B files are pure random I/O, but most files are much bigger and not so much random I/O.
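
For anyone curious, a hedged sketch of what that layout looks like (the pool name "tank", the dataset "tank/storj", and the device paths are placeholders; a special vdev becomes a required part of the pool and on raidz pools it generally cannot be removed later, so double-check before running anything like this):

```python
import subprocess

def run(cmd: list[str]) -> None:
    # Print and execute an administrative command (needs root and the ZFS tools).
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Add a mirrored 'special' vdev so ZFS stores metadata on the SSDs;
# this is what makes filewalker-style metadata scans so much cheaper.
run(["zpool", "add", "tank", "special", "mirror",
     "/dev/disk/by-id/ssd-1", "/dev/disk/by-id/ssd-2"])

# Optionally also send small file blocks (here <= 64K) to the special vdev.
run(["zfs", "set", "special_small_blocks=64K", "tank/storj"])
```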

1 Like