I did not force sync on ZFS; I re-enabled sync in the node settings.
29 posts were split to a new topic: ZFS speed and optimizations for VM
Congrats! And thank [$DEITY]... I thought you'd never be satisfied. Enjoy your next payout!
He will find something to complain about on the next payout too. The McDonald's burgers are too expensive, or the Lambo is too shiny for the payout.
I know you try to be constructive and you have more knowledge and IT skills than me. I'm just messing with you to lighten the atmosphere.
Yes, and as I wrote above: if a piece fails, re-upload it to a backup node, so everything is decided at the piece level. Maybe with a single file it would be slower, but as I see it there are hundreds of files being uploaded together, so altogether it would work faster. If an IP does not respond, that upload fails and the same piece is uploaded to the next node. Even at high upload speeds this should give a better result, since it uses the bandwidth more efficiently.
@Pentium100 What IO scheduler are you using in your VM and on the host? Normally it is advised to use something like noop in the VM and let the host handle the IO scheduling. It is more efficient because the host is aware of the requests from all guests.
I use mq-deadline in the VM; it looked like I could get a bit more performance from it compared to noop, but maybe that was on Debian 10 and Debian 12 does something different with it.
On the host it's "none" for the zvol (there is no other choice) and mq-deadline for the drives. I seem to remember that ZFS recommended noop on the drives at some point.
Thanks for the idea, I have something new to try.
The deadline scheduler was designed for running directly on top of old rotational storage, assuming properties like no concurrency and latency that depends directly on the distance to the target sector. Any intermediate block storage layer that breaks these assumptions has a decent chance of working directly against what this scheduler expects.
You are running RAID, so you actually can do concurrent reads (by the way, even NCQ breaks this assumption). You are running ZFS, which breaks the latency assumption. As such, if the deadline scheduler was faster, I'd assume that was probably a fluke.
So, I changed the scheduler to none (both in the VM and on the drives), restarted the node without sync enabled, and it looks like it's working fine.
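For anyone who wants to try the same change: the scheduler is just a file under sysfs, so checking and switching it can be done with a plain echo, or with something like this rough Go sketch (the device name and target scheduler are placeholders, it needs root, and the change does not persist across reboots):

```go
// Minimal sketch: show the current IO scheduler of a block device and switch
// it by writing to sysfs. "sda" and "none" are placeholders - adjust for your
// own drives. Equivalent to `cat`/`echo` on the same file.
package main

import (
	"fmt"
	"os"
)

func main() {
	dev := "sda"   // placeholder device name
	want := "none" // placeholder target scheduler

	path := fmt.Sprintf("/sys/block/%s/queue/scheduler", dev)

	current, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}
	// The active scheduler is shown in brackets, e.g. "[mq-deadline] none".
	fmt.Printf("%s: %s", dev, current)

	// Writing a scheduler name selects it for this device.
	if err := os.WriteFile(path, []byte(want), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write (needs root):", err)
		os.Exit(1)
	}
	fmt.Printf("switched %s to %s\n", dev, want)
}
```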
Traffic looks about the same, maybe a tiny bit more. (I really dislike that the new node selection algorithm does not give any information about whether traffic to my node is being reduced; it would be really nice if I could somehow get the "success rate" of being selected as one of the faster nodes.)
I guess with a SLOG, the sync operations do not really reduce performance for me.
So you suggest not to request another set of nodes when the upload of one of the 80 pieces fails, as I initially thought, but instead to still request, for example, 110 nodes, start uploads to only 80 of them, and if one fails, take the next node from the remaining, unused 30?
How would the uplink select the fastest 80 from the provided 110 in your suggestion?
In the current scheme it happens naturally.
Each fail-and-select cycle would spend time that could be used for uploads instead, so the suggested optimization would likely increase the upload time. I'm not sure it would be faster in general.
It would depend on how the customer was uploading data.
Uploading to 110 nodes and only keeping 80 pieces wastes up to 27% of the bandwidth (30/110 ≈ 27%, on top of the erasure-coding expansion, which we'll ignore here). So if the customer has a 1 Gbps connection, they can only upload about 730 Mbps of useful data while saturating the link (1 Gbps × 80/110 ≈ 730 Mbps).
Uploading to 80 nodes and initiating another upload if one fails would reduce the wasted bandwidth.
It would make the upload of a single segment take longer on average. However, if multiple segments were uploaded at the same time, it could possibly speed up the total upload. Uplink could have an option to select the upload mode.
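To make that mode concrete, here is a rough sketch of it - not the actual uplink code, just an illustration where uploadPiece, the node list and the failure rate are made up: request 110 nodes, upload to only 80 of them, and pull a replacement from the 30 spares whenever a piece fails. It is written sequentially for brevity; a real client would run the uploads concurrently.

```go
// Illustration only - not the real uplink. uploadPiece stands in for a piece
// upload that sometimes fails; the node IDs are just ints.
package main

import (
	"fmt"
	"math/rand"
)

const (
	requested = 110 // nodes requested from the satellite
	needed    = 80  // pieces that must actually be stored
)

// uploadPiece pretends to upload one piece; roughly 10% of attempts fail.
func uploadPiece(node int) error {
	if rand.Intn(10) == 0 {
		return fmt.Errorf("upload to node %d failed", node)
	}
	return nil
}

func main() {
	nodes := make([]int, requested)
	for i := range nodes {
		nodes[i] = i
	}

	active := nodes[:needed] // upload to the first 80 only
	spares := nodes[needed:] // keep the remaining 30 as replacements

	stored := 0
	for _, n := range active {
		for {
			if err := uploadPiece(n); err == nil {
				stored++
				break
			}
			if len(spares) == 0 {
				fmt.Println("ran out of spare nodes")
				return
			}
			// Retry the same piece on the next spare node instead of
			// racing 110 uploads and discarding the 30 slowest.
			n, spares = spares[0], spares[1:]
		}
	}
	// Bandwidth view: ~80 pieces (plus a few retries) leave the uplink
	// instead of 110, so a saturated 1 Gbps link carries close to 1 Gbps
	// of useful pieces rather than ~1 Gbps * 80/110 ≈ 730 Mbps.
	fmt.Printf("stored %d of %d pieces, %d spares left unused\n", stored, needed, len(spares))
}
```

The trade-off is exactly the one described above: a failed or slow piece now costs an extra round trip instead of being hidden by the 30 speculative uploads.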
Yes, parallel uploads of segments are the usual way to upload faster, and the only limit is your upstream bandwidth.
For a single segment, though, there is a limit on how fast each node can accept its piece. With parallel uploads of 110 pieces, only the 80 fastest ones win the race. So not only are failures compensated, but also the slowness of the remaining 30, at the expense of up to 27% wasted bandwidth (usually the long-tail cut happens much earlier).
So if the uplink selected wrongly, it might not reach the maximum throughput for the segment, even if all 80 uploads succeeded without a failure but were just slow.
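For comparison, the current race can be sketched like this (again just an illustration with made-up timings, not satellite or uplink code): start all 110 piece uploads concurrently, keep the first 80 that finish, and cancel whatever is still running.

```go
// Illustration of long-tail cancellation: start all uploads, keep the first
// `needed` that finish, cancel the rest. The random sleep stands in for how
// fast each node accepts its piece.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

const (
	started = 110 // piece uploads started in parallel
	needed  = 80  // pieces kept; the slowest 30 are cancelled
)

func uploadPiece(ctx context.Context, node int) error {
	select {
	case <-time.After(time.Duration(rand.Intn(200)) * time.Millisecond):
		return nil // this node was fast enough
	case <-ctx.Done():
		return ctx.Err() // lost the race, upload abandoned
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	results := make(chan int, started) // buffered so late finishers never block
	var wg sync.WaitGroup
	for n := 0; n < started; n++ {
		wg.Add(1)
		go func(node int) {
			defer wg.Done()
			if uploadPiece(ctx, node) == nil {
				results <- node
			}
		}(n)
	}

	winners := make([]int, 0, needed)
	for node := range results {
		winners = append(winners, node)
		if len(winners) == needed {
			cancel() // cut the long tail: the slow uploads are abandoned
			break
		}
	}
	wg.Wait()
	fmt.Printf("kept the %d fastest pieces, abandoned %d slow ones\n",
		len(winners), started-len(winners))
}
```

Every abandoned upload has already consumed some bandwidth, which is where the up-to-27% overhead comes from.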
With enough segments being uploaded in parallel, doing 80 initial uploads per segment would likely be faster (even if each segment itself took longer) due to the increased bandwidth efficiency.
With few segments being uploaded in parallel (maybe it's a small file), doing 110 initial uploads would be faster, because each segment would take less time to upload even if some bandwidth is wasted.
With 80 used out of 110, the node selection algorithm might start to work incorrectly - it wouldn't have correct success-rate information anymore.
@littleskunk still looking for speed?
The beginning of the graph - two nodes, both have available data space.
Second half of the graph - once the second node is full, the first node could take up all the bandwidth, but it doesn't...
Why?
Because the speed of the pattern has gone down, but the efficiency has gone up; I see it on my nodes.
Don't know whether he's against potato nodes, but in any case he's against bad choices. Fun to have him around, since he has taught us (or at least me) a lot.
Although he doesn't know it, he even convinced me to give ZFS a try, just because you can separate the metadata from the real data, which is really convenient for the filewalkers (after finding out about special vdevs).
But I'll happily let him speak for himself.
That's the first point to be aware of, and spot on. In general, over time I'm becoming a bit wary of the tone chosen - on both sides, by the way, and nothing personal towards IsThisOn (because this point is not aimed at him).
The primary point is that Storj is evolving over time. Evolving means changes. Our potato, or just-not-quite-potato, nodes turn out not to be keeping up with the new demands. We can be angry at Storj about it, but they can't help it either. And in favor of growth and the future, I can't blame them for choosing to accept the fact that many nodes can't keep up - just like we saw earlier with SMR drives, for example.
Although they might consider making some changes in the software, like the choice-of-n principle and so on, I think we should warn against thinking of this as a set-and-forget project.
But let's respect the fact that some people, like me, stick to the "use what you have" principle and keep changing things to keep it working - simply because it's a hobby that brings me a lot of pleasure in learning new things, for which I thank a lot of people here. And those with potato hardware - like me - please be polite. Don't blame the team for the choices they make, and keep the tone constructive.
I would say some warnings and requirements should be added to this page, @Alexey. Something along these lines:
- Because Storj is evolving as a company, requirements may change over time.
- This is not a fully set-and-forget project; due to its changing nature, it will take some time investment now and then.
- If future demands stretch further than your hardware can keep up with, you can choose to move to better hardware (if you're willing), decide to quit (preferably using a graceful exit), or just risk losing your node.
Essentially, I think these growing pains are to be expected, should be warned about, and should be met with some calm.
Could you point to the information you are referring to?
Or is that just the practice of having one ZFS dataset for the data and one for the Storj DB files, with the DB dataset configured for optimal use by typical databases?
No, the databases were on SSD already.
The essence has been described here, including some benchmarks: Best filesystem for storj
Splitting the data (on HDD) from the metadata (on SSD) potentially gives a tremendous performance boost. Although random IO could look worse, the tests shown in that topic are completely synthetic: 512-byte files are pure random IO, but most files are much bigger and far less random.
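For anyone curious what that setup roughly looks like, here is a hedged sketch (the pool name and device paths are placeholders, not taken from that topic). It only checks whether a pool already has a special vdev and prints the typical commands for adding one; note that metadata only lands on the special vdev for data written after it is added, and losing the special vdev loses the pool, so it should be mirrored.

```go
// Sketch: check whether a ZFS pool already has a "special" vdev class and
// print the usual commands for adding one. Nothing is changed automatically;
// the pool name and device paths are placeholders.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	pool := "tank" // placeholder pool name

	out, err := exec.Command("zpool", "status", pool).CombinedOutput()
	if err != nil {
		fmt.Println("zpool status failed:", err, string(out))
		return
	}

	if strings.Contains(string(out), "special") {
		fmt.Println(pool, "already has a special vdev")
		return
	}

	// Typical commands (run manually after double-checking device names):
	// a mirrored special vdev for metadata, and optionally small file blocks
	// redirected to it as well.
	fmt.Printf("zpool add %s special mirror /dev/disk/by-id/ssd1 /dev/disk/by-id/ssd2\n", pool)
	fmt.Printf("zfs set special_small_blocks=4K %s\n", pool)
}
```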