Machine freezing for short periods

Looking at the way this is performing, I'm assuming each blob is written to the drive as it's received and appended to until complete. IMO, the blobs should be stored in RAM and written to the drive once complete. There are multiple reasons for this.

  1. A 2MB sequential write is much more efficient than (assuming 8k packet sizes) 250+ separate write operations.
  2. Most uploads seem to be cancelled, so in most instances no data would ever need to be written to the drive.
  3. Unnecessary drive writes increase wear and tear and temperature, and therefore the risk of failure.
  4. Unnecessary drive writes increase fragmentation and reduce performance.
  5. Unless someone's downstream bandwidth is faster than their drive, which is unlikely, this would hopefully solve any issues with SMR drives, which IMO will become more and more common as users look for cheap data storage rather than system drives.

The only advantage I can think of for writing partial blobs to the drive is if you want to persist the data so uploads can be resumed in the event the service shuts down and restarts. But the above points seem like a massive trade-off just to enable restarting the service, especially as it's safe to assume that in the event of a service restart you'd lose any race you're in. I suppose you'd also marginally reduce RAM usage, but that seems like an unnecessary worry on most PCs.
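
To illustrate what I mean (just a sketch; I haven't read the actual piecestore code, so the names below are invented): receive the piece into a RAM buffer and only touch the drive once the upload has completed, so a cancelled upload never causes a single drive write.

```go
package example

import (
	"bytes"
	"io"
	"os"
)

// receivePiece is a hypothetical handler: it drains the upload stream into a
// RAM buffer and performs one sequential write to disk only once the whole
// piece has arrived. If the upload is cancelled mid-stream, the buffer is
// discarded and the drive is never touched.
func receivePiece(upload io.Reader, path string, expectedSize int64) error {
	// Pre-allocate ~2MB so the buffer never reallocates; 20 concurrent
	// uploads would only hold ~40MB in RAM.
	buf := bytes.NewBuffer(make([]byte, 0, expectedSize))

	if _, err := io.Copy(buf, upload); err != nil {
		return err // cancelled or failed upload: nothing was written to disk
	}

	// One ~2MB sequential write instead of hundreds of small appends.
	return os.WriteFile(path, buf.Bytes(), 0o644)
}
```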

@willyc well that's where custom settings and performance optimization come in, you can do a ton of different things that might mitigate the problem…

making a nice big write cache for the drive on the ssd should make a huge difference, but if i had the issue i would check up on what other people that have experience dealing with SMR drives have to say.

i purposely went out of my way to make sure i didn’t get SMR, so all i can do is guess.

I don’t see how a cache would help when drive utilisation is 100% as the drive never gets an opportunity to clear the cache. My drive apparently has a ~20GB write cache built in.

because the drive is slowed down by doing lots of smaller writes… you might be able to download a piece of software that adds such a cache, if there isn't a feature for it somewhere else… OS, storj settings, filesystem/volume manager.

you should, somewhere without too much trouble, be able to make a write cache for the drive and then just see if that solves it… it might, it might not…
also another issue might be that the drive is interrupted to perform reads once in a while… that cannot help, and you cannot have all the stored data in memory… so writes simply have to wait… ofc then that becomes dangerous because you will end up having lots of data in memory, which will be lost on a crash, which is why ssd caches are so popular and effective.

Looking at the benchmarking from the review of this drive, we know that 100% utilisation and a lot of writes is a massive problem for SMR drives. This isn't just my problem; it will be a problem for everyone running Storj on an SMR drive. I'm not looking to solve my problem, I'm looking to diagnose exactly what the problem is and help solve it for the project. Having seen the benchmarks above, I think @Toyoo's hunch that the SMR “feature” of the drive might be the issue was correct. I'm not really a hardware person and didn't even know SMR was a thing.

I appreciate that you're trying to help me solve my problem, but that's not really what I'm trying to do here. Storj is a fantastic example of what distributed technologies can do, and it needs to be easy for Mr Average to set up. Telling Mr Average they need to create a cache on their SSD system drive isn't a workable solution, and I'm fairly sure it wouldn't solve the problem anyway.

A brief CV…
I'm a software professional with 20+ years of experience, and a lot of that time has been spent on performance optimisation, although mainly CPU/RAM, so I could be of use on something like this. I'm happy to help test, but don't really have time to dive into the project code and help develop at the moment.


I believe that unless there's some kind of explicit fsync() or direct-mode access somewhere in the Storj code, you should be able to count on the operating system to merge consecutive writes made within a short timespan into a single write, by virtue of buffered file access. fsync() is very likely present in the SQLite code, but I did a cursory glance over the storage node code and couldn't find it elsewhere. So, assuming the cause is the SMR drive, my guess would be that the storage node does a proper job of streaming data to disk, and that what kills performance is the SQLite syncs; these would be pretty much enough to cause mayhem.
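
To illustrate the difference (a sketch only, not the actual storage node code): with plain buffered writes the kernel page cache is free to coalesce many small appends into a few larger sequential writes before they ever reach the drive, whereas an fsync() after each small write, which is in effect what SQLite does when committing a transaction unless its synchronous setting is relaxed, forces every one of them out individually.

```go
package example

import "os"

// appendChunk illustrates the point above. With syncEveryChunk set to false
// the data lands in the kernel page cache, which is free to merge many small
// appends into a few large sequential writes before they reach the drive.
// With it set to true, every chunk is forced to stable storage immediately,
// which is what defeats the merging.
func appendChunk(f *os.File, chunk []byte, syncEveryChunk bool) error {
	if _, err := f.Write(chunk); err != nil {
		return err
	}
	if syncEveryChunk {
		return f.Sync() // fsync(2): flush this file's dirty pages now
	}
	return nil // left to the page cache and periodic writeback
}
```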

If Storj is streaming from the socket straight to the disk, then the writes to any one file won't necessarily be close together, especially if it's streaming 20+ uploads at the same time. If all uploads were streaming at the same speed on an average 40Mbps connection with 8k packets, each upload only gets ~2Mbps, so an 8k packet arrives roughly every 30ms per upload and the drive head has to keep jumping between 20 different write locations. I had a quick look to try and find the streaming code but couldn't. If a Storj dev could tell me whether the sockets stream straight to the disk or to RAM, that would be appreciated. If it's to the disk, then I think that should be resolved; if it's to RAM, then I think it must be the database that's the problem and I'd like to test moving the database to the SSD.
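
For reference, this is the kind of pattern I mean by "streaming straight to the disk" (a sketch only, nothing taken from the actual code), as opposed to the RAM-buffer approach sketched earlier:

```go
package example

import (
	"io"
	"net"
	"os"
)

// streamToDisk is the pattern I suspect is in use (names invented): data read
// from the socket goes straight into writes on the piece file, so with 20+
// concurrent uploads the drive sees a constant stream of small writes
// scattered across 20+ different files.
func streamToDisk(conn net.Conn, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// io.Copy pushes data to the file in small chunks as it arrives from
	// the socket; contrast with the RAM-buffer sketch earlier in the thread.
	_, err = io.Copy(f, conn)
	return err
}
```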

Well i run zfs on a 5 disk raidz array, with an ssd l2arc… today my ssd (only a regular sata model) ran at 5-6% load most of the day, with an avg response time of 3-4ms, but sometimes it has sizable spikes of 10-100ms backlog according to netdata.

anything often used, such as the database, would in my case be loaded from and saved to either ram or the ssd. it takes maybe days before my cache is filled after a server reboot.

my regular drives had 2% avg load today, ofc that's split between 5, and i think i'm running with my hdd disk caches off. haven't gotten my raid controller switched out for an HBA yet, so i just ran it straight through.

but really, 6-7% avg utilization over the better part of a day on my ssd,
and that's not to speak of what the ram managed to deal with,
and that's only running 1 storj node… O.o

not that i'm complaining… been getting 3.5 to 4MB a sec of ingress today.
that's a nice start, and to fix the ssd performance issues i might just find an old optane drive to throw in a spare pcie port…

seems a bit to me like the main thing that gives a good success rate is low disk latency,
and if i'm at 100ms ssd disk latency at times, then some uploads are over before my disk can even respond.

but yeah, performance optimization of storj wouldn't be a bad thing… i think there is a lot of headroom to gain there. but i've really got no clue, not a programmer, more of a network / systems builder.

i digress…
so long story short… i seriously doubt it's a database issue… :smiley: ^

apparently not the only one feeling a bit of pressure, my dual xeon x5630s are only at 2% avg,
but i did disable zfs compression, and that's also why i only run raidz1 (cheaper on the checksum / parity).
zfs compression on storj data is a waste of time, even gzip-9 (the highest possible compression) just gave my cpu a headache and didn't seem to save much capacity, if any… but i want to retest that in a better way.

wouldn't surprise me if they are all running software arrays with features they cannot support at any meaningful quality.

I added an issue on Github to try and involve developers in this discussion:


I moved a node around recently and had a chance to evaluate SMR drive performance with storj data.
Copying node contents between two SMR drives was extremely slow: speed ranged between 55 and 130GB/h, averaging 88GB/h. I had to temporarily disable data intake on the node, as without that the speed never went above 30GB/h.
Removing old node files was even worse: it took 6 hours to completely delete a node, with speed averaging 350GB/h.
Drives in question are ST8000AS0002.


1.3 seems to have resolved this problem. I posted an update on Github.


Are you sure it is a result of an upgrade, and not just a result of less ingress traffic?

Yup. I've removed my concurrent requests limit, so I've got more coming in now than before. It's all running smoothly (so far).


i would keep an eye on it, stuff like that takes a while to settle… dunno if the satellites keep track of rejections or something, but it sort of feels like it at times…

it's also a matter of: so long as your drive or drives can keep up, then it's all good… but when you get past the breaking point, it will drop to 270kbyte per second or whatever, and that makes everything else stack up, and then the drive is nearly permanently choked due to the consistent ingress of data.

but whatever works

setting max concurrent at a low number did sort out the problem tho?
always nice to have some feedback, so one can get an idea of whether things work the way one thinks they do.

i would say tho, that thus far i've had no good experience with letting the node run at high numbers of max concurrent… it puts strain on way too many things… imo i run at 20 and it seems that's more than enough

It'd be ideal if nodes could fine-tune this setting dynamically themselves, live (while running).

It could incrementally lower the number of concurrent requests when queries start to stack up, and do the opposite when the disk seems to be keeping up alright while the max number of concurrent requests is being served.

However, I reckon the type of query (download/upload) can have a different load factor on the storage disk/media depending on its technology (PMR, SMR, SSD, …), so maybe one single setting is not enough.
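
Something along these lines maybe (just a sketch: the step size and the way the backlog is measured are made up, and as said above the per-request-type load factor would probably need weighting in too):

```go
package example

// concurrencyTuner sketches the idea above: back off when requests start
// queueing up, creep back up while the disk keeps up at the current limit.
// The thresholds and step size are invented.
type concurrencyTuner struct {
	limit    int // current max concurrent requests
	min, max int
}

// adjust would be called periodically (say once a minute) with the number of
// requests waiting on the disk and the number currently being served.
func (t *concurrencyTuner) adjust(queued, active int) int {
	switch {
	case queued > 0 && t.limit > t.min:
		t.limit-- // work is stacking up: shed load
	case queued == 0 && active >= t.limit && t.limit < t.max:
		t.limit++ // saturated but keeping up: try a little more
	}
	return t.limit
}
```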


I was also thinking about automating this setting. If it's very high, then even if the drive is keeping up fine, having the bandwidth spread more thinly will increase the number of lost races and cancelled uploads. The metric you want to be maximising is successful uploads/hour. But that's not consistent, due to changes in network usage, so it's difficult to compare one day's data to another.

I was thinking something like this might work.
Run at an initialisation value, say 30, for a week to get W1 (successful uploads/hour).
Run at 29 for the following week to get W2.
If W1 > W2 try at 31 the next week.
If W2 > W1 try at 28 the next week.
Repeat.

The adjustments could also be fractions of a whole number, with the rounded number obviously being used for the actual number of connections, so it'd adjust more slowly to minimise the effect of noise.
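
Roughly, in code, it could look like this (just a sketch of the idea; it steps back through the previous value rather than jumping straight past it, but the principle is the same):

```go
package example

import "math"

// tuner keeps the setting as a float so the weekly adjustments can be
// fractional; only the rounded value would actually be applied.
type tuner struct {
	setting  float64 // e.g. start at 30
	step     float64 // e.g. 1, or a fraction to smooth out noise
	prevRate float64 // successful uploads/hour for the previous week
	dir      float64 // +1 or -1: direction of the last change
}

func newTuner(start, step float64) *tuner {
	// As in the example above, the first move is downwards.
	return &tuner{setting: start, step: step, dir: -1}
}

// endOfWeek takes the measured successful uploads/hour for the week that just
// finished: keep moving in the same direction if the rate improved, otherwise
// reverse, then return the limit to use next week.
func (t *tuner) endOfWeek(rate float64) int {
	if rate < t.prevRate {
		t.dir = -t.dir
	}
	t.prevRate = rate
	t.setting += t.dir * t.step
	return int(math.Round(t.setting))
}
```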

Optimising this value across the network would dramatically increase the efficiency of the network as a whole, reducing wasted bandwidth and therefore latency.

I think auto-adjustments should be way quicker than week to week though, because when an SMR disk struggles (for instance), the Node gets suspended pretty quickly :confused:

Couldn’t the Node watch the % of io-wait, as displayed in the top utility?


When this number stays high for a long time (80~90% for 5 or 10 minutes, for instance) it's usually a sign that something's not right… but not necessarily the disk though.

Or even better, the disk usage where it is storing data, as shown by the iostat utility?

But I still think that the best metric the node should watch is simply the number of transactions it can serve/fulfill versus the number of transactions requested by satellites. If there is a shift at some point (as in, the Node cannot keep up and satellite requests start stacking up in a waiting queue), the Node should progressively but fairly quickly do something about it so satellites ease up the pressure on the node.
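
For what it's worth, on Linux the node could read those numbers straight from /proc/diskstats rather than parsing top or iostat output; the busy-time counter there is the same one iostat derives %util from. A rough sketch (device name handling is simplified):

```go
package example

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// diskBusy reads /proc/diskstats for one block device (e.g. "sda") and
// returns the cumulative milliseconds spent doing I/O (the counter iostat
// derives %util from) plus the number of I/Os currently in flight. Sampling
// busyMs twice and dividing the delta by the wall-clock interval gives the
// utilisation the node could react to.
func diskBusy(device string) (busyMs, inFlight uint64, err error) {
	data, err := os.ReadFile("/proc/diskstats")
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		f := strings.Fields(line)
		if len(f) < 13 || f[2] != device {
			continue
		}
		inFlight, _ = strconv.ParseUint(f[11], 10, 64) // I/Os in progress
		busyMs, _ = strconv.ParseUint(f[12], 10, 64)   // ms spent doing I/O
		return busyMs, inFlight, nil
	}
	return 0, 0, fmt.Errorf("device %q not found in /proc/diskstats", device)
}
```

Sample it every few seconds; if utilisation sits near 100% and the in-flight count keeps growing for minutes, that's the "cannot keep up" signal described above.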

Easy to say… I know :wink:

don't have much to say… but SMR drives are not for random reads and writes… they will work fine for reads, but they basically cannot mix reads and writes at any decent performance level… their write speeds are like 700kB/s when hitting them with their worst workloads…

ofc this is a problem that is being worked on; one solution could be tiered storage, so they are mostly used for reads of highly persistent files in the storage server…

One of my nodes is on an SMR drive, and every time sats hit it with a lot of ingress (as in data sent to my node) it quickly becomes unresponsive, with a load average of 700+ and the disk writing at less than 50kB/s.
I have to stop the node and change its settings so it does not accept any more data, and then it works fine again…

Couldn't nodes detect when the storage device is stalling, and tell satellites so, so they pause using them until these nodes empty their queues of tasks and get back to sats when they're ready to get back to work?

A good read:
