now you check that it still isn’t happening with a normal write buffer level, and then if it isn’t you turn the write buffer up again and see if it comes back…
else you are most likely just guessing…
i do find it kinda odd that going beyond a certain point with the write buffer should kill the node…
however there might be a fixed max size on the write buffer for some reason, and if one goes beyond that it crashes… maybe
10gbit nic configurations also have buffers and they are generally not very large… making really large buffers can create other issues… say your disk can write at 120mb/s,
then a 256mb write buffer would mean sync data gets written to the buffer and it then takes over 2 seconds just to flush the buffer to disk…
meaning new incoming data would be at least 2 seconds old when it finally gets to disk… most likely much more…
thus if the buffer system isn’t really smart and your application demands a db write to disk, and then checks whether it has actually been written to the drive… it will not see it because it’s still in the buffer, thus leading to all kinds of weirdness…
i know this example might not be accurate or real, but i tried to use numbers and time people can relate to.
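just to make the arithmetic explicit (same made-up numbers as above, not measurements from a real node):

```
# minimum time for a full 256 MB write buffer to drain to a disk that writes ~120 MB/s
echo "scale=2; 256 / 120" | bc    # ≈ 2.13 seconds before newly "synced" data even reaches the platter
```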
No magic, just: no virtualization at all (“bare metal”) + a good disk cache with enough RAM.
At some point almost all filesystem metadata (folders, filenames, file sizes, security descriptors, etc.) is loaded into RAM and gets read from RAM after that, not from the actual disk. Reading from the disk cache is about 100-1000 times faster.
Also the OS and the Storj software (including log files, which are written frequently) are located on an SSD, not on the same HDD. The HDD contains almost only Storj data (+ some cold data which is almost never used), so it is dedicated to serving Storj I/O without interruption from other load. The Storj database is still on the same HDD as the main storage though, as I have NOT tested moving the DB to a custom path on the SSD yet (I am pretty fine with current performance and stability, so “don’t fix it if it is not broken”).
Although these numbers (3-10 min) are for a “soft restart” or “warm start” - when I restart just the Storj node software, without restarting the OS or rebooting the computer itself.
A “cold start” (after the computer reboots) takes significantly longer, but it is still in the 10-20 min range I think. Although I am not 100% sure the “cold” numbers are still accurate, as I have not had a cold start for about 1.5 months already, and during this time there were 2 Storj software updates and the amount of stored data has increased significantly since the last cold start.
Here I did a quick test. I used a script to list and count ALL files of one of my nodes. It took less than 2 min to go through all 1.45 TB of stored data in ~1.1 million files (and the storj node was running all along, so some uploads/downloads and DB ops were being served from the same disk in parallel, just as happens during an actual node startup).
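(That script was on Windows; a roughly equivalent one-liner on Linux - not the exact script I used, and the path is a placeholder - would be:)

```
time find /path/to/storagenode/storage/blobs -type f | wc -l
```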
Yes, I am using Windows with the native Storj app (no Docker or any virtualization).
The disk is just plain NTFS without any tweaking (only the “priority to background services over user apps” option in the system settings, which includes a larger disk cache).
One SSD (for OS, temp, swap, and software including logs) + one or two HDDs for data. No arrays at all.
112k per sec is just great! I never saw more than 30-40k IOPS on my machines, and most of those were served from cache. I do not know why it takes so long on your system then. With up to 100k reads per second a node should be able to go through a blob storage of a few million files in just one or a few minutes, not a few hours. It is just 1 or 2 read ops per file and folder after all.
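Rough back-of-the-envelope for that claim (assumed numbers from this thread, not a measurement):

```
# ~1.1 million files, ~2 metadata reads each, served at ~100k reads/s from cache
echo "scale=1; 1100000 * 2 / 100000" | bc    # ≈ 22 seconds, i.e. well under a few minutes
```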
its a rather new problem… i think not 1½ months old…
so that would sort of explain it…
because all the stuff you say i have equal or better…
still takes my system 1½ hours i think…
but the new update is out soon, so then ill check how long it takes from rebooting it until the iowait drops.
well 112k iops for this system is kinda weak… granted my iops for my hdd’s are more like 800-2000, for reads… but with all my caching i should easily be able to get like in the million range or near… so to be fair i’ve been working a lot on trying to make it better…
anyways… from what i can see it doesn’t matter if it’s a cold start or not… the node runs something that uses a ton of disk iops and thus produces iowait… and takes forever… from my tests, even if i shut down the node just after it finishes and start it again, it restarts the process… and my system is like a sponge for data caching… everything gets cached…
i’m like 10 days into my uptime and the system is still in the process of training the caches.
i can reboot a windows server vm in like … maybe a second or two… takes so little time that it will be ready before i’m done clicking or typing …
if i was to hazard a guess at what it’s doing, then it might be something like database cleaning, it would be the only thing that sort of makes sense to me… because i don’t understand how it can produce so much iowait twice in a row…
i’m sure storj will get it fixed eventually… at least it shouldn’t run twice in a row, whatever it is…
I am not sure that they even know that they need to fix something in the first place
Most IOPS after node start are produced just by the file listing/counting.
I have a guess, given that you have great read ops speed and bad write speed…
Maybe you have some filesystem feature turned on that modifies file metadata (the “last accessed” timestamp for example) even when files are not opened by a process and it just reads file metadata while scanning the storage after startup? Then each file would produce a write request in addition to the read ops, and the iowait would actually be for writing and not for reading.
Or it somehow triggers a ZFS checksum test, for example.
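If the “last accessed” guess applies (it is only a guess, and the dataset name below is a placeholder), on ZFS it is easy to check and switch off:

```
zfs get atime tank/storagenode     # see whether access-time updates are enabled
zfs set atime=off tank/storagenode # stop metadata writes caused purely by reading files
```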
I can understand why it can take a few hours on an RPi - without good caching, on a single HDD with just a few hundred IOPS provided by the disk itself, it can easily take a few hours.
But with proper caching on a powerful enough system it should take just minutes, not hours. If it doesn’t, then the problem is probably somewhere in your system/setup, not in the storj app.
And what do you mean by “rather new problem”? Node version? But I run both a relatively old (v 1.3.3) and the current latest (v 1.6.3) version, and regarding disk IOPS usage, including startup, I do not see any significant difference. For example, I just checked my bigger (~4 TB of data) node running on 1.6.3 and the OS stats say it has produced just about 3.5 million read ops and 5 million write ops since its very start, which was about 50 hours ago.
And the other node running on 1.3.3: 3.5 mil read + 4 mil write ops in the same time span of about the last two days.
And 7-9 mil I/O ops per 2 days (including pretty active ingress all this time) is not that much even for a single HDD setup. Although it can be a problem if that HDD is based on SMR technology.
don’t think i noticed the increased iops upon boot before maybe 1.4.2, tho it could have been 1.3.3
well i’m currently migrating my 9tb+ node… AGAIN lol
last time it was only maybe 6tb… anyways, a full check and verification of all the node files i can do in not much time… will know in a day or two how long it takes exactly…
and my writes go to a dedicated slog device (os / slog ssd)
which should be able to do 56k iops but i can only get it to do 4k
but i’ve had a 4kn / 512n mismatch, or a sas / sata issue, or both.
which is why i’m migrating…
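for what it’s worth, a quick way to sanity-check sync-write iops on a slog-backed dataset is an fio run like this (fio has to be installed; the directory is a placeholder and the numbers are just a short test, not a tuned benchmark):

```
# --fsync=1 forces every 4k write through the zil/slog instead of the async write path
fio --name=slog-sync-test --directory=/tank/storagenode --rw=randwrite \
    --bs=4k --size=512m --iodepth=1 --numjobs=1 --fsync=1 \
    --runtime=30 --time_based
```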
you can only do millions of iops in ram… and fast ram at that… i think a $250 io accelerator ssd card will get you 1.5mil and that’s either ddr4 or ddr5 based
ofc that only uses 1 or 2 modules… so the ram directly on the cpu’s memory channels would be able to do upwards of 5mil+ depending on the number of ram modules per cpu you got…
if you turn off sync then you can run the node like that… but then crashing would most likely cause a lot of damage to the node…
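on zfs that “turn off sync” switch looks like this (dataset name is a placeholder; with it disabled a crash or power loss can drop the last few seconds of writes the node thought were safe, which is exactly the damage risk mentioned above):

```
zfs set sync=disabled tank/storagenode   # faster, but acknowledges writes before they are on stable storage
zfs get sync tank/storagenode            # verify; set back to sync=standard to restore safe behaviour
```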
i think my system usually does like 1000 reads pr second during regular node operation, maybe a bit less or a bit more… i cannot remember the exact number… 3600 sec in an hour so that’s 3.6mil an hour… your 8-9mil over 50 hr, let’s call it 10mil, works out to 200k in 1 hr or 3600 sec, so that’s like 50-55 iops avg over the 50 hr… that’s like 1/4 of what a standard hdd can write… which i believe is like 200 iops
ofc hdd’s are tricky because random read / writes can really kill performance
the current reads i’m doing because of migration…
so if we just call it 10k pr sec that’s 36 million reads in an hour, ofc that’s not what the node is doing… it’s because i’m running rsync and am in the final process of verifying that everything is in sync before shutting down the node for the last rsync, and it doesn’t reach peak reads because it will find something to copy and then continue afterwards… have seen it pass 120k today.
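the pattern is roughly this (paths are placeholders, not my actual mount points):

```
rsync -a /oldpool/storagenode/ /newpool/storagenode/            # repeatable passes while the node keeps running
# stop the node, then one final pass so both copies are identical
rsync -a --delete /oldpool/storagenode/ /newpool/storagenode/
```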
but still, since much of this is in ram… it should be faster… hell, even my ssd’s should be able to be in this range… but yeah, still getting used to zfs
no doubt a rpi can run into many issues… just like having the data on only one hdd is imo a bad plan… that is doomed to fail…
I’ve got the strange feeling that I’m in the middle of a numbers battle between 2 giants with killer systems, huge amounts of RAM and millions of iops all over the place…
No wonder my poor RPi4 sucks like hell at handling millions of files.
When my disk counts files with ncdu for instance, it’s more like 150 to 270 io per second.
Ran the test this morning:
pi@raspberrypi:~/storj/mounts/storj_node_1 $ time ncdu -o ~/mount1-listing.txt
/home/pi/storj/mounts/storj_n...orage/temp/blob-868128123.partial 1409723 files
real 117m56.912s
user 0m20.811s
sys 6m8.238s
well it’s not really the rpi’s fault that a hdd can do something like 200 iops… supposedly 400 read… but i duno about that… maybe it can do read and write iops equal to 400… i duno… so many details its not easy to keep track…
i’m not much better than that… working on setting things up so i’m running a pool of 3 x 3-drive raidz1
each raidz1 gives me 1 disk’s worth of iops, so about 3 disks’ worth when i’m done making it… and then comes all the caching on top.
zfs will use ram and/or ssd’s for caching in tiers, so you can set up an ssd for writes (slog) and an ssd read cache (l2arc) as a sort of overflow for memory… and then it will try to keep anything it uses frequently or recently in memory.
so the first time i run the du command after a reboot it will most likely run at about 3 times the speed once i get this working… and if zfs can figure out what it’s doing, which is pretty likely, then it will start to cheat and read ahead… predicting what will be needed in the future… which will start to speed it up… the 2nd time it runs it will be in memory… and in my case i use a 600gb ssd l2arc, which means i end up having anything that gets accessed even once a week or once a month in the ssd cache.
the OS is also placed on an ssd, so as not to disrupt the pool… mostly because then one can boot even if the pool has issues… which is pretty good
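roughly what that layout looks like when the pool is created (every device name below is a placeholder, and ashift should match the drives’ real sector size, which is the 4kn/512n thing mentioned earlier):

```
# one pool from three 3-disk raidz1 vdevs, plus a dedicated slog and an l2arc read-cache ssd
zpool create -o ashift=12 tank \
  raidz1 sda sdb sdc \
  raidz1 sdd sde sdf \
  raidz1 sdg sdh sdi \
  log nvme0n1p1 \
  cache nvme0n1p2
```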
tho right now i’m copying my node back from 2 drives in a span for temp storage…
but for some odd reason i only get like 50mb/s
which is kinda weird… since both drives should be able to do like 120-150, maybe it’s something to do with how zfs handles the whole 2-disks-as-one thing…
so yeah it’s not always an advantage… and zfs does want a bit of resources and can use basically however much ram you throw at it…
@Mad_Max just restarted the node… cold start after reboot… tho with a copy running… took 30 min… so not nearly as bad as i remember, ofc before i was running 4k blocksizes on 512n hardware… so that might give me a huge io amplification… everything seems to run much smoother now…
It’s not all that powerful in my case. Just full-sized (ATX motherboards in midi-tower cases) home-built servers from very common and relatively cheap parts.
One server is an AMD FX-8320 (4/8 cores/threads @ 4GHz) + 2x4 GB DDR-3 RAM + 1 SSD + 1 HDD
Another is an AMD Ryzen 2700 (8/16 c/t @ 3.3 GHz) + 2x8 GB DDR-4 RAM + 1 SSD + 2 HDDs
All HDDs work as standalone drives (no arrays at all), no fancy filesystems (just plain NTFS) or custom system software. SSDs work as system/software drives and HDDs as data drives. A pretty common setup for a good home desktop computer (actually one of them is my former home desktop, repurposed to work as a server after I built a new one for my own use).
Also these servers are NOT dedicated to Storj alone; they are pretty heavily loaded with other tasks, with CPU load at about 80-95% all the time. Although most of that load runs at low priority (it is from the BOINC distributed computing network) and I gave a high process priority to the Storj service.
I think it is all about proper disk caching. Windows behaves quite aggressively about caching even in the default config. Usually it just tosses almost all currently unused RAM into the disk cache, leaving just 500-1000 MB for immediate use (but it still shows this RAM as free, while it is not actually free - it “will be freed at first demand”).
At the beginning (“cold start” right after a system reboot) I also see very similar disk performance of about ~200 ops per second. And it is just puny!
But it increases fast as the disk cache begins to fill with data. And after some time such things as file listing/counting or a “warm” Storj node startup reach speeds of about 20-40 thousand ops per second (not a 4k synthetic test - this is from monitoring real running apps, including Storj).
Or 100-200 times faster compared to the initial speed seen right after a reboot.
I did another test with a reboot: the first listing of 1.1 million files right after the reboot (“cold start” with empty caches) took about 7-10 min. It started at about 200 ops/s and was at >20k ops/s by the end of the process.
And a second listing (“warm start”) took less than 2 min again, as in the previous test.
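On Linux the same cold-vs-warm comparison can be reproduced without a full reboot by dropping the page/dentry/inode caches (needs root; my test above was on Windows, where a reboot is the simplest way, and the path below is a placeholder):

```
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches        # empty the caches ("cold")
time find /path/to/storagenode/storage/blobs | wc -l    # first run: cold timing
time find /path/to/storagenode/storage/blobs | wc -l    # second run: mostly served from cache ("warm")
```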
to test my iops i was playing around with copying 1mil empty files … xD not an exact measure of iops… but a pretty good approximation of max iops performance and how well the system performs when dealing with high IO tasks, like tons of sync database writes.
and i kinda wanted to push the copy time down… since it took ages for my system…
i think each file was 10 bytes
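something like this is enough to produce a pile of tiny files for that kind of test (layout and counts are arbitrary, and writing a million small files will itself take quite a while):

```
mkdir -p /tmp/iops-test && cd /tmp/iops-test
for d in $(seq 1 1000); do
  mkdir "$d"
  for f in $(seq 1 1000); do
    printf '0123456789' > "$d/$f"   # 10-byte files, one million in total
  done
done
```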