Copying node data is abysmally slow

The vast majority of the files in the blobs folder are <1MB; for a 20TB node this becomes a very big problem.
Is there any way in the future we can have the files be larger, or have the node allocate larger blocks on the disk and manage the data itself without relying so much on the file system? Even if it was just 100MB/1GB blocks, that would be a massive improvement.

When it comes to the point where you need to migrate the data to another disk, it absolutely hammers the disk with all the small files. This particular node has some 30 million files. I don’t want to have to spend 10 days copying files when the disk is dodgy. Over a network transfer it can take significantly longer as well; it took almost a full month to copy an 18TB node between two servers.

I would be fine with the small files if everything was on an SSD, but for a hard drive this is just brutal.


littleskunk gave an excellent, detailed answer here

Which links to this forum post as well

I would recommend using ZFS. It will cache all the small files in RAM and flush them to disk as one big write operation. I copied some data using rclone from one disk to another and even with all the small files I was seeing about 40 MByte/s. I didn’t play around with all the options ZFS has. Maybe there is a way to get even higher performance out of it.
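To give an idea, the rclone invocation doesn’t need to be anything fancy; something along these lines should do (the paths and parallelism values here are just placeholders, not the exact command I used):

```sh
# copy a node's data folder between two local disks with parallel transfers
# (paths and the --transfers/--checkers values are placeholders, tune for your setup)
rclone copy /mnt/olddisk/storagenode /mnt/newdisk/storagenode \
  --transfers 16 --checkers 8 --progress
```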


I already use a 200GB memory cache which defers writes to improve performance. The referenced issue is more a rant than anything particularly helpful, so I agree that you closed it.

My point is more aimed at the situation when you need to replace the drive for the node because it has started degrading. The drive could fail imminently so you don’t want to be spending ages getting data off before it dies.

40MByte/s is decent. I was seeing slightly slower speeds, likely because the node is very old, so the files are scattered all over the disk, and the node was still running while performing the transfer.
Consequently, at 40MB/s the copy for a 20TB node would take over 5 days to complete. But sequential performance on a typical node using high-capacity drives is >300MB/s, so the same transfer would finish in less than a day.

Just did a test: I can copy 3GB of these small files at 250MB/s once they have been freshly copied to another disk. Copying them directly off the disk the node has been running on for a long time only manages 50-60MB/s. So it looks like it’s more a fragmentation issue.

keep in mind large capacity disks usually also have large caches, which will skew smaller tests when copying only a few GB…
say a 512MB cache would be 1/6th of 3GB, and thus your benchmark would have a minimum deviation of about 16%, and most likely much more.

there are ways to copy the data sequentially, but that would require shutting down the node; at least in theory you should be able to use the dd command for a sequential copy of the partition.
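a rough sketch of what i mean (device names are placeholders, the node must be stopped first, and the target partition has to be at least as large as the source):

```sh
# clone the node's partition sequentially instead of copying millions of files
# /dev/sdX1 = old node partition, /dev/sdY1 = new partition (placeholders)
dd if=/dev/sdX1 of=/dev/sdY1 bs=64M status=progress
# if the new partition is larger, grow the file system afterwards, e.g. for ext4:
# resize2fs /dev/sdY1
```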

but yeah, storage node data copy is a heavy workload; the only way around it is to use something that will copy the disk data sequentially, which comes with its own difficulties.

older large storage nodes take a long time to get to size and might be worth storing on a raid; ofc a raid also limits how many iops the hardware can do, because all disks run synchronously.

so really there are no good options, big data is a lot of trouble.
5 days for 20TB is pretty fast actually :smiley:


I had this issue too, trying to migrate my 3.6TB node to a 16TB USB drive. It took days to rsync, so I skipped it.

The thing here is, however merging customer data into larger blobs were implemented, the storage node would still need a way to selectively delete a single customer file out of a blob. If the storage node then had to rewrite a whole blob just to delete a 4kB file, you’d have immense write amplification.

Besides, the file system is exactly the layer that is supposed to solve this problem. It is a large blob of data that is supposed to flexibly organize small files; there is no point in reimplementing the same logic inside the storage node. For example, if you set up a node on top of an LVM logical volume, then migrating that volume would be a simple and fast task. Or use a smart tool for copying file system data, like e2image or ntfsclone.
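As a hedged sketch of what that could look like with LVM (the volume group and device names are made up for illustration):

```sh
# move the node's data to a new disk at the block level, while the LV stays mounted
vgextend storj-vg /dev/sdY    # add the new disk to the volume group
pvmove /dev/sdX /dev/sdY      # migrate all extents off the old disk
vgreduce storj-vg /dev/sdX    # finally remove the old disk from the group

# or, for a plain unmounted ext4 partition, copy only the used blocks:
# e2image -ra -p /dev/sdX1 /dev/sdY1
```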


currently some of the numbers are just nonsense and chosen for the wrong reasons.
it will become better when the new solution gets implemented.
and everyone will benefit; working against the hardware is just a bad idea.

sure the blocks will be a bit bigger, but that doesn’t mean it has to be bad.
any modern hdd or other storage will work with a minimum of 4k.
so writing smaller blocks makes no sense.

especially if the IO of the storagenodes is part of what limits network speeds.

think about it for a moment: a hdd will do less than 400 iops of random reads/writes.
so 4k x 400 is like 1.6MB/s or 13Mbit/s

so grouping small io together really has no detrimental effects, because the HDDs are the true limitation in 99% of all cases.

also the scale at which the blocks grow is off… they don’t increase by a factor of 4k, and thus there is a mismatch between native HDD IO and Storj network IO

if you write 6k then HDD IO will still be 2 x 4k sectors written, or read…
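just to spell the arithmetic out (plain shell arithmetic, nothing node-specific):

```sh
echo $(( 400 * 4 ))                # ~400 random iops x 4 KiB = 1600 KiB/s, roughly 1.6 MB/s or 13 Mbit/s
echo $(( (6144 + 4095) / 4096 ))   # a 6 KiB write still costs ceil(6/4) = 2 physical 4k sectors
```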


Righto, moving from Windows Server and Storage Spaces over to TrueNAS Scale and ZFS. I have been using a simple pool on Storage Spaces, which scares me a lot, and since it’s basically a JBOD the performance is pretty dire.

Trying to decide what the record size should be; the default for ZFS is 128k, which seems reasonable I guess, but it would probably do better with a smaller size.
Here are the file counts and sizes for one of my nodes.
[image: file count and size distribution]
So 128k blocks would give ~10% write amplification.
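If I do change it, it would just be a per-dataset setting, roughly like this (the pool/dataset names are placeholders, and it only affects newly written files):

```sh
# check and change the record size on the dataset holding the node data
zfs get recordsize tank/storagenode       # default is 128K
zfs set recordsize=64K tank/storagenode   # hypothetical smaller value; existing files keep their old record size
```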

Just spent the day moving my node from a 4Tb disk to a 10Tb one…
the initial sync took nearly 3 days, then each sync after that only took 2 to 3 hours.

So it should have taken only a few hours to get the final sync done with the node offline.

But I got distracted and messed up my final rsync command, so instead of syncing the past day’s changes I started a new copy on the new drive. So I wasted half a day before I realised the mistake. (didn’t lose any data, just extra copies of the data) :smiley:

So after finally getting the correct commands in to copy the latest changes I started the node…

Only to be hit by the good old “malformed database” errors. So that was another hour or so of checking every database and repairing them… I guess repeated rsyncs on running DBs aren’t great!

So I’ll need to either exclude the databases until the node is offline or just set aside extra time to repair them if I ever need to move my node to a new drive.
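Something like this is roughly what I have in mind for next time (the paths and container name are placeholders, and the exact excludes depend on where your databases live):

```sh
# passes 1..n: copy while the node is still running, skip the sqlite databases
rsync -aP --exclude='*.db*' /mnt/old/storagenode/ /mnt/new/storagenode/

# final pass: stop the node, copy everything and delete stale files on the target
docker stop -t 300 storagenode
rsync -aP --delete /mnt/old/storagenode/ /mnt/new/storagenode/
```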

Anyhoo, it’s 2:30 am; I’m heading to bed once I double-check everything again… :smiley:


I had the same issue when I moved my nodes to other drives. In my experience the databases at the source were still good; only at the target were they malformed. My approach for node migration is now to use rsync like you, but before I do the last run I copy the database files manually to a tmp location, and after the last rsync I replace the database files in the target with these manually saved copies, so I make sure that exactly the same file is in the target path and in the source path. I don’t understand why rsync says it copied the files successfully, but they end up malformed…


@S0litiare and @striker43 The database corruption can happen if you did not run the last rsync with the --delete option (when the source node is stopped). Temp database files are merged into the databases when the node is stopped, so these temp files must be removed from the destination; otherwise sqlite3 will try to merge them again on the next start (when you start the destination node), and the result is malformed databases.

@TechAUmNu The default record size may not be good for the storagenode’s data, see


After a lot of research and finding a bunch of used drives on eBay, I am resolving this issue by switching to a setup much better suited to this workload. Hopefully with this amount of redundancy it will be nice and resilient for at least the next few years, and it will give plenty of room to grow as data flow on the network increases.

TrueNAS Scale on R720XD
ZFS record size 1MB with compression enabled
12x16TB EXOS (2x 6 disk RaidZ2 vdev)
3x1TB 970 EVO Plus NVMe 3-way mirror special vdev (metadata and small files <=64kB)
2x480GB S4610 2-way mirror SLOG (3PB of endurance)
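For my own notes, that layout translates to roughly the following (the device names are obviously placeholders; real disks should be referenced by /dev/disk/by-id):

```sh
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf \
  raidz2 sdg sdh sdi sdj sdk sdl \
  special mirror nvme0n1 nvme1n1 nvme2n1 \
  log mirror sdm sdn

zfs set recordsize=1M tank
zfs set compression=lz4 tank
zfs set special_small_blocks=64K tank   # blocks (and thus small files) <=64kB land on the special vdev SSDs
```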

I have a pile of 16GB M10 Optanes which might work better for the SLOG.

You don’t need such a big SLOG. Just calculate how much data you have to write in 10 seconds and you have the required size for the SLOG. More space will not change anything; the SLOG will still flush its content to the disk every 5 seconds.
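As a rough example, assuming a 1 Gbit/s connection (plug in your own link speed):

```sh
# ~1 Gbit/s is about 125 MB/s of incoming data at the absolute maximum;
# ZFS keeps roughly two in-flight transaction groups (~10 s) before they hit the pool
echo $(( 125 * 10 ))   # ~1250 MB, so a few GB of SLOG is already more than enough
```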

I did this for a while and finally removed the SLOG, because by default ZFS will do the same magic just with RAM. The SLOG would only be needed if the storage node stored pieces with a sync call, but it actually does only a lazy write operation that ZFS can cache for quite some time before flushing it to disk, in the same way a SLOG would.


personally i think mirrored SLOGs are overkill…
my reasoning being it’s barely ever used… at least in most operation.
it’s essentially a backup system, and the SLOG is completely expendable until the day your system loses its mind, hard crash, power outage or such…

only then will the SLOG data be read back and recovered, rather than just being written and later discarded.
so really, a PLP feature on the SLOG SSD is IMHO a more important factor than a mirrored SLOG.
and i did have 140 random reboots with a non-PLP SATA SSD SLOG and it worked fine…
i’m guessing it was written fast enough that there was never enough data missing to be noticed.

the only good reason to do a mirrored SLOG is for a poorly monitored system… one might not notice when the ssd is bad, and if it’s super critical data then having redundancy is ofc best… but it’s a bit of a redundant solution on a redundant system, which is rarely used, and in most cases the damage incurred is minimal even without a SLOG, due to how CoW works.

i use 6GB for my SLOG… because i was pondering doing 10Gbit networking… so 5 sec x 1200MB/s equals a max of 6000MB (1200MB/s ≈ 10Gbit) … like littleskunk says.
i think the max i’ve ever seen used is 400MB, and really it’s usually a lot less… i guess because most writes aren’t sync writes… which are the only writes that really go to the SLOG.

@TechAUmNu i would strongly advise you to revisit the specs for the M10 Optane disks, since i’m fairly sure their write performance is sort of terrible… they are generally a read cache drive, which is where they shine, due to high iops / bandwidth and low latency.

i think my OS SSD doubles as my ZFS pool SLOG
have tinkered a ton with ZFS, and the SLOG is… nice to have… but it’s not worth much; i wouldn’t do without it tho…
because the sync writes which would go to the SLOG will instead be forced onto the HDDs if the SLOG is missing, which in high workloads will slow down the storage performance since it will be written in a less sequential way… i think…
but it might not help much, and it is a bit into the weeds…

not all of my pools run with SLOGs, but the few important ones do

Ok, so maybe use the S4610 as OS and SLOG then? They have full PLP so power loss is not an issue. I was originally going to use some M10s for the OS, as most people seem to advise not to use the same disk for OS and SLOG.

Will probably just not use the M10 then and save some power / pcie lanes.

If you use SLOG, you should use a mirror, otherwise you risk losing data. If there is an unexpected reboot and the SLOG device fails at that time, you might lose the pool.

i don’t think power loss would ever destroy an entire pool; at worst you could lose a few files… that’s about it… afaik
and so far, even with insane abuse over multiple years and having many pools and disks, i have yet to see any data loss.

ofc if it’s super critical stuff, a SLOG is so tiny it’s not much of a waste… because it could be fitted onto other stuff like an OS mirror… i should really get my OS on a mirror :smiley:

oh the S4610, that’s an intel DC drive, a pretty nice drive; sadly sata really slows down its performance. i got a P3600, an S3600 and an S4600
don’t think any of them are the 10 models or such… but that is usually just a later, slightly better version; my P3600 will outperform my S4600 in random IOPS at Q1D1 by a wide margin, even tho the S4600 should be faster…
these intel drives are built like a tank… i only managed to put 9% wear on my S4600, not for a lack of trying lol… now it’s hosting vm storage :smiley: because i didn’t want to expose it to the amount of wear that i did… only had it for like a year…
it’s a great drive, much faster than the S3600
but the S3600 has an insane write endurance.
my OS is on the S3600 400GB because it was a nice size for that

Ended up using a pair of Radian RMS-200 8GB PCIe NVRAM cards for the SLOG. Had to replace the fans to stop them overheating though, so not really great for low power.

Debating if having them is actually worth it given how almost all the data will be async writes.

These are insane overkill… that being said…
when i initially started using ZFS i did dig into how it operates at larger scales; apparently, to really utilize the SLOG devices to their full potential and to actually get the effect of a true write cache on ZFS, people will run the pool with
zfs set sync=always

this makes all writes synced and thus they will all be forced into the SLOG, and the data writes to the slower storage then become more sequential rather than random IO.
supposedly this reduces fragmentation…
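in practice it’s just one property, so it’s cheap to try and revert (the pool name is a placeholder):

```sh
zfs set sync=always tank     # force every write through the SLOG
zfs get sync tank            # verify the current setting
zfs set sync=standard tank   # revert if the SSD turns out to be the bottleneck
```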

i tried to make this work for a long time… but truly random IO at Q1D1 is a very heavy workload even for most modern SSDs, and thus i would always run into bottlenecks on my SSDs

so it might be worth a shot; since these are RAM based they should easily do like 500k write IOPS at Q1D1

these days i just run my pool with sync=standard and it seems to work just fine, but the SLOG really isn’t seeing much use in that setting.

running sync=always is basically a failsafe; so long as your pool doesn’t die, data integrity should be 99.9999%, maybe with a few more 9’s added… :smiley:

and no matter how much resiliency ZFS has, stuff can still go wrong; it would ofc be over the top… but i would be very interested in hearing how it pans out.
i was considering getting a device such as those…

a side note also… i don’t think there is any real benefit from running them as a mirror…
the SLOG is basically a redundant device that usually will only be used in case of power outages and other such ZFS emergencies…

else everything is in main memory anyway.
the reason major businesses and such might run these types of devices in a mirror is in case one goes bad and nobody notices…
but really, with a mirrored solution, if one device dies or starts to misbehave the other one will cover for it… so again, very little reason to run two…

ofc if one wants to add some more 9’s to the data integrity %, i suppose it might help very, very rarely… but then we are at the scale of banking information, where a wrong byte could be a disaster…

which would most likely be caught by underlying software solutions anyways… so… yeah
i’m just ranting at this point lol