Upcoming storage node improvements including benchmark tool

Over the past week several developers looked into storage node performance. We started the week with a benchmark tool that creates all the files a storage node would create, at maximum speed. How fast can we write pieces to spinning disks? We know there is some low-hanging fruit that would give us better storage node performance. How big is the performance gain? Which of the possible code changes gives us the biggest improvement? Is it worth the risk? So here I am showing you some early results of that work, including a benchmark tool you can run on your storage node.

  1. Writing a piece to disk comes with overhead. The storage node writes the piece into a temp folder first, renames and moves it into the final location, and tells the filesystem to sync the piece to disk. We can shorten that process: write the piece once to the final location and don’t call fsync. This gives us an impressive performance improvement, but without the fsync call a power loss becomes a risk. → It will get implemented as a feature flag. By default storage nodes will take this shortcut, but storage node operators can opt in to the old behavior with the extra safety the fsync call provides.

  2. Bandwidth tracking with an SQLite DB. Every upload writes into that SQLite DB. I don’t think we have come to a final conclusion yet. There have been talks about removing the bandwidth tracking from uploads. In that case storage nodes wouldn’t see that data on the storage node dashboard anymore. Later we could replace it with usage data provided by the satellite.
    Another attempt would be to batch the writes into the SQLite DB. If we are lucky that gives us similar performance gains and we can keep it. Further benchmarks are needed to make a decision.

  3. TTL is also a SQLite DB. Not every upload has a TTL, but the use cases we are looking into will use it more frequently. Again, each TTL is a write into that SQLite DB plus a cleanup job that later deletes these pieces. There have been talks to remove the TTL DB and use garbage collection instead. Don’t panic. I already provided the feedback… I believe garbage collection is already off the table. For now let’s say further benchmarks are needed, and I would expect that garbage collection is too expensive to use for a TTL-heavy use case. I believe the latest benchmark is also batching the writes into the SQLite DB. I take that as further evidence that we can keep the advantages of a TTL DB over garbage collection.
    Another alternative would be to use flat text files instead of SQLite similar to how the storage node manages orders nowadays. I don’t know how far we tested this alternative nor how difficult it would be to implement it.
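To make the trade-off in item 1 concrete, here is a minimal sketch of the two write paths. It is written in Python for brevity and is not the storage node's actual Go code; the function names and the `temp` folder layout are assumptions made up for this illustration.

```python
import os
import tempfile


def write_piece_safe(storage_dir: str, name: str, data: bytes) -> str:
    """Old path: write into a temp folder, fsync, then rename into place."""
    temp_dir = os.path.join(storage_dir, "temp")  # hypothetical layout
    os.makedirs(temp_dir, exist_ok=True)
    final_path = os.path.join(storage_dir, name)
    fd, temp_path = tempfile.mkstemp(dir=temp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # block until the OS reports the data is on disk
        os.replace(temp_path, final_path)  # atomic rename on the same filesystem
    except BaseException:
        # Do not leave half-written temp files behind.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
    return final_path


def write_piece_fast(storage_dir: str, name: str, data: bytes) -> str:
    """Shortcut: write once to the final location and skip fsync."""
    os.makedirs(storage_dir, exist_ok=True)
    final_path = os.path.join(storage_dir, name)
    with open(final_path, "wb") as f:
        f.write(data)  # the OS flushes this to disk whenever it sees fit
    return final_path
```

The fast path saves one rename and, more importantly, the wait for the disk. The price is that a piece written just before a power cut may be missing or incomplete afterwards.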

That is the top 3 so far. There are a few more experimental changes. I will try to post updates from time to time. However, posting here isn’t my first priority, so you will get the information with a delay. My first priority will be to run the benchmark tests on my storage node and keep the feedback loop as fast as possible. The faster I can return some numbers, the more time the developers have to improve it. I still hope I can find a few minutes here and there to keep you updated.

Ok now to the benchmark tool that you are all waiting for. We don’t have binaries and it isn’t even merged. The pull request still gets updates frequently. So the installation is a bit more tricky this time:

git clone https://github.com/storj/storj
cd storj
git fetch https://review.dev.storj.io/storj/storj refs/changes/53/13053/16 && git checkout FETCH_HEAD
go install ./cmd/tools/piecestore-benchmark/

And now the execution:

cd /mnt/hdd
mkdir benchmark
cd benchmark
piecestore-benchmark -pieces-to-upload 10000
# copy the result and potentially look into the tracefile the tool creates
rm -r *
piecestore-benchmark -pieces-to-upload 10000 -disable-sync
# the expected performance gain with fsync disabled. Again the tracefile can be interesting.

You might have to start with a lower piece count. My target was to hit something between 1 and 10 minutes of runtime.
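One way to pick the piece count is to run a short trial and scale it to the runtime you want. A rough sketch in Python follows; the helper name is made up, and it assumes throughput stays roughly constant as the piece count grows, which may not hold once caches fill up.

```python
def pieces_for_target(trial_pieces: int, trial_seconds: float,
                      target_seconds: float) -> int:
    """Scale a short trial run to a piece count for the desired runtime."""
    if trial_pieces <= 0 or trial_seconds <= 0 or target_seconds <= 0:
        raise ValueError("all arguments must be positive")
    pieces_per_second = trial_pieces / trial_seconds
    return max(1, round(pieces_per_second * target_seconds))
```

For example, if a hypothetical trial of 1000 pieces took 60 seconds, `pieces_for_target(1000, 60, 300)` suggests about 5000 pieces for a 5 minute run.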

For fun I also tried different ZFS settings. The benchmark tool is really useful for trying out different settings. But this first version of the benchmark isn’t the final one. I will have to repeat optimizing my system with later, more accurate versions of the benchmark. One more detail: this benchmark is using TTLs for all uploaded pieces, so the results it gives you do not represent your current performance. The same goes for the maximum performance gain.

There are also some other params to play with. I still need to run the benchmark with higher concurrency and different piece sizes, just to check whether the expected performance gain is the same across the spectrum or whether, worst case, the benchmark is too far off from the concurrency or piece size a production node would see. By testing out some extreme values I will get a better feeling for which params have a measurable impact on performance and which don’t.

16 Likes

Thank you, this is really important! I’m glad that many ideas for improvement that were considered on the forum are still considered by Storj. For reference I’m linking some of the posts that discussed them.

I’ve got my own patch that updates bandwidth.db as orders are sent, based on data from the orders file. It also aggregates data before inserting, so that the database has less to do. I haven’t actually measured how much, or whether it helps. I’m not running it on my nodes either. Though, I believe this would be the cheapest way of implementing this feature.
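To illustrate what aggregating before inserting could look like, here is a small Python sketch using the standard `sqlite3` module. It is not the patch itself, and the table name, columns and record layout are invented for the example; they do not match the real bandwidth.db schema.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS bandwidth_usage (
    satellite TEXT, action TEXT, hour INTEGER, amount INTEGER,
    PRIMARY KEY (satellite, action, hour)
)
"""


def aggregate(records):
    """Sum bytes per (satellite, action, hour) so each group becomes one row."""
    totals = defaultdict(int)
    for satellite, action, hour, amount in records:
        totals[(satellite, action, hour)] += amount
    return totals


def flush(conn: sqlite3.Connection, totals) -> None:
    """Apply all aggregated rows inside a single transaction."""
    with conn:  # one commit at the end, rollback on error
        for (satellite, action, hour), amount in totals.items():
            cur = conn.execute(
                "UPDATE bandwidth_usage SET amount = amount + ? "
                "WHERE satellite = ? AND action = ? AND hour = ?",
                (amount, satellite, action, hour),
            )
            if cur.rowcount == 0:  # no row for this group yet
                conn.execute(
                    "INSERT INTO bandwidth_usage VALUES (?, ?, ?, ?)",
                    (satellite, action, hour, amount),
                )
```

The point is that many transfers turn into a handful of row updates and one commit, instead of one write per transfer.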

Comparison of these alternatives would certainly be interesting!

5 Likes

Thanks for the update, very excited to see development efforts towards improving node performance.

I have run the benchmark just to see what’s going on. It appears to upload, by default, 592 MiB.

First, a 2TB Gen 4 NVMe SSD on XFS just to see what we are looking at for best case:

uploaded 10000 pieces in 11.170603163s (52.99 MiB/s)
collected 10000 pieces in 1.183242348s (500.26 MiB/s)

-disable-sync:

uploaded 10000 pieces in 4.804178384s (123.21 MiB/s)
collected 10000 pieces in 1.097629311s (539.28 MiB/s)

Next, on a 16TB 5400RPM WD HDD on ZFS (atime=off, ashift=12):

uploaded 10000 pieces in 10m19.804190076s (0.96 MiB/s)
collected 10000 pieces in 30.651604559s (19.31 MiB/s)

-disable-sync:

uploaded 10000 pieces in 17.507504232s (33.81 MiB/s)
collected 10000 pieces in 34.827751212s (17.00 MiB/s)

16TB WD 5400RPM HDD on ZFS but with (sync=disabled, primarycache=metadata, atime=off, ashift=12), 2TB free :

uploaded 10000 pieces in 52.105770351s (11.36 MiB/s)
collected 10000 pieces in 3.472993569s (170.44 MiB/s)

-disable-sync:

uploaded 10000 pieces in 48.629827529s (12.17 MiB/s)
collected 10000 pieces in 3.252231606s (182.01 MiB/s)

14TB WD 5400RPM HDD on XFS (rw,noatime,nouuid,attr2,inode64,logbufs=8,logbsize=32k,noquota), 2TB free:

uploaded 10000 pieces in 18m5.630085582s (0.55 MiB/s)
collected 10000 pieces in 13.479955201s (43.91 MiB/s)

-disable-sync:

uploaded 10000 pieces in 14m13.397238796s (0.69 MiB/s)
collected 10000 pieces in 10.788055713s (54.87 MiB/s)

About 30x improvement when using --disable-sync in certain cases. With ZFS sync=disabled it seems to be slower than with it enabled. I’ve run this multiple times to verify.
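For anyone who wants to compare runs without doing the math by hand, here is a small Python sketch that parses the result lines and computes the speedup. The regular expression is my own guess at the format based on the output above; it only handles durations made of hours, minutes and seconds, not sub-second units such as `ms`.

```python
import re

# Matches lines like: "uploaded 10000 pieces in 10m19.804190076s (0.96 MiB/s)"
LINE_RE = re.compile(
    r"(uploaded|collected) (\d+) pieces in "
    r"(?:(\d+)h)?(?:(\d+)m)?(\d+(?:\.\d+)?)s \(([\d.]+) MiB/s\)"
)


def parse_result(line: str):
    """Return (phase, pieces, seconds, mib_per_s) for one result line."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognized result line: {line!r}")
    phase, pieces, hours, minutes, seconds, rate = m.groups()
    total = int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds)
    return phase, int(pieces), total, float(rate)


def speedup(baseline_line: str, candidate_line: str) -> float:
    """How many times faster the candidate run finished than the baseline."""
    base_seconds = parse_result(baseline_line)[2]
    cand_seconds = parse_result(candidate_line)[2]
    return base_seconds / cand_seconds


zfs_sync = "uploaded 10000 pieces in 10m19.804190076s (0.96 MiB/s)"
zfs_nosync = "uploaded 10000 pieces in 17.507504232s (33.81 MiB/s)"
print(f"{speedup(zfs_sync, zfs_nosync):.1f}x")  # about 35x for the ZFS HDD runs
```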

I’m not sure if currently the databases and the IO used by them are also purely in RAM. Maybe a new feature would be to replicate the current node’s ability to store the DBs in a different folder.

Something might be up with my XFS disk. Hope someone else can also run this on XFS.

4 Likes

Hey, great job! Are you planning such a testing tool for Windows node operators someday? 🙂

An old 2016 Intel DC P3700 Series 2 TB SSD PCIe 3.0 addon card running EXT4 with atime=off - db’s for all nodes are placed on this SSD :

uploaded 10000 pieces in 4.245792625s (139.41 MiB/s)
collected 10000 pieces in 3.982914422s (148.62 MiB/s)

--disable-sync:

uploaded 10000 pieces in 3.722261472s (159.02 MiB/s)
collected 10000 pieces in 4.048562214s (146.21 MiB/s)

20 TB Seagate Exos X20 SAS (ST20000NM002D) on a Broadcom 9480-8i8e ctrl. running EXT4 with atime=off :

uploaded 10000 pieces in 9.654876194s (61.31 MiB/s)
collected 10000 pieces in 3.319280303s (178.33 MiB/s)

--disable-sync:

uploaded 10000 pieces in 4.330196856s (136.70 MiB/s)
collected 10000 pieces in 3.074334079s (192.54 MiB/s)

Update:

Bcache setup :
backing device : 8 x 24 TB Seagate x24 SAS (ST24000NM007H) on a broadcom 9670W-16i ctrl. in RAID0
cache device : 4 x 6,4 TB Kioxia CD8-V NVMe (KCD81VUG6T40) on another broadcom 9670W-16i ctrl. in RAID0
Filesystem is XFS with relatime, attr2, inode64, logbufs=8, logbsize=32k, noquota

uploaded 10000 pieces in 4.251765411s (139.22 MiB/s)
collected 10000 pieces in 4.025170484s (147.06 MiB/s)

--disable-sync:

uploaded 10000 pieces in 3.818703243s (155.01 MiB/s)
collected 10000 pieces in 4.112259897s (143.94 MiB/s)

8 x Samsung SSD 860 QVO 4TB on a broadcom 9560-16i ctrl. in RAID6 running EXT4 with relatime,stripe=256 :

uploaded 10000 pieces in 3.314124315s (178.61 MiB/s)
collected 10000 pieces in 1.345277041s (440.00 MiB/s)

--disable-sync:

uploaded 10000 pieces in 2.798058083s (211.55 MiB/s)
collected 10000 pieces in 1.202291934s (492.33 MiB/s)

Samsung 980 PRO with Heatsink 1TB using onboard M.2 PCIe running EXT4 with relatime :

uploaded 10000 pieces in 38.000706174s (15.58 MiB/s)
collected 10000 pieces in 2.39555638s (247.09 MiB/s)

--disable-sync:

uploaded 10000 pieces in 5.491355055s (107.79 MiB/s)
collected 10000 pieces in 2.415327668s (245.07 MiB/s)

Th3Van.dk

5 Likes

That’s a brilliant idea. Could you make that code change a pull request? I think for legal reasons we can’t just copy your code change and have to follow the pull request process.

2 Likes

Tested on a live system (the HDD is not live, but the SSD and the rest of the system are) in a jail (which may be why --disable-sync doesn’t matter in my results). The ZFS RAM cache is currently at 25GB (12 free).

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( Sync, atime, checksum, 128K )

uploaded 10000 pieces in 33.908743178s (17.46 MiB/s)
collected 10000 pieces in 3.242856219s (182.53 MiB/s)
 --disable-sync
uploaded 10000 pieces in 31.030909737s (19.08 MiB/s)
collected 10000 pieces in 3.017162922s (196.19 MiB/s)

Log getting hit, cache not ( guessing read from memory )

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( atime, checksum, 128K )

uploaded 10000 pieces in 3.35861653s (176.24 MiB/s)
collected 10000 pieces in 2.144224621s (276.06 MiB/s)

 --disable-sync
uploaded 10000 pieces in 3.135761907s (188.77 MiB/s)
collected 10000 pieces in 2.058406857s (287.57 MiB/s)

Log no hit, cache no hit ( guessing read/write to/from memory )

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 128K )

uploaded 10000 pieces in 3.264675207s (181.31 MiB/s)
collected 10000 pieces in 2.020443071s (292.97 MiB/s)
 --disable-sync
uploaded 10000 pieces in 3.130080354s (189.11 MiB/s)
collected 10000 pieces in 1.910223777s (309.87 MiB/s)

Log no hit, cache no hit ( guessing read/write to/from memory )

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 8K)

uploaded 10000 pieces in 3.603523714s (164.26 MiB/s)
collected 10000 pieces in 2.148112986s (275.56 MiB/s)
 --disable-sync
uploaded 10000 pieces in 3.393355051s (174.44 MiB/s)
collected 10000 pieces in 2.364230955s (250.37 MiB/s)

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 16K)

uploaded 10000 pieces in 3.415104117s (173.33 MiB/s)
collected 10000 pieces in 2.159593704s (274.09 MiB/s)

 --disable-sync
uploaded 10000 pieces in 3.265120535s (181.29 MiB/s)
collected 10000 pieces in 2.120292834s (279.17 MiB/s)

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 32K)

uploaded 10000 pieces in 3.198371463s (185.07 MiB/s)
collected 10000 pieces in 1.97533838s (299.66 MiB/s)
 --disable-sync
uploaded 10000 pieces in 3.214157027s (184.16 MiB/s)
collected 10000 pieces in 1.971362811s (300.26 MiB/s)

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 64K)

uploaded 10000 pieces in 3.118440809s (189.81 MiB/s)
collected 10000 pieces in 1.994945083s (296.71 MiB/s)
 --disable-sync
uploaded 10000 pieces in 3.054073912s (193.82 MiB/s)
collected 10000 pieces in 1.867943954s (316.89 MiB/s)

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 256K)

uploaded 10000 pieces in 3.361925577s (176.07 MiB/s)
collected 10000 pieces in 1.940495814s (305.04 MiB/s)
 --disable-sync
uploaded 10000 pieces in 3.24679706s (182.31 MiB/s)
collected 10000 pieces in 1.932856583s (306.24 MiB/s)

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 512K)

uploaded 10000 pieces in 13.758766129s (43.02 MiB/s)
collected 10000 pieces in 2.812922737s (210.43 MiB/s)
 --disable-sync
uploaded 10000 pieces in 12.942542327s (45.73 MiB/s)
collected 10000 pieces in 2.717829207s (217.79 MiB/s)

Had cache hits

WDC WD8001PURP-74B6RY0 with 1Gb Log and 10Gb cache ( slice from a Samsung SSD 970 PRO 1TB ) on ZFS ( checksum, 1M)


uploaded 10000 pieces in 34.318900473s (17.25 MiB/s)
collected 10000 pieces in 3.711020785s (159.51 MiB/s)
 --disable-sync
uploaded 10000 pieces in 33.610367713s (17.61 MiB/s)
collected 10000 pieces in 4.016253385s (147.38 MiB/s)

Had cache hits

Removing cache drives:

WDC WD8001PURP-74B6RY0 on ZFS ( atime, checksum, 128K)

uploaded 10000 pieces in 3.184873093s (185.86 MiB/s)
collected 10000 pieces in 1.918729018s (308.50 MiB/s)
 --disable-sync

uploaded 10000 pieces in 3.127136178s (189.29 MiB/s)
collected 10000 pieces in 1.89322634s (312.65 MiB/s)

WDC WD8001PURP-74B6RY0 on ZFS ( checksum, 128K)

uploaded 10000 pieces in 3.154783959s (187.63 MiB/s)
collected 10000 pieces in 2.006969385s (294.94 MiB/s)
 --disable-sync
uploaded 10000 pieces in 3.142660694s (188.35 MiB/s)
collected 10000 pieces in 1.918044302s (308.61 MiB/s)

For ZFS with plenty of memory, it seems disabling atime and sync is all you need.

This is too technical for me, but what are the risks of keeping fsync off?
I don’t really understand what it does, and what the difference between on and off is for the storagenode.
Would those modifications increase the I/O to the db files? We already saw some db locks from operators with slow systems (HDD on USB and low RAM, etc.). Does this mean that we all need to move the db files to an SSD? What about the systems that lack the option?
Maybe a testing tool for SQLite db I/O would be useful…
Aren’t txt files quicker and more reliable than DBs?

Using fsync means Storj is waiting for the storage/HDD to confirm writing some data to disk before it proceeds: it’s basically a promise the data is safe: even if the power suddenly got cut off.

When fsync is off that means the OS/disk/caching-layer immediately tells Storj “I got this” and will write to the disk when-it-can/when-its-fast… so Storj can immediately proceed with whatever it had planned next. However that data may only be held in memory or a caching layer waiting to be written… so it could in theory be lost if the power went out.

There are SSD-caching and journalling and HBA-battery-backed schemes of various sorts to protect from data loss during power failures… so you can typically skip fsync… but the fsync option is still the safest (but slowest).

To me… the way Storj works… with encoded redundancy… even if a node lost power and a couple baby files didn’t make it to the HDD… nothing of value would be lost. Like maybe the node had a 1-in-a-million chance of failing a 1-in-a-million audit… but still there are up to 80 other copies of that data out there to rebuild from. So it doesn’t make sense for SNOs to eat the fsync=on performance penalty if customers don’t benefit from increased resiliency.

4 Likes

I see. At first glance, you would think that you can be DQ’d for losing pieces, but it’s hard to reach 4% of data loss through power outages on a stable electrical network.
But the best way would be to make it optional. Operators who know their setup and how reliable it is can opt in. Those who have problems with sudden restarts or shutdowns can opt out.

You need Go installed to compile the tool. I think you should just wait for the eventual binaries.

2 Likes

What is the actual risk of disabling sync? At most, you’d lose whatever pieces were in-flight, which couldn’t be enough to fail an audit alone, right?

We can’t disable fsync right now. You need to modify the code…

I do not have much time to prepare a PR right now, but if you’re willing to wait 2-3 weeks, I’ll do so. If you want it faster, then…

ɪ ʜᴇʀᴇʙʏ ᴅᴇᴄʟᴀʀᴇ ᴛʜᴀᴛ ɪ ᴀᴍ ᴛʜᴇ ᴀᴜᴛʜᴏʀ ᴏꜰ ᴛʜᴇ ᴄᴏɴᴛᴇɴᴛꜱ ᴏꜰ ᴀ ᴘᴀᴛᴄʜ ꜰɪʟᴇ ᴀᴛ ʜᴛᴛᴘꜱ://ɢɪꜱᴛ.ɢɪᴛʜᴜʙ.ᴄᴏᴍ/ʟɪᴏʀɪ/4668ᴇ511ᴀ3ᴀꜰᴇᴄ7ᴄᴅ14ᴅ5069ʙ8ʙ75927, ᴀɴᴅ ɪ ᴅᴇᴅɪᴄᴀᴛᴇ ᴛʜɪꜱ ᴡᴏʀᴋ ᴛᴏ ᴛʜᴇ ᴘᴜʙʟɪᴄ ᴅᴏᴍᴀɪɴ, ᴏʀ ɪɴ ᴊᴜʀɪꜱᴅɪᴄᴛɪᴏɴꜱ ᴡʜᴇʀᴇ ᴛʜɪꜱ ɪꜱ ɪᴍᴘᴏꜱꜱɪʙʟᴇ, ɪ ᴀᴍ ᴍᴀᴋɪɴɢ ɪᴛ ᴀᴠᴀɪʟᴀʙʟᴇ ᴛᴏ ᴇᴠᴇʀʏᴏɴᴇ ᴜɴᴅᴇʀ ᴛʜᴇ ʀᴜʟᴇꜱ ᴏꜰ ᴄʀᴇᴀᴛɪᴠᴇ ᴄᴏᴍᴍᴏɴꜱ ᴄᴄ0 1.0 ᴜɴɪᴠᴇʀꜱᴀʟ ᴘᴜʙʟɪᴄ ᴅᴏᴍᴀɪɴ ᴅᴇᴅɪᴄᴀᴛɪᴏɴ ʟɪᴄᴇɴꜱᴇ.

Also, there’s a potential issue with the patch, as I am not 100% sure that the number in the order is exactly the bandwidth accounted per usual rules. I suspect it is, but in case it isn’t, a small change to the order file format would be necessary.

BTW, seeing the numbers posted here I’m worried that there is no way to replicate insufficient RAM to keep metadata cached.

3 Likes

@Toyoo - if you could fill out Storj Labs Software Grant and Contributor License Agreement v2 (“Agreement”) we can turn your patch file into a Github commit with attribution to you. I like your dedication to the public domain, thank you, though I think it will make our bookkeeping a bit easier if you fill out the above form.

4 Likes

The steps to compile it and use it under Windows are the same. You need a Go install, and perhaps devtools, or you may use these steps to install dependencies:

then from the PowerShell:

and run it in the PowerShell:

cd x:\storagenode
rm -Recurse *

They are not copies. Each piece is unique, but any small subset of them is enough to reconstruct the file. This is the beauty of erasure coding.
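A toy example may help to see why unique pieces can still rebuild the data. The Python sketch below uses simple XOR parity, where any 2 of 3 pieces are enough. This is only an illustration of the idea; it is not the scheme the network uses, which relies on a far more capable code with many more pieces.

```python
def encode(data: bytes) -> list:
    """Split data into two halves and add one XOR parity piece (3 pieces)."""
    if len(data) % 2:
        data += b"\x00"  # pad to even length; the caller keeps the real length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]


def decode(pieces: list, length: int) -> bytes:
    """Rebuild the data from any two pieces; the missing piece is None."""
    if sum(p is None for p in pieces) > 1:
        raise ValueError("need at least two of the three pieces")
    a, b, parity = pieces
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return (a + b)[:length]
```

All three pieces are different from each other, yet losing any single one of them does not lose the data.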

1 Like

I would like to say that having local stats, not based on what the satellite reports is an important tool to verify and compare. Both to show Storj pays reliably and correctly, as well as to check that there aren’t any issues with order sending. In my opinion this should never be replaced with satellite stats only. I saw the discussion on adding it to the db during order sending and I think that’s a good idea, as long as it’s added to the db even if the order sending fails for some reason.

6 Likes

Performance improvements are all merged. Time for one final update to keep you all informed.

Let’s start with the important information first. There is now a feature flag for nodes that require safety over performance: --filestore.force-sync=true

Also, I am not sure how accurate the bandwidth tracking on the storage node dashboard will be. Once upon a time we had the problem that the dashboard was showing the order size instead of the transferred bytes. I would expect a similar situation again. If you see a broken dashboard in production, it is not because we didn’t test it. It is because we already made the decision that a broken dashboard is still worth the performance gain. We can worry about the dashboard later. As a workaround I would suggest using Grafana instead. That should still show the correct value.

We also need to increase the free space buffer from 500MB to 5GB. Because of the TTL, nodes will switch between full and free space more frequently. Combined with a higher upload rate, a 500MB buffer would give the customer a bad experience: the customer would get a decent number of failed uploads. By increasing the free space buffer we can make sure the storage node can accept 100MBit/s for 5 more minutes after passing the threshold. I am not saying that a customer would ever upload such an amount, but technically the storage nodes are capable of handling it. So we can just as well increase the free space buffer to make it possible for both sides.
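The buffer sizes can be sanity-checked with a quick calculation. The Python sketch below assumes decimal units (1 MB = 1,000,000 bytes, 1 Mbit = 1,000,000 bits) and a constant ingress rate; the function name is made up for this example.

```python
def seconds_until_full(buffer_bytes: float, ingress_mbit_per_s: float) -> float:
    """Time a free space buffer lasts at a constant ingress rate."""
    bytes_per_second = ingress_mbit_per_s * 1_000_000 / 8
    return buffer_bytes / bytes_per_second


old_buffer = seconds_until_full(500e6, 100)  # 500MB at 100MBit/s
new_buffer = seconds_until_full(5e9, 100)    # 5GB at 100MBit/s
print(old_buffer, new_buffer)  # 40.0 400.0
```

Under these assumptions the old buffer lasts 40 seconds and the new one 400 seconds, a little more than the 5 minutes mentioned above.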

Now one final request if you don’t mind. We would love to get some final benchmark results especially from systems that we haven’t tested yet. For example a Windows node or some strange file system. Most if not all of our internal tests have been ZFS and ext4.

git clone https://github.com/storj/storj
cd storj
git fetch https://review.dev.storj.io/storj/storj refs/changes/99/13099/3 && git checkout FETCH_HEAD
go install ./cmd/tools/piecestore-benchmark/

For executing this version of the benchmark you don’t need any extra flags. Soon this final benchmark will get merged onto the main branch. The idea is that this benchmark can be used to test out different setups.

In terms of target performance we have blasted away all of our goals. The performance gain is higher than needed and even higher than we ever dreamed of. Seriously you have to test this out on your machine otherwise you will not believe it. Here is what my slowest hard drive can handle now:

# old code
piecestore-benchmark  -pieces-to-upload 100000
uploaded 824633807224 pieces in 24m8.519875266s (4.09 MiB/s)
# new code with ZFS sync=standard
piecestore-benchmark  -pieces-to-upload 100000
uploaded 100000 pieces in 29.534088576s (200.42 MiB/s)
collected 100000 pieces in 16.178403936s (365.87 MiB/s)
# new code with ZFS sync=disabled
piecestore-benchmark  -pieces-to-upload 100000
uploaded 100000 pieces in 23.438063319s (252.55 MiB/s)
collected 100000 pieces in 14.063208782s (420.90 MiB/s)
10 Likes

I can’t wait to see this in action: it will mean I’ve finally filled a HDD! 😉