Over the past week several developers looked into storage node performance. We started the week with a benchmark tool that creates all the files a storage node would create, at maximum speed. How fast can we write pieces to spinning disks? We know there is some low-hanging fruit that would give us better storage node performance. How big is the performance gain? Which of the possible code changes gives us the biggest improvement? Is it worth the risk? So here I am showing you some early results of that work, including a benchmark tool you can run on your storage node.
-
Writing a piece to disk comes with an overhead. The storage node writes the piece into a temp folder first, renames and moves it into its final location, and tells the filesystem to sync the piece to disk. We can shorten that process: write the piece once to the final location and don’t call fsync. This gives us an impressive performance improvement, but without the fsync call a power loss is a risk. → It will get implemented as a feature flag. By default storage nodes will take this shortcut, but storage node operators can opt in to the old behavior with the extra safety the fsync call provides.
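To make the difference concrete, here is a minimal Go sketch of the two write paths. This is not the actual storagenode code; the temp-folder layout and the exact ordering of the steps are assumptions. It only illustrates the extra temp file, rename and fsync the current path pays for on every piece.

```go
// Minimal sketch of the two write paths, not the actual storagenode code.
// The temp-folder layout and the exact ordering may differ in the real implementation.
package sketch

import (
	"os"
	"path/filepath"
)

// writeWithSync mirrors the current behavior: write the piece into a temp
// folder, move it to the final location, and sync it to disk. A power loss
// leaves either no piece or a complete piece.
func writeWithSync(dir, name string, data []byte) error {
	tmp := filepath.Join(dir, "temp", name+".partial")
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // fsync: force the data onto the platters
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, name))
}

// writeShortcut is the proposed fast path: write once to the final location
// and skip the fsync. Much less disk work per piece, but a power loss can
// leave a truncated piece behind.
func writeShortcut(dir, name string, data []byte) error {
	return os.WriteFile(filepath.Join(dir, name), data, 0o644)
}
```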
-
Bandwidth tracking with an SQLite DB. Every upload writes into that SQLite DB. I don’t think we have come to a final conclusion yet. There have been talks about removing the bandwidth tracking from uploads. In that case storage nodes wouldn’t see that data on the storage node dashboard anymore. Later we could replace it with usage data provided by the satellite.
Another approach would be to batch the writes into the SQLite DB. If we are lucky that gives us similar performance gains and we can keep the bandwidth tracking. Further benchmarks are needed to make a decision.
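To give you an idea of what batching could look like, here is a rough Go sketch (not the actual storagenode schema; the table and column names are invented): instead of one write per upload, the rows are buffered in memory and flushed in a single SQLite transaction.

```go
// Sketch of batched bandwidth writes: buffer rows in memory and flush them in
// one transaction, so SQLite commits (and syncs) once per batch instead of
// once per upload. Table and column names are invented for illustration.
package sketch

import (
	"database/sql"
	"time"
)

type bandwidthRow struct {
	satellite string
	action    int
	amount    int64
	createdAt time.Time
}

// flush assumes db is a *sql.DB opened with an SQLite driver.
func flush(db *sql.DB, batch []bandwidthRow) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	stmt, err := tx.Prepare(`INSERT INTO bandwidth_usage (satellite, action, amount, created_at) VALUES (?, ?, ?, ?)`)
	if err != nil {
		tx.Rollback()
		return err
	}
	defer stmt.Close()
	for _, row := range batch {
		if _, err := stmt.Exec(row.satellite, row.action, row.amount, row.createdAt); err != nil {
			tx.Rollback()
			return err
		}
	}
	// One commit for the whole batch instead of one per upload.
	return tx.Commit()
}
```
-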
TTL is also a SQLite DB. Not every upload has a TTL, but the use cases we are looking into will use it more frequently. Again, each TTL is a write into that SQLite DB, plus a cleanup job that later deletes these pieces. There have been talks about removing the TTL DB and using garbage collection instead. Don’t panic. I already provided the feedback… I believe garbage collection is already off the table. For now let’s say further benchmarks are needed, and I would expect that garbage collection is too expensive for a TTL-heavy use case. I believe the latest benchmark is also batching the writes into the SQLite DB. I take that as further evidence that we can keep the advantages of a TTL DB over garbage collection.
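For comparison, here is a hedged sketch of the cleanup side of a TTL DB (again with invented table and column names): expired pieces can be looked up directly and deleted, while garbage collection would have to walk every piece the node stores.

```go
// Sketch of a TTL cleanup pass: ask the TTL DB which pieces have expired and
// delete only those files, instead of scanning the whole piece store the way
// garbage collection does. Table and column names are invented for illustration.
package sketch

import (
	"database/sql"
	"os"
	"time"
)

// cleanupExpired assumes db is a *sql.DB opened with an SQLite driver and
// pathFor maps a piece ID to its file on disk.
func cleanupExpired(db *sql.DB, now time.Time, pathFor func(pieceID string) string) error {
	rows, err := db.Query(`SELECT piece_id FROM piece_expirations WHERE expires_at <= ?`, now)
	if err != nil {
		return err
	}
	defer rows.Close()

	var expired []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return err
		}
		expired = append(expired, id)
	}
	if err := rows.Err(); err != nil {
		return err
	}

	for _, id := range expired {
		if err := os.Remove(pathFor(id)); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	// Forget the deleted entries in one statement.
	_, err = db.Exec(`DELETE FROM piece_expirations WHERE expires_at <= ?`, now)
	return err
}
```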
Another alternative would be to use flat text files instead of SQLite, similar to how the storage node manages orders nowadays. I don’t know how far we have tested this alternative or how difficult it would be to implement.
Those are the top 3 so far. There are a few more experimental changes. I will try to post updates from time to time. However, posting here isn’t my first priority, so you will get the information with a delay. My first priority will be to run the benchmark tests on my storage node and keep the feedback loop as fast as possible. The faster I can return some numbers, the more time the developers have to improve things. I still hope I can find a few minutes here and there to keep you updated.
Ok, now to the benchmark tool that you are all waiting for. We don’t have binaries and it isn’t even merged yet. The pull request still gets updated frequently. So the installation is a bit more tricky this time:
git clone https://github.com/storj/storj
cd storj
git fetch https://review.dev.storj.io/storj/storj refs/changes/53/13053/16 && git checkout FETCH_HEAD
go install ./cmd/tools/piecestore-benchmark/
And now the execution:
cd /mnt/hdd
mkdir benchmark
cd benchmark
piecestore-benchmark -pieces-to-upload 10000
# copy the result and potentially look into the tracefile the tool creates
rm -r *
piecestore-benchmark -pieces-to-upload 10000 --disable-sync
# shows the expected performance gain with fsync disabled. Again, the tracefile can be interesting.
You might have to start with a lower piece count. My target was to hit something between 1 and 10 minutes of runtime.
For fun I also tried different ZFS settings. The benchmark tool is really useful for trying out different settings. But this first version of the benchmark isn’t the final one. I will have to repeat optimizing my system with later, more accurate versions of the benchmark. One more detail: this benchmark uses TTLs for all uploaded pieces, so the results it gives you don’t represent your current performance. The same goes for the maximum performance gain.
There are also some other params to play with. I still need to run the benchmark with higher concurrency and different piece sizes, just to check whether the expected performance gain stays the same across the spectrum or, worst case, whether the benchmark is too far off from the concurrency or piece size a production node would see. By testing out some extreme values I will get a better feeling for which params have a measurable impact on performance and which don’t.