Hi, guys! I want to thank you for the overwhelming amount of feedback. I didn’t expect it, and it’s really delightful and motivating! I continue to publish translated posts from my blog.
How to destroy 100 TB of data
After I finally managed to move off the first server, I started actively setting up hardware at Hetzner, and in my opinion the most successful series for storage is the SX13x. It has almost everything I need: a large amount of RAM, two fast NVMe drives for the system and metadata, and a whopping 10 hard drives of 16 terabytes each. From the very beginning I got hooked on Proxmox, even though I had no idea what that system was before I got the storj server.
By the time the Chia hard-drive mining boom started, I had accumulated around 120 terabytes of data on the storj network, and I was in the middle of migrating data from one of the old servers to the new one. This year was marked by two major failures for me:
I managed to corrupt 100 terabytes of storage data during the migration.
Missed opportunities during the Chia frenzy.
How I corrupted 100 terabytes:
Nothing foreshadowed trouble. As usual, I was migrating nodes in a semi-manual mode, experimenting with different ZFS send/receive parameters. I quickly noticed that with the default parameters, zfs send generated much more traffic than the amount of data actually written; in some cases the transmitted volume was three to four times larger than the data on disk. That didn’t suit me, because monitoring progress against such numbers was simply impossible. After reading the documentation, I found that the combination of zfs send parameters “-Lec dataset@snapshot” produces a stream whose size practically matches the dataset’s “written” property, so the progress of sending a snapshot now reflected reality quite accurately.

Sending the first snapshot takes a considerable amount of time, during which a significant delta accumulates. In my case, transferring the first snapshot took 3-4 days, depending on the size of the node; smaller nodes, of course, moved faster. Here’s what I did: I launched a batch of 3-4 nodes to fully saturate the gigabit link, waited for the initial transfer to complete successfully, stopped the nodes, sent the incremental delta, and once it finished, started the nodes on the new host. The downtime usually lasted a couple of hours, give or take.
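In command form, the workflow looked roughly like this (the host, pool, dataset and snapshot names are placeholders, and the sketch below is an outline of the procedure rather than a transcript of the exact commands I ran):

```
# Snapshot and (optionally) estimate the stream size first; with -Lec the
# estimate tracks the dataset's "written" property far more closely than a
# default send does:
zfs snapshot tank/node01@migr1
zfs send -nvP -Lec tank/node01@migr1

# Initial full transfer while the node keeps running:
zfs send -Lec tank/node01@migr1 | ssh newhost zfs receive -u tank/node01

# Once the full stream has landed: stop the node, snapshot again,
# and send only the accumulated delta:
zfs snapshot tank/node01@migr2
zfs send -Lec -i @migr1 tank/node01@migr2 | ssh newhost zfs receive -uF tank/node01

# Then start the node on the new host.
```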
After verifying through the logs that the nodes had started successfully and there were no errors, I initiated the migration of the next batch. This went on for about two weeks, and nothing foreshadowed trouble… until one evening I received a bunch of disqualification emails. I couldn’t believe my eyes. I started investigating: on the first batch of nodes audits were failing catastrophically - some nodes still had successful audits, while on others the scores were right on the edge. It was clear that this was just the beginning of an avalanche. An apocalypse awaited me. Every day I received more and more disqualifications. Everything I had been working on for the past year and a half was crumbling before my eyes, and I couldn’t do anything about it. It was a complete disaster.
I posted on the forum, reached out to @Alexey, and went through the logs to check for failed audits - everything was clean, there were no failed audits. I started talking to the guys in the ZFS channel on Telegram (there are many amazingly helpful people there, I recommend it), but I couldn’t figure out what the problem was: there were no data transfer errors, the snapshots had been transmitted without issues, and scrubbing the pools also showed no problems. I started reproducing the migration once again to identify the issue and came across a reproducible bug when sending an incremental snapshot with those infamous “-Lec” parameters.
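The check that finally exposed it can be sketched on a single host with a scratch pool; the pool, dataset and snapshot names below are placeholders, and whether the corruption actually triggers depends on the pool layout and OpenZFS version, so treat this as an outline of the verification rather than a guaranteed reproducer. The key point is that a scrub only verifies each block against the checksum stored when that block was written, so data that zfs receive wrote incorrectly (but checksummed consistently) looks perfectly healthy to it; the only reliable test is comparing actual file contents:

```
# Full send of the first snapshot into a second dataset:
zfs snapshot tank/src@one
zfs send -Lec tank/src@one | zfs receive tank/copy

# Let files keep changing for a while, then snapshot again and send the delta:
zfs snapshot tank/src@two
zfs send -Lec -i @one tank/src@two | zfs receive -F tank/copy

# The @two snapshot must be bit-identical on both sides, so hash it through
# the hidden .zfs/snapshot directory; any mismatch was introduced by
# send/receive even though both datasets scrub clean:
( cd /tank/src/.zfs/snapshot/two  && find . -type f -print0 | sort -z | xargs -0 sha256sum ) > /tmp/src.sums
( cd /tank/copy/.zfs/snapshot/two && find . -type f -print0 | sort -z | xargs -0 sha256sum ) > /tmp/copy.sums
diff /tmp/src.sums /tmp/copy.sums && echo identical
```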
Here’s what happened in the end: since the migrating batch consisted of nodes of different sizes, the smaller nodes finished their full transfer first and kept running while the larger ones caught up, so they accumulated a proportionally larger delta, and after the incremental send that delta contained more corrupted data. It looked like this: two servers with identical datasets and snapshots; on the source, all files both in the live filesystem and under the snapshots were readable and undamaged, but on the receiving server, after the incremental send, all the files that had changed during the accumulated delta were corrupted (while they were still intact under the first snapshot). The nodes that migrated faster and then sat waiting for the larger ones to finish were disqualified first, followed by the larger ones, and so on. By the time the disqualifications began, it was impossible to do anything, because the source had already been removed. Two weeks had passed from the completion of the first migration to the first disqualification.
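Had I run a content-level comparison between the two hosts before destroying the source, this would have shown up immediately. One way to do it (host names, dataset and snapshot names are again placeholders, and this is only a sketch) is a checksum-based dry run with rsync against the same snapshot on both machines, browsable through the hidden .zfs/snapshot directory:

```
# Run on the new host: list every file whose content differs from the old
# host's copy of the same snapshot (-r recursive, -c compare by checksum,
# -n dry run, so nothing is actually copied):
rsync -rcn --out-format='%n' \
    oldhost:/tank/node01/.zfs/snapshot/migr2/ \
    /tank/node01/.zfs/snapshot/migr2/
# Any file listed here is corrupt on the receiving side, even though the
# transfer logs and the scrubs reported no errors.
```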
Special thanks to everyone who supported me during this difficult time. Some even shared resources from their backup fund. I want to say a huge thank you, guys, you really supported me. I wouldn’t say I was devastated, but it was still very sad. Out of 120 terabytes, only 20 remained after the migration.
The Chia frenzy was ahead… It was an exhilarating time, and I was extremely motivated. I made several new acquaintances and gained very interesting experience. The storj fiasco quickly lost its relevance, and I fully immersed myself in Chia…
The next post on my blog describes my adventures in the realm of Chia. I’m not sure whether it’s worth publishing the translation here, since it’s not directly related to storj. However, many of us, if not all, are enthusiasts in one way or another, and perhaps many would be interested in reading about petabyte-scale plotting and everything related to it. If you’re interested, let me know with a reaction. Thank you!