Graceful Exit - Wanted to thank you all

Just started GE on my last node - removing the final 8TB from my home-lab, so hopefully it benefits you all. Happy New Year!!!

It always worked well on my platform, always near or at 100%, and I’ve had full nodes.

Wanted to thank everyone on this forum for the help when I have needed support - truly knowledgeable people. It’s a great project, I will continue to read the forums, and I may be back as an operator in the future.

Take care all

CC

14 Likes

Good luck, and definitely come check on us every 6 months or so… to see what has been going on!

(And for other people with large nodes they want to GE - remember, finding new homes for them instead may make better economic sense :money_mouth_face: )

2 Likes

This is the end
My only friend, the end

Hey, what made you decide to leave?

1 Like

Sell now and come back later makes sense to me. Price level for HDDs is just insane atm, this bubble will burst one day.

2 Likes

My lab is more than just StorJ - I want to take things in a new direction, and I no longer want the responsibility of being always on, 100% uptime and connected, for very little reward.

I maintain it like a production system, and I miss being able to tinker, experiment, and break stuff in my lab. I have big plans for 2026, and SNO responsibilities are not part of that in the short term. After the rebuild - maybe I will be back.

CC

8 Likes

I can completely sympathize with the desire to break stuff freely :slight_smile:

4 Likes

I’m selling old used 18TB drives at around £180. Is that high or low?
I did get £120 for an 8TB. That seemed high.

1 Like

It used to be between €9 and €10 per TB around here for the last few years for bigger capacities.
If someone was in a hurry to sell and they were SAS, for example, you could get them for like €8/TB in bigger capacities (16, 18, 20), often with a manufacturer’s warranty.
Usually used SAS were cheaper than used SATA.
The used prices are however also starting to go up, probably as sellers are realizing what is happening.
Those 16TB Toshibas mentioned in the other thread were €220 apiece new, some years back. Now the cheapest new 16TB is €400.
The SATA SSDs I bought for €230 apiece a year and a half back are now over €800.
Refurbed WDs went up by $90 in about a month - a 30% price increase.
So for a local sale, £180 is an OK price I would say, considering all of what is happening.

3 Likes

Wow! I just checked the HDD prices! They’re worth double what I paid for them. They are like wine or rare cars.

5 Likes

RAM: $25/GB.
HDD: $15/TB HM-SMR, $20/TB CMR.

Holy shit on a stick indeed.

2 Likes

My two cents: I’ve slowly converged the other way.

I now like building ultra boring, ultra reliable systems. Stuff that just sits there quietly and does its job for years.

Let others break things. I already did my share of that. I get significantly more satisfaction from building systems that simply don’t break.

Outrageously boring systems are extremely rewarding to me.

Different phases, I guess.

6 Likes

I don’t know much about electronics, but what are the things that break first, what is the maximum life span, and how do you choose the most reliable parts?
I know that capacitors tend to go quicker than other stuff. Chips with bigger transistors are better than smaller ones. And of course, server parts, not consumer parts. This is the most I can think of…

1 Like

I was just driving to SFO. AI billboards everywhere. AI this, AI that, a vomit-inducing dot-com effect across the county. Inside the airport — massive flashing displays about the same shit.

This is my favourite topic :slightly_smiling_face:

Short answer: parts matter, but architecture matters more.

Reliability design starts with accepting that failure is inevitable. The question is where, how often, and what happens next.

Hardware:

Most components follow a bathtub curve. Early failures are manufacturing defects. The long flat middle is normal operation. End-of-life failures are wear-out. My policy of buying only used hardware, especially hard drives, stems from the desire to stay as far away from the left edge as possible.

Capacitors, fans, HDDs, connectors, and PSUs dominate the tail. Semiconductors rarely die on their own if kept within specs.

Server parts are typically better binned and validated — tighter voltage and thermal margins, longer burn-in, lower allowed defect rates. That reduces early failures and variance. It does not change wear-out physics or make failure go away.

Software:

Most outages are software-induced. Updates, config drift, state corruption, operator error. I don’t update software unless I have to. On topic — TrueNAS Core is thankfully dead, so I can continue using it without being bothered by “updates”. The failures are known and documented. Workarounds are in place. It works.

Software doesn’t wear out, but complexity accumulates. Reliability comes from minimizing moving parts and state.

Resilience:

Assume things will fail.

Design should be focused on making those inevitable failures boring, localized, and recoverable.

Fail fast. Detect early. Avoid undefined states.

Redundancy:

Redundancy only helps if failures are independent.

That’s why I find RAIDZ3 outrageously stupid. If that many disks fail at once, you’re no longer dealing with uncorrelated failures. You’re dealing with a shared cause, and therefore piling on parity past RAIDZ1 mostly buys feel-good math, not actual reliability.
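
To put rough numbers on that, here is a minimal sketch in Python. The failure probabilities are made up for illustration (2% per-disk chance of dying in some window, 0.1% chance of a shared cause killing the whole vdev), so treat it as the shape of the argument, not a claim about any real pool:

```python
from math import comb

def p_vdev_loss(n_disks, parity, p_disk, p_common=0.0):
    """Chance the vdev is lost: either a shared cause takes out everything,
    or more than `parity` disks fail independently."""
    independent = sum(
        comb(n_disks, k) * p_disk**k * (1 - p_disk)**(n_disks - k)
        for k in range(parity + 1, n_disks + 1)
    )
    return p_common + (1 - p_common) * independent

# Assumed numbers, purely illustrative: a 10-wide vdev.
for parity in (1, 2, 3):
    print(f"RAIDZ{parity}: independent only {p_vdev_loss(10, parity, 0.02):.2e}, "
          f"with common cause {p_vdev_loss(10, parity, 0.02, 0.001):.2e}")
```

With those assumed numbers the independent term shrinks by more than an order of magnitude per parity level, but the total quickly hits the common-cause floor, which extra parity does nothing about.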

Recovery:

This is where systems actually become reliable.

Rebuilds you can repeat without thinking.

State kept to a minimum, ideally nowhere important.

Backups that have actually been restored at least once.

That’s also why I like TrueNAS. The boot drive does not matter. The entire system config lives in a single SQLite DB. Reinstall (remotely, via IPMI disk mounting feature), restore config, done.
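
As a sketch of how small that state actually is, something like this keeps dated copies of the config DB around. The paths are assumptions (on Core the config has historically lived at /data/freenas-v1.db, and /mnt/tank/backups is just a hypothetical destination dataset), and the built-in config export in the UI does the same job manually:

```python
# Minimal sketch: take a consistent, dated copy of the TrueNAS Core config DB.
# Paths are assumptions; adjust to whatever the box actually uses.
import sqlite3
from datetime import date

SRC = "/data/freenas-v1.db"                                      # Core config DB (assumed path)
DST = f"/mnt/tank/backups/config-{date.today().isoformat()}.db"  # hypothetical dataset

src = sqlite3.connect(f"file:{SRC}?mode=ro", uri=True)           # open read-only
dst = sqlite3.connect(DST)
with dst:
    src.backup(dst)   # SQLite online backup: consistent even if the file is in use
src.close()
dst.close()
print(f"config saved to {DST}")
```

Point a cron job at it and recovery really is reinstall, restore, done.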

The goal isn’t heroic uptime; instead you want systems that degrade predictably and return to service without human creativity.

That’s very “boring”. And also satisfying.

1 Like

+1! When I was younger… I’d overclock the CPU as fast as I could. Push the RAM. Push the system clocks. Push the GPU. Maybe apps would occasionally do strange things. Maybe every month or so I’d get a bluescreen. But I didn’t care: I was getting the best-bang-for-my-buck!

But now… shit just has to work… forever. Who cares if I get 10% fewer FPS in the latest games: I know my homelab setups are stable, and I know my desktop won’t crash and take a hundred browser tabs with it :). Stock clocks, stock voltages: no funny stuff.

3 Likes

So after building the dream rig, you put it to the test? Just to verify the assumptions and discover what you missed?

1 Like

Oh I won’t be breaking things deliberately - but I do want to re-stack everything that has been running very reliably since 2019. Switching VMs off and moving to new solutions. TN will get rebuilt, and that’s where I host StorJ.

Rather than stressing about repeated extended downtime, I’m exiting - and doing other stuff with all that disk space. That’s the thing about my hardware; I get to choose how it’s used.

CC

1 Like

Do you have any interesting projects you’d recommend, or are they private ventures? We’re all looking for diversification, after all.

2 Likes

Yes. But what goes before the tests matters more.

I don’t build adventurous systems and then try to validate them. I pick hardware that is already known to behave. Mature platforms, boring chipsets, boring controllers, boring memory configs. No edge cases. No consumer hardware (Realtek LAN, “gaming” motherboards, goofy SSDs without PLP, etc.). JEDEC memory speeds only. No ridiculous XMP profiles or marketing claims.

Then I verify assumptions:

  • RAM: memtest, including rowhammer, multiple passes, long enough to heat cycle a few times. Mixed DIMM populations, reduced speeds. If it only works at the edge, it doesn’t work.
  • Storage: SMART quick check but I don’t give it too much weight. What matters is long burn-in, scrubs under load, resilver tests. Pull a disk on purpose and watch the fallout.
  • Boot: cold boots, warm boots, resets during boot, power loss mid-boot. NVRAM resets, firmware updates, boot order persistence. Repeat with the UPS only partially charged after a previous outage. This is one of the most fragile and under-tested areas.
  • Network: both heavy sustained traffic and very light traffic. Idle links where power management, ASPM, EEE, or firmware bugs tend to surface. Link flaps, retransmits, packet loss anywhere in the path. Driver and firmware versions stay frozen once validated.
  • Services: restart everything repeatedly. Kill things mid-operation. Especially long-running, stateful ops like scrubs, backups, replication, resilvers. That’s where rot accumulates. Nothing should wedge, get into a non-resumable state, or fail in some other way (a minimal restart-soak sketch follows this list).
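
For that last bullet, a dumb restart-soak loop catches most of the rot. A minimal sketch, assuming a FreeBSD-style `service` command; the service names and the settle time are placeholders for whatever the box actually runs:

```python
# Restart-soak sketch: bounce services repeatedly and verify each comes back.
# Service names are placeholders; the health check is just "status returns 0".
import subprocess
import time

SERVICES = ["smbd", "nfsd"]   # placeholders
ROUNDS = 50

def healthy(svc: str) -> bool:
    # service(8) exits 0 when the daemon is reported as running
    return subprocess.run(["service", svc, "status"],
                          capture_output=True).returncode == 0

for i in range(ROUNDS):
    for svc in SERVICES:
        subprocess.run(["service", svc, "restart"], check=True)
        time.sleep(5)   # crude settle time; tune per service
        if not healthy(svc):
            raise SystemExit(f"{svc} did not come back cleanly on round {i}")
print("survived all restart rounds")
```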

Most important part, and the one most people miss: power path.
PSUs, distributors, cabling, connectors (eBay molex is a no-go), UPS behavior (no consumer Cyberpower units), brownouts, recovery after loss. Line quality, grounding, shared circuits. Never skimp on power supplies to begin with. No “gaming” PSUs of any kind.

After that I leave the system alone. No firmware updates, driver updates, OS updates, nothing, unless there is a known bug fix that affects me, and only after weighing the inconvenience of keeping the workaround against having to redo all the validation from scratch.

I’m actually relieved that TrueNAS Core was killed. Now I don’t have to pay attention to updates — it is already stable and will work forever. Just look at the Scale churn. Half the world is on fire after every new update.

Bottom-line litmus test: if the system needs ongoing attention after a few months, it wasn’t built correctly.

1 Like

I wholeheartedly agree on RAIDz3, but I just can’t figure out where I stand on RAIDz2.

With ~10 modern disks, one could either do a wide RAIDz2 or a pair of striped RAIDz1s. One could also do much more, but these are the two that make sense to me.

I personally feel that the risk of a second disk failure during rebuild is low, but it’s not zero, and in a RAIDz1 scenario you’ll always be without any kind of redundancy while the rebuild is going on. Unless a drive decides to outright die overnight, the argument could be made that if you add the replacement disk to the pool and only remove the bad drive once resilvering has finished, one should be OK, but I don’t know, man. I like to be protected at all times, also during rebuild, which points in the direction of RAIDz2.
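
For what it’s worth, the rebuild-window exposure of the two layouts can be compared directly. A rough sketch in Python, assuming one disk has already died and each surviving disk independently has a 1% chance of dying before the resilver finishes (a made-up number, purely for illustration):

```python
from math import comb

def p_at_least(k, n, p):
    """Chance that at least k of n disks fail, assuming independent failures."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.01  # assumed chance a surviving disk dies during the resilver window

# 10-wide RAIDZ2 with one disk already dead: 9 survivors, lost only if 2+ more fail.
raidz2 = p_at_least(2, 9, p)

# 2 x 5-wide RAIDZ1: the degraded vdev has 4 survivors and zero redundancy,
# so any failure there is fatal; the healthy vdev still tolerates one failure.
striped_z1 = 1 - (1 - p_at_least(1, 4, p)) * (1 - p_at_least(2, 5, p))

print(f"10-wide RAIDZ2:    {raidz2:.3%}")
print(f"2 x 5-wide RAIDZ1: {striped_z1:.3%}")
```

Under those assumptions the wide RAIDZ2 is roughly an order of magnitude less likely to lose the pool during the rebuild, which lines up with the “protected at all times” instinct; the usual counter-argument is that two vdevs give you more IOPS.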

I’m a believer in RAIDZ2/6 (compared to RAIDZ1/5). In the scenarios where I’ve had failures, it wasn’t the rebuild time… it was the time before the rebuild could begin.

With corporate failures: we sometimes weren’t even able to schedule a visit to the server for over a week. With personal systems, when HDDs were expensive and funds were low: I was waiting for the mailed RMA process to even have a healthy disk to swap in. Or in both scenarios you may need/want to order an exact replacement (same make/model), and buying online can take a week+.

It was really nice to know we weren’t on-the-edge-of-loss with one dead drive.

Yes, we had backups either way… but still… the price of an extra HDD is almost nothing if it saves you from that stress :wink:. For the capability you get, even with recent price increases, HDDs are cheap!

3 Likes