Currently I’m facing a serious problem with one of my STORJ machines. After I set everything up, it runs fine for hours. Then something unknown happens and the entire system crashes. The load shown in htop rises to 200, then 400, and the server is no longer accessible through SSH.
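Since the box becomes unreachable once the load climbs, it can help to have a record of the last load values on disk before the hang. Below is a minimal sketch, assuming the 200/400 figures are the load averages that htop displays. The function names and the log path are hypothetical, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import time


def format_sample(load, now):
    """Return one log line: local timestamp plus 1/5/15-minute load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"{stamp} load1={load[0]:.2f} load5={load[1]:.2f} load15={load[2]:.2f}"


def log_load(path, interval=10.0, samples=None):
    """Append load averages to `path`, forcing every line to disk.

    `samples=None` means: keep going until the process is stopped.
    """
    count = 0
    with open(path, "a") as fh:
        while samples is None or count < samples:
            fh.write(format_sample(os.getloadavg(), time.time()) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push the line to disk so it survives a hard hang
            count += 1
            if samples is None or count < samples:
                time.sleep(interval)
```

Calling `log_load("/var/tmp/loadavg.log")` (the path is only an example) in a tmux or screen session would leave a timestamped trail showing when the load started to climb. If the affected disk is the one that hangs, the final lines may still be lost, so writing to a different disk is the safer choice.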
Random system tests, GO!
-check all filesystems, RAM, CPU
-driver compatibility, hardware compatibility, chipset driver updates, even the graphics driver update.
-BIOS update already done? I think.
-broken cables, SMART, and I’m out of ideas.
It’s not a BSOD. It runs for hours and at some point, like at this ext4_mb_regulator event, it hangs and crashes. I’ve done a lot of random system tests. I’m looking for something more precise.
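One way to get something more precise than random tests is to look at what the kernel logged just before the hang. The sketch below filters a saved kernel log for lines that mention the filesystem, the NVMe driver, I/O errors, or stalled tasks. The keyword list is an assumption about what is worth flagging, not an official list, and the function names are hypothetical.

```python
import re

# Assumed keywords; extend or trim them to match what the log actually contains.
PATTERN = re.compile(
    r"ext4|nvme|i/o error|call trace|blocked for more than",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Return the lines that match any keyword, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan_file(path):
    """Read a saved kernel log and return only the suspicious lines."""
    with open(path, errors="replace") as fh:
        return suspicious_lines(fh)
```

Assuming journald keeps logs across reboots, the kernel messages of the previous boot can be saved with `journalctl -k -b -1 > prev-boot.log` and then checked with `scan_file("prev-boot.log")`. If the system disk itself stalls, the last messages may never reach the journal, so an empty result does not rule the disk out.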
It looks like an incompatibility issue with some controller. It just happens while the ext4 driver is working.
You may try to check and fix corrupted filesystems on your disks, but it looks like the kernel panic will not go away.
The thing you may try is to use a newer kernel (or an older one), but I guess that would be easier on Ubuntu than on Debian.
It is also possible that something is corrupted on the system volume. This usually happens with SD cards or similar crappy storage. In that case it could help to reinstall every single package (starting with everything related to ext4) or re-flash the system.
It is probably the M.2 NVMe. I did various re-flashes and fresh installations of Debian 12.2. Yesterday evening I swapped the SSD for another M.2 NVMe and so far it looks fine. I will have to wait and see whether it runs for days.