Currently I’m facing a serious problem with one of my STORJ machines. After I set everything up, it runs fine for hours. Then something unknown happens and the entire system crashes. The load average in htop rises to 200, then 400, and the server is no longer accessible through SSH.
dmesg throws the following error message:
Does anybody know what to do or how to identify the issue?
Thanks and kind regards,
Random system tests, go!
-Check all filesystems, RAM, CPU.
-Driver compatibility, hardware compatibility, chipset driver updates, even a graphics driver update.
-BIOS update already done? I think.
-Broken cables, SMART, and I’m out of ideas.
Looks like a BSOD to me.
It’s not a BSOD. It runs for hours and at some point, like this ext4_mb_regulator event, it hangs and crashes. I’ve done a lot of random system tests. I am looking for something more precise.
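For something more precise about when the hang starts, a small load logger might help. The sketch below is only one way to do it in Python, and the function names (format_sample, is_climbing, log_load), the 10-second interval, and the threshold of 50 are all made up for illustration. It appends the load average with a timestamp to a file and syncs every line to disk, so the last entries before a hang can be compared with the timestamps in dmesg.

```python
import os
import time
from datetime import datetime, timezone


def format_sample(load, when):
    """Format one (1 min, 5 min, 15 min) load sample as a log line."""
    one, five, fifteen = load
    return f"{when.isoformat()} load1={one:.2f} load5={five:.2f} load15={fifteen:.2f}"


def is_climbing(samples, threshold):
    """True if the newest 1-minute load is above threshold and above the previous sample."""
    if len(samples) < 2:
        return False
    return samples[-1] > threshold and samples[-1] > samples[-2]


def log_load(path, interval_s=10.0, threshold=50.0, max_samples=None):
    """Append load samples to `path` until max_samples is reached (forever if None)."""
    history = []
    taken = 0
    with open(path, "a", encoding="utf-8") as fh:
        while max_samples is None or taken < max_samples:
            load = os.getloadavg()  # Unix only
            line = format_sample(load, datetime.now(timezone.utc))
            history = (history + [load[0]])[-2:]  # keep only the last two samples
            if is_climbing(history, threshold):
                line += " CLIMBING"
            fh.write(line + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push to disk so the last lines survive a hang
            taken += 1
            if max_samples is None or taken < max_samples:
                time.sleep(interval_s)
```

Calling log_load("/var/log/loadwatch.log") (the path is just an example) would run until the machine stops. If the disk that is being written to is the one that stalls, the fsync call may block too, so logging to a different disk is probably the safer choice.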
By the way, your mainboard is not officially Linux compatible. Do you have or know of one where it works?
Disable Ultra Fast Boot and enable CSM?
Looks like an incompatibility issue with some controller. It just happens to show up while the ext4 driver is working.
You may try to check and repair corrupted filesystems on your disks, but it looks like the kernel panic will not go away.
One thing you may try is a newer (or an older) kernel, but I guess that would be easier on Ubuntu than on Debian.
It is also possible that something is corrupted on the system volume. This usually happens on SD cards or similar crappy storage. In that case it could help to reinstall every single package (starting with everything related to ext4) or to re-flash the system.
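To tell a controller or disk problem apart from a filesystem or memory problem, it may help to count which subsystems show up in the kernel log before the hang. The following is a rough sketch, not a diagnostic tool: the keyword patterns and the names PATTERNS, classify, and summarize are assumptions chosen for illustration, and they would need adjusting to what your kernel actually prints.

```python
import re
from collections import Counter

# Assumed keyword patterns per subsystem; illustrative, not exhaustive.
PATTERNS = {
    "filesystem": re.compile(r"ext4|jbd2", re.IGNORECASE),
    "nvme": re.compile(r"nvme", re.IGNORECASE),
    "memory": re.compile(r"edac|machine check|mce:|out of memory", re.IGNORECASE),
    "hung_task": re.compile(r"blocked for more than \d+ seconds", re.IGNORECASE),
}


def classify(line):
    """Return the names of all subsystems whose pattern matches the line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count how many log lines mention each subsystem."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts
```

One way to use it would be to save the kernel log of the crashed boot to a file (for example from journalctl -k -b -1, if the journal is persistent) and pass the open file to summarize. Many nvme or hung_task hits next to the ext4 lines would point at the drive or controller rather than at the filesystem itself.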
Probably it is the M.2 NVMe. I did various re-flashes and new installations of Debian 12.2. Yesterday evening I swapped the SSD for another M.2 NVMe and so far it looks fine. I will have to wait and see whether it runs for days.
Sometimes a firmware update of the SSD might fix it.
It seems like I got it fixed. No errors for over 24 h. So it was likely a RAM issue: the modules had to be cleaned and reseated.