Oh NOOO! - Bloody Updates - Grub broken

So I’ve just come back from a three-week visit to my parents. First day home, OK… I’ll update my VMs as I’ve been away.

(Note: I’ve never updated my Storj server before and in hindsight maybe I shouldn’t have bothered today either…)

What happens when the VM restarted? I’m dumped at the GRUB prompt…

I googled for possible solutions, which told me to issue the ‘set’ command to see the environment variables and ‘ls’ to list the available drives and partitions. But here’s the problem: I followed these ‘commands’ and may have made things worse.

Commands/Instructions I followed - https://phoenixnap.com/kb/grub-rescue

Screenshot1 - https://i.postimg.cc/fRJ9wJ7w/ss1-set.jpg
Screenshot2 - https://i.postimg.cc/jSmJCtvy/ss2-brokenit.jpg
Screenshot3 - https://i.postimg.cc/kML4xfWd/ss3-Ready-For-New-Instructions-Please.jpg

So I’m guessing I shouldn’t have copied those instructions ‘exactly’, especially the /sda line.

Any / All help greatly appreciated! Kind regards, ThisLinuxIdiot

I’m not the biggest linux expert myself, but maybe you can provide some more detail on which exact commands you ran?

You may also want to run fsck on your HDD. It doesn’t feel very likely to me that your grub setup has suddenly changed, but file system issues could prevent your partition from mounting and are unfortunately very common.

Thank you for your swift reply, BrightSilence. Obviously I’m aware that downtime is frowned upon, so I really want to sort this out ASAP!

https://i.postimg.cc/QMPQ1cbY/SS4-Commands-IRan.jpg

Can I even run fsck from within Grub? All a bit new to me I’m afraid

did you check if the parameters on this one are actually correct for your setup?

Either way, I don’t think those commands made permanent changes.

I have never run fsck from GRUB, and I don’t know if that’s possible. I would use a live DVD/USB. Since you’re using a VM, you can probably attach an ISO as a virtual CD drive in the VM software. Then pick the option to run your preferred distro without installing, and run fsck from there.
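
A minimal sketch of that live-ISO check, assuming purely for illustration that the root filesystem turns out to be /dev/sda2 (confirm with `lsblk -f` first). The helper names is_mounted and safe_fsck are made up here; the only point is that fsck should not be run on a mounted filesystem.

```shell
#!/bin/sh
# Return success if the device appears in the mounts table.
# $2 lets you point at a different mounts file (defaults to /proc/mounts).
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Run a forced check only when the device is not mounted.
safe_fsck() {
    if is_mounted "$1"; then
        echo "$1 is mounted, refusing to run fsck" >&2
        return 1
    fi
    # -f is handed to the filesystem checker (e2fsck for ext4) to force a full check
    sudo fsck -f "$1"
}

# From the live session (device name is an assumption):
#   safe_fsck /dev/sda2
```

Note that /proc/mounts can list a filesystem under another name (a UUID path or a mapper device), so treat the mount check as a safety net rather than a guarantee.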

I probably can’t be much more help than this. So I’ll leave it to others to respond further.

While that is a good attitude, you do have some time to fix things, since suspension for downtime only kicks in after your uptime falls below 60% over the past 30 days. However, you may lose some data to repair in the meantime. Faster is better, but don’t get too stressed. You have time.
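
To put a number on that, here is the arithmetic as a quick shell sketch. It takes the 60% and 30-day figures from the post above at face value; the variable names are just for illustration.

```shell
#!/bin/sh
window_days=30        # length of the uptime window
min_uptime_pct=60     # suspension threshold quoted above

window_hours=$((window_days * 24))                                    # 720
max_downtime_hours=$((window_hours * (100 - min_uptime_pct) / 100))   # 288

echo "Downtime budget: ${max_downtime_hours} hours ($((max_downtime_hours / 24)) days)"
# prints "Downtime budget: 288 hours (12 days)"
```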

Thanks again BrightSilence, as for checking my variables… to the best of my knowledge…

If I’m honest, I don’t know what I’m supposed to be inputting into the root= variable?

I’ve just tried root=(hd0,gpt2) ro and that didn’t go well either…

https://i.postimg.cc/3Jz5WqMP/Nope.jpg

The problem is, I can follow instructions that are technically beyond me, but when something goes wrong, I’m at a loss.

Thanks for your reassurance regarding the downtime, certainly appreciated. Thanks again for your replies… hopefully somebody will advise me sooner rather than later. Kind regards.

more info needed…

That looks like Proxmox ? If it’s not proxmox, what are you running and which version, along with disks and how you assign them.

  • at the grub prompt, can we have just what is returned from;

ls

  • then for each partition, a screen of what is returned - will help locating where your root partition has gone, trailing slash needed…

ls (hd0,gpt2)/
ls (hd0,gpt1)/

#edit - sorry, looking at your screenshots :stuck_out_tongue: so (hd0,gpt2) = sda2, so try:

set root=(hd0,gpt2)
linux /boot/vmlinuz-4.19.0-21-amd64 root=/dev/sda2
initrd /boot/initrd.img-4.19.0-21-amd64
boot

if that boots :stuck_out_tongue: then run

sudo update-grub
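
For reference, the (hd0,gpt2) = sda2 mapping used above can be written out as a small shell helper. This is only a sketch: grub_to_linux_dev is a made-up name, and it assumes sdX-style naming and that GRUB sees the disks in the same order as the kernel, which is not guaranteed.

```shell
#!/bin/sh
# Translate a GRUB device name such as "(hd0,gpt2)" into the Linux
# device it usually corresponds to ("/dev/sda2").
grub_to_linux_dev() {
    spec=$(printf '%s' "$1" | tr -d '()')    # "(hd0,gpt2)" -> "hd0,gpt2"
    disk=${spec%%,*}                         # "hd0"
    disk_num=${disk#hd}                      # "0"
    case $spec in
        *,*) part_num=$(printf '%s' "${spec#*,}" | tr -cd '0-9') ;;  # "gpt2" -> "2"
        *)   part_num='' ;;                  # whole disk, no partition
    esac
    # GRUB counts disks from 0 (hd0 -> sda); partitions count from 1.
    letter=$(printf 'abcdefghijklmnopqrstuvwxyz' | cut -c "$((disk_num + 1))")
    printf '/dev/sd%s%s\n' "$letter" "$part_num"
}

grub_to_linux_dev '(hd0,gpt2)'   # prints /dev/sda2
```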

Hi

Have you managed to sort this out yet?

which VM environment are you running in ?

By the looks of things the kernel doesn’t want to run, which is such a vague error from Linux.

could you check if the boot orders have changed at all in the VM ? has the RAM changed ? any allocated drives to the VM seen as ‘busy’ by the host ? or inaccessible ?

Hass

vfs… hmmm is this a container?
i got some linux containers running which has been very confusing for docker… because i tried running all sorts of different storage drivers in docker, and they always fail… the only thing that works reliably is the VFS driver

and yours seems to be trying to mount VFS, so i’m guessing it’s a container and not really a vm, not that it makes a ton of difference… nor helps in any way… sorry

but yeah i digress…

i’ve had good luck running live cd’s or similar installers for solving this kind of issue.
won’t always work tho…

are you running your vm’s with cache on their storage… that can easily cause stuff like this… if you run stuff like writeback (unsafe) in proxmox vm’s

You can try booting an older kernel version and see if that works.
Some Linux distros also come with a recovery entry in the GRUB menu.

Apologies guys I’ve been very busy and this is literally the first chance I’ve had to get online in 3 days… Thank-you all so much for your replies. Your help and guidance is really appreciated.

Gold star goes to CutieePie (living up to his/her name) as you’ve solved it. Thank-you very much.

Yes, a little background: I’m running a ProxMox VM, I’ve not tampered with the server (i.e. no hardware changes) in months, and I’ve been away looking after my father for the past three weeks. As previously stated, once I came back home I decided to fully update all my ProxMox VMs and, in doing so, managed to bugger up my Storj node.

So my final question, will this happen again (down the line or in the future when I update)?

Looking at the code I’ve just run:

set root=(hd0,gpt2) - Sets the environment variable of root
linux /boot/vmlinuz-4.19.0-21-amd64 root=/dev/sda2 - Sets the kernel version and boot partition??
initrd /boot/initrd.img-4.19.0-21-amd64 - Loads the kernel into ramdisk??
boot - Initiates the boot process

Would that be a kind of correct interpretation?
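
That reading is close, with one correction: initrd loads the initial ramdisk (a small temporary root filesystem with the drivers needed to reach the real disk), not the kernel itself. As a sketch, here is a made-up shell helper, grub_manual_boot, that prints the same four commands for any kernel version, with what each one does spelled out in the comments.

```shell
#!/bin/sh
# Print the four GRUB commands used in this thread.
#   $1 = kernel version, $2 = GRUB partition, $3 = Linux root device
grub_manual_boot() {
    # set root=...  tells GRUB which partition to read the files below from
    # linux ...     loads the kernel image; root=... is passed to the kernel
    #               so it knows which device to mount as /
    # initrd ...    loads the initial ramdisk that goes with that kernel
    # boot          starts the kernel that was just loaded
    cat <<EOF
set root=$2
linux /boot/vmlinuz-$1 root=$3
initrd /boot/initrd.img-$1
boot
EOF
}

grub_manual_boot 4.19.0-21-amd64 '(hd0,gpt2)' /dev/sda2
```

Nothing typed at the GRUB prompt is saved, so these commands only affect the current boot.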

congrats on getting it working again.
running proxmox myself, the only issues i’ve had with updates or such things like random power offs and such, was due to how i ran the storage cache on the vm…
these days i leave it on default, because i just don’t want to deal with the bullshit that happens when running cache / writeback.

i’ve updated my proxmox vm’s and containers many times without issues, but i’ve also had issues… it’s not always clear why an issue arises. i can recommend using good storage setups, so that stuff doesn’t linger in RAM too long.

zfs seems really nice for VM storage, using a SSD with PLP for a SLOG device, i like my VM’s stable.

running without cache is a lot slower, so it often ends up depending on the use case.
also using the backup features of proxmox can be quite nice, which is why it’s often a good rule to keep the boot disk small, so it’s easy to back up.

that being said, it does seem more like this issue was due to some sort of poor configuration, using the SDA / SDB / SDC… names i don’t really like, as they can change in linux when one starts having many drives…

so yeah i dunno what exactly went wrong for you.
but it’s certainly possible that it was something similar. for my zfs pool i’ve been using /dev/disk/by-id/…
names to define my storage devices, that way they will never get mixed up…
don’t think that was what happened… because your configuration seemed “fine”, but clearly it wasn’t, so somewhere something must have gone wrong.
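
To see which stable name belongs to which sdX device, something along these lines works on most Linux systems. It is just a sketch; resolve_disk_link and list_by_id are made-up helper names.

```shell
#!/bin/sh
# Resolve a /dev/disk/by-id/... symlink to the device node it points at.
resolve_disk_link() {
    readlink -f -- "$1"
}

# Print every by-id name next to its current sdX (or similar) device.
list_by_id() {
    for link in /dev/disk/by-id/*; do
        [ -e "$link" ] || continue      # skip when the directory is empty or absent
        printf '%s -> %s\n' "${link##*/}" "$(resolve_disk_link "$link")"
    done
}

# Usage:
#   list_by_id
```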

i mean GPT2 isn’t exactly random data…
one can also use GPT identifiers / names to define physical storage… which is kinda cool, the reason i don’t like using that for ZFS is because, if the drive is corrupted and the GPT name is lost or removed, ZFS will not recognize the storage media and try to reintegrate it.

ofc locking storage media into a pool using hardware identifiers comes with its own issues, but i kinda like that it limits how much damage i can do.
even if it is rather annoying to work with at times, it does make me less able to kill my pools.
because i do stupid and dangerous stuff if it is the easy path forward lol

but i digress, to make a long story short…
make sure your VM disk cache is configured for default, in the Proxmox VM hardware disk tab.
it’s what has given me the most grief thus far.
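
To check this from the Proxmox host shell rather than the GUI, something like the following should do. It is only a sketch: VM ID 100 is a placeholder, and disk_cache_mode is a made-up helper that reads the cache= option out of a disk line from `qm config`, treating a line without cache= as the default.

```shell
#!/bin/sh
# Extract the cache mode from one disk line of `qm config <vmid>`.
disk_cache_mode() {
    case $1 in
        *cache=*)
            mode=${1#*cache=}               # drop everything up to "cache="
            printf '%s\n' "${mode%%,*}"     # keep up to the next comma
            ;;
        *)
            printf 'default\n'              # no cache= option on the line
            ;;
    esac
}

# Usage on the Proxmox host (VM ID 100 is hypothetical):
#   qm config 100 | grep -E '^(scsi|virtio|sata|ide)[0-9]+:' | while read -r line; do
#       printf '%s => %s\n' "${line%%:*}" "$(disk_cache_mode "$line")"
#   done
```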

maybe CP has a better idea of what happened.
i’m still only like 3 years into this stuff, so far from an expert.

Thanks for your kind reply SGC, yeah I’ve never run caches… well, not since the days of Win95 anyway! My hardware isn’t the greatest (nor is my budget!) so I kinda make do!

https://i.postimg.cc/RhVt3rJ3/ProxMox.jpg

Thank you for all of your help and guidance CP, I know I’ve a lot to learn, but it’s better than the alternative (Windows… lol)

OK, so the power went off today… the UPS took the load… until it didn’t

Upon restarting my server I have exactly the same problem…

…the solution still works (thankfully)

But my question is… How can I make this a permanent change so I don’t manually have to keep doing this?

Any help will be greatly appreciated guys, thanks in advance.

Usually update-grub takes care of that, did you run it from shell after you were able to boot into it?

So you’re also saying you didn’t reboot for 10+ months? Or did it work in the time between and has only now happened again? In the latter case, if possible, install something like the NUT service to power down your server in time.

Hi JWvdV and thanks for your reply!

A (Grub) update, I believe, caused the initial issue in the first place. I’ve been too scared to update since then :exploding_head: So no, I’ve not run it from the shell. Should I? (Sorry, linux idiot here)

Correct, no reboot for 10-odd months. Working fine in between. My UPS is/was pretty good and I used to get three hours out of it. Not had any significant electrical outages until today. Found my UPS only just lasts an hour now… but I guess that’s to be expected… nothing lasts forever.

So server powered off ungracefully and when I started it back up, and loaded my Storj VM I was faced with the grub prompt.

My server is on 24/7 so shouldn’t need to use that nut service thingie…

You must, in order to prevent this issue from happening again, because it regenerates the GRUB configuration that boots your machine, which is what you’re currently doing by hand.
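
A rough sketch of that step, assuming a Debian-style layout where the generated config lives at /boot/grub/grub.cfg. The helper grub_cfg_has_kernel is made up for illustration; it just checks that the regenerated config has an entry for a given kernel.

```shell
#!/bin/sh
# Return success if the GRUB config references the given kernel version.
#   $1 = kernel version, $2 = config path (Debian default assumed)
grub_cfg_has_kernel() {
    grep -qF -- "vmlinuz-$1" "${2:-/boot/grub/grub.cfg}"
}

# After booting manually from the GRUB prompt:
#   sudo update-grub
#   grub_cfg_has_kernel "$(uname -r)" && echo "grub.cfg lists the running kernel"
```

If the VM still lands at the GRUB prompt after this, reinstalling the bootloader with grub-install may also be needed; that goes beyond what was tried in this thread, so treat it as an assumption.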

The NUT service can prevent the ungraceful shutdown by issuing a shutdown command when your UPS has discharged to a certain level, 15% for example, which helps prevent data loss and so on.
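
To illustrate the idea, here is a minimal polling sketch built on NUT’s upsc client. NUT’s own monitor (upsmon) normally handles this for you, so this is not how NUT itself is configured; the UPS name "myups@localhost" and the helper names are placeholders, and 15% is just the example figure from above.

```shell
#!/bin/sh
# True when the charge (percent) is at or below the threshold (default 15).
battery_is_low() {
    charge=${1%.*}                  # drop any decimal part, e.g. "14.0" -> "14"
    [ "$charge" -le "${2:-15}" ]
}

# Poll once; power off only if running on battery AND the charge is low.
# "myups@localhost" is a placeholder for the name set in your NUT config.
check_ups() {
    status=$(upsc myups@localhost ups.status 2>/dev/null) || return 1
    level=$(upsc myups@localhost battery.charge 2>/dev/null) || return 1
    case $status in
        *OB*)                       # OB = on battery, OL = on line power
            if battery_is_low "$level"; then
                sudo shutdown -h now
            fi
            ;;
    esac
}

# Usage (e.g. from a cron job or a loop with sleep):
#   check_ups
```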

Might have prevented the issue altogether in the first place…

OK, both really interesting suggestions. Thank you, your advice is greatly appreciated!