Server fails to boot

Rob Kampen <rkampen@xxxxxxxxxxxxxxxxx> · Mon, 8 Jul 2019 23:28:05 +1200

First some history. This is an Intel MB and processor, some 6 years old,
initially running CentOS 6. It has 4 x 1TB SATA drives set up as two
mdraid RAID 1 mirrors. It has performed really well in a rural setting
with frequent power cuts: the UPS carries it through, auto shuts down the
server after a few minutes, and auto restarts it when power is restored.

The clients needed a Windoze server for a proprietary accounting package 
they use, thus I have recently installed two SSD drives (500GB each) 
also in a raid 1 mirror and installed CentOS 7 as the host and also 
VirtualBox running Windoze 10. The hard drives continue to hold their 
data files.

This appeared to work just fine until a few days ago. After a power cut 
the server would not reboot.

It takes a while to get in front of the machine, add a monitor, keyboard 
and mouse only to find:

Warning: /dev/disk/by-id/md-uuid-xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx 
does not exist

repeated three times - one for each of the /, /boot, and swap raid 
member sets along with a

Warning: /dev/disk/by-uuid/xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx does not 
exist

for the /dev/md125 which is the actual raid 1 / device.

The system is in a root shell of some sort as it has not made the 
transition from initramfs to the mdraid root drive.

There are some other lines of info and a txt file with hundreds of lines
of boot info, ending with the above info (as I recall).

I tried a reboot - same result. Rebooted and tried an earlier kernel -
same result. Tried a reboot to the recovery kernel and all went well:
the system comes up, all raid sets are up and in sync - no errors.

So, no apparent H/W issues and no apparent mdraid issues, but none of the
regular kernels will now boot.

A blkid shows all the expected mdraid devices, with the UUIDs from the
error messages all in place as expected.
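
A minimal sketch of that cross-check in Python, under an assumption on my
part: that the initramfs is waiting on the arrays named by rd.md.uuid=
options on the kernel command line. The strings below are made-up
placeholders; on the real machine they would come from /proc/cmdline (or
the grub2 entry) and from the output of mdadm --detail --scan.

```python
import re

def cmdline_md_uuids(cmdline):
    """UUIDs named by rd.md.uuid= options on a kernel command line."""
    return {u.lower() for u in re.findall(r"\brd\.md\.uuid=(\S+)", cmdline)}

def scan_md_uuids(scan_output):
    """UUIDs from 'ARRAY ... UUID=...' lines as printed by mdadm --detail --scan."""
    return {u.lower() for u in re.findall(r"\bUUID=([0-9A-Fa-f:]+)", scan_output)}

def unmatched_uuids(cmdline, scan_output):
    """UUIDs the boot line asks for that no assembled array reports."""
    return cmdline_md_uuids(cmdline) - scan_md_uuids(scan_output)

# Hypothetical sample data, not taken from the affected server.
sample_cmdline = (
    "BOOT_IMAGE=/vmlinuz-example root=/dev/md125 ro "
    "rd.md.uuid=11111111:22222222:33333333:44444444 "
    "rd.md.uuid=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd rhgb quiet"
)
sample_scan = (
    "ARRAY /dev/md125 metadata=1.2 name=example:root "
    "UUID=11111111:22222222:33333333:44444444\n"
)

# Prints the UUIDs requested at boot that have no matching array.
print(sorted(unmatched_uuids(sample_cmdline, sample_scan)))
```

An empty list would mean every array the boot line names is present,
which is what the blkid output suggests here.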

I did a yum reinstall of the most recent kernel as I thought that might
repair any /boot file system problems - particularly initramfs - but it
made no difference: it will not boot, with the exact same error messages.

Thus I now have it running on the recovery kernel, with all the required 
server functions being performed, albeit on an out of date kernel.

Google has one solved problem similar to mine, but the solution was to
change the BIOS from AHCI to IDE. That does not seem correct as I have
not changed the BIOS, although I have not checked it at this time.

Another solution talks about a race condition, with the md raid not being
ready when required during the boot process, and thus suggests adding a
delay to the kernel boot line in grub2, although no one indicated this
actually worked.
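
For illustration, a rough Python sketch of what adding such a delay could
look like. It only edits text in memory; nothing is written to disk. The
assumptions are mine: that the delay would be given with the kernel's
rootdelay= parameter, that it belongs on the GRUB_CMDLINE_LINUX line of
/etc/default/grub, and that 30 seconds is a reasonable value. The sample
file content is made up.

```python
import re

def add_kernel_arg(grub_defaults, arg):
    """Return grub defaults text with arg set on the GRUB_CMDLINE_LINUX line.

    An existing option with the same key (text before '=') is replaced,
    so calling this twice does not duplicate the option.
    """
    key = arg.split("=", 1)[0]

    def patch(match):
        args = [a for a in match.group(2).split() if a.split("=", 1)[0] != key]
        args.append(arg)
        return match.group(1) + " ".join(args) + match.group(3)

    return re.sub(r'^(GRUB_CMDLINE_LINUX=")(.*)(")$', patch,
                  grub_defaults, flags=re.MULTILINE)

# Hypothetical /etc/default/grub content.
sample_defaults = (
    'GRUB_TIMEOUT=5\n'
    'GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=example rhgb quiet"\n'
)

print(add_kernel_arg(sample_defaults, "rootdelay=30"))
```

On CentOS 7 a change to /etc/default/grub only takes effect after
grub.cfg is regenerated with grub2-mkconfig, so this would be one more
step to get right before risking a reboot.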

Another proposed solution is to mount the failed devices from a recovery
boot and rebuild initramfs. Before I do this I would like to ask those
who know a little more about the boot process: what is going wrong? I
can believe the most recent initramfs being a problem, but all three
other kernels too?? Yet the recovery kernel works just fine.
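
A sketch of what that rebuild might involve, written as a Python dry run
that only prints the commands and executes nothing. The kernel version
string is a placeholder, and keeping a .bak copy of each image is my own
precaution, not part of the proposed solution.

```python
def rebuild_commands(kernel_versions, boot_dir="/boot"):
    """Build (but do not run) backup and dracut commands for each kernel."""
    commands = []
    for version in kernel_versions:
        image = f"{boot_dir}/initramfs-{version}.img"
        # Keep the current image so a bad rebuild can be undone.
        commands.append(["cp", "-p", image, image + ".bak"])
        # dracut --force overwrites the existing image for that kernel.
        commands.append(["dracut", "--force", image, version])
    return commands

# Hypothetical version string; the real ones are the directory names
# under /lib/modules.
for command in rebuild_commands(["3.10.0-example.el7.x86_64"]):
    print(" ".join(command))
```

Printing first and running by hand afterwards leaves room to check each
path before anything under /boot is touched.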

As the system is remote, I would like some understanding of what's up
before I make any changes - if a reboot occurs and fails, it will mean
another trip.

Oh, one other thing: it seems the UPS is not working correctly, thus the
server may not have shut down cleanly. I am working to replace the
batteries in the UPS.

TIA for your insight.

_______________________________________________
CentOS mailing list
CentOS@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos