RE: Why does old kernel boot when new kernel installed?

Nico Morrison <nico.morrison@micronicos.com> · Thu, 6 Feb 2003 14:30:01 -0000

Hello Theodore Ts'o,

Thank you for your points & I take your comments about being professional
...... we DO have a spare machine all set up & ready to go, the fly in this
ointment is the secure server certificate, under which several users are
running their small shops ...... otherwise I'd be thinking of moving the
server & starting again.

I am forwarding this to our techies & will see.

Regards,
Nico Morrison
nico.morrison@micronicos.com
___________________________________________
Micronicos Limited  -  London, UK.
Tel: +44 20 8870 8849 Fax: +44 20 8870 5290
___________________________________________

From: Theodore Ts'o [mailto:tytso@mit.edu]
Sent: 06 February 2003 14:25
To: Nico Morrison
Cc: 'Juri Haberland'; ext3 users list
Subject: Re: Why does old kernel boot when new kernel installed?

On Thu, Feb 06, 2003 at 01:30:15PM -0000, Nico Morrison wrote:
> [root@ns5 boot]# df -k
> Filesystem           1k-blocks      Used Available Use% Mounted on
> /dev/md0              36463784   5642076  28969420  17% /
> none                    510400         0    510400   0% /dev/shm
> 
> Where /boot is ALSO on the RAID1 partition ( this must have been a mistake
> at setup time ..... although the machine works fine apart from a LOT of
> kjournald activity (up to 60% CPU!).)
> 
> Could this be causing GRUB not to see the other kernels & if so what can
> we do?

Um, that would be yes, very likely. 

The big question at this point is how GRUB was actually configured at
installation time.  Either it is using a "preset-menu" embedded into
it at install time (which it uses if it cannot find the configuration
file), or it is reading a configuration file from somewhere else,
namely wherever it was told to look when GRUB was installed.

If you are right in assuming that the configuration of all of your
machines is otherwise identical, and your Linux/Unix
"professionals" didn't perform other improvisations when they
installed that particular server, then creating a /boot filesystem on
/dev/hda1 like the other systems, populating it with the
appropriate files, and then rebooting *may* fix the problem for you.

Or if you're really lucky, /boot already exists in /dev/hda1, but it
wasn't mounted, and once you mount it, you can re-install the newer
kernel, and update the /boot/grub/menu.lst found in /dev/hda1's
filesystem, and you're good to go.
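
A rough sketch of how one might look for such a stray configuration
file without changing anything.  The helper name find_grub_configs and
the mount point /mnt/hda1 are made up for illustration, and the file
names checked are only the usual GRUB (legacy) locations, so treat this
as a starting point rather than a definitive procedure:

```shell
#!/bin/sh
# Sketch only: print any GRUB (legacy) config files found under a tree.
# find_grub_configs is a hypothetical helper, not part of GRUB itself.
# A dedicated /boot partition mounted elsewhere holds grub/ at its top
# level, while a root filesystem holds it under boot/grub/.
find_grub_configs() {
    root=${1:-/}
    for rel in boot/grub/menu.lst boot/grub/grub.conf \
               grub/menu.lst grub/grub.conf; do
        path="${root%/}/$rel"
        if [ -f "$path" ]; then
            echo "$path"
        fi
    done
}

# Example use, as root (read-only mount, so nothing on disk is modified;
# /mnt/hda1 is an assumed, empty mount point):
#   mkdir -p /mnt/hda1
#   mount -o ro /dev/hda1 /mnt/hda1
#   find_grub_configs /mnt/hda1   # config on the unmounted partition?
#   find_grub_configs /           # config on the RAID1 root?
#   umount /mnt/hda1
```

If the mount fails because /dev/hda1 holds no filesystem, that in itself
answers the question, and nothing has been touched.  If both locations
turn up a file, comparing the two should show which one lists only the
old kernel.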

However, a good system administrator, over the years, becomes a
paranoid s.o.b.  Fortunately, the worst case in performing this
particular test would be a reboot; creating or modifying the /boot
partition in /dev/hda1, will, in the worst case, simply result in it
being ignored by grub.  If that doesn't work, however, the next thing
to recommend would be to reinstall grub, or, if at this point your
faith that the system was properly installed has been shaken, and you
are concerned that there may be some other deviations between the
"as designed" and "as built" states of your server, to save the data
disks, and rebuild and reconfigure your server from scratch.

> This is a busy public server with several 100 users ......... we
> have to be very careful doing anything.
>
> Our tech support are Linux/UNIX professionals & are baffled - I am hoping
> for some help here, I am emailing as they don't have the time, look after
> over 100 servers, we only run 12 so I try to dig ....

Since they are professionals, especially if they are maintaining a
large scale site with as many machines as you mentioned, I'm sure they
designed and implemented installation scripts so that server machines
are easily replicable, and can be rebuilt on a moment's notice.  So
rebuilding the system software on your server machine should be
very easy to do.  Better yet, they should
be able to have spare machines on which you can rebuild the system
software from scratch, and where you can test to make sure the machine
boots correctly, etc., and then afterwards, you can schedule downtime,
pull the data disks from the suspect server, install them in the
replacement server, and restore service with very minimal downtime.

What, you say you aren't using separate disks and filesystems to
separate the system software from the user/application data?  And you
don't have turnkey scripts that allow you to rebuild the system
software of your servers in a repeatable and less error-prone fashion?
You *did* say you had professionals in your employ, right?  :-)

Seriously, there are some really basic, fundamental principles of
sound, large-scale system administration that are not being followed,
and the fact that you are using a single gigantic root partition and
are co-mingling system and user data is just one symptom of the fact
that very likely your system administrators are breaking a good number
of these fundamentals.  The one good thing about the current state of
the economy is there are a lot of really good, experienced system
administrators who can understand how to design systems that are
robust and which can be easily serviced and maintained.  I would
seriously suggest that you consider bringing one of them on board as a
member of your team.

							- Ted
