On Sat, 2008-10-11 at 12:48 -0700, Antonio Olivares wrote: > --- On Sat, 10/11/08, Bruno GARDIN <bgardin@xxxxxxxxx> wrote: > > From: Bruno GARDIN <bgardin@xxxxxxxxx> > > Subject: Rawhide does not boot since 2.6.27-0.398 > > To: "For testers of Fedora Core development releases" <fedora-test-list@xxxxxxxxxx> > > Date: Saturday, October 11, 2008, 10:55 AM > > I am testing rawhide for a few month now but i have problem > > of boot > > since kernel 2.6.27-0.398. My rawhide is a virtual system > > on vmware > > server now in version 2.0. Whenever i try to boot, i got > > the following > > errors at the end : > > Activating logical volumes > > VOlume group "VolGroup00" not found > > Unable to access resume device (/dev/VolGroup00/LogVol01) > > Creating root device > > Mounting root file system > > mount: error mounting /dev/root on /sysroot as ext3. No > > such file or directory > > Setting up other filesystems > > setuproot: moving /dev failed:No such file or directory > > setuproot: error mounting/proc: No such file or directory > > setuproot: error mounting /sys: No such file or directory > > Mount failed for selinuxfs on /selinux: No such file or > > directory > > Switching to new root and running init > > swithroot: mount failed: No such file or directory > > Booting has failed > > Boot works fine with kernel 2.6.27-0.382 but fails also > > with 2.6.27-1. > > I have looked at the thread related to ext4 but i am using > > ext3. I > > have also tried a new mkinitrd on 2.6.27-1 but no change. > > Any idea of > > what the problem could be ? The real source of the problem was much earlier in the messages than what was originally provided. I've been trying to track this down myself. Here is the critical bit of information: Kernel that boots: Loading dm-mirror module scsi 2:0:0:0: Direct-Access scsi target2:0:0: Beginning Domain Validation scsi target2:0:0: Domain Validation skipping write testes scsi target2:0:0: Ending Domain Validation: 1204k scsi target2:0:0 FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127) Loading dm-zero module sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Cache data unavailable sd 2:0:0:0: [sda] Assuming drive cache: write through Loading dm-snapshot module sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Cache data unavailable sd 2:0:0:0: [sda] Assuming drive cache: write through sda: sda1 sda2 sd 2:0:0:0: [sda] Attached SCSI disk sd 2:0:0:0: Attached scsi generic sg1 type 0 Making device-mapper control node Scanning logical volumes Reading all physical volumes. This make take a while... Found volume group "VolGroup00" using metadata type lvm2 Kernel that fails: Scanning logical volumes scsi 2:0:0:0: Direct-Access scsi target2:0:0: Beginning Domain Validation scsi target2:0:0: Domain Validation skipping write testes scsi target2:0:0: Ending Domain Validation: 1204k scsi target2:0:0 FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127) Reading all physical volumes. This make take a while... sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Cache data unavailable sd 2:0:0:0: [sda] Assuming drive cache: write through sd 2:0:0:0: [sda] 25165824 512-byte hardware sectors (12885 MB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Cache data unavailable sd 2:0:0:0: [sda] Assuming drive cache: write through sda: sda1 sda2 sd 2:0:0:0: [sda] Attached SCSI disk sd 2:0:0:0: Attached scsi generic sg1 type 0 Activating logical volumes Volume group "VolGroup00" not found Unable to access resume device (/dev/VolGroup00/LogVol01) Note that in the kernel that's failing LVM is starting to scan for logical volumes before the SCSI devices have stabilized! It doesn't find any PV's and so it doesn't find "VolGroup00". That's the killer. Now. Look closely at the one that booted above. What's interspersed with the scsi startup messages? Loading dm-mirror, loading dm-zero, loading dm-snapshot. Then we see the "Scanning logical volumes". Well, guess what. Those modules are not present in the latest kernel as modules. They created enough of a time delay that lvm started after the scsi drivers had settled. Now they are not there, we have a race condition and, if lvm starts to early, lvm can find the drives. The "Scanning logical volumes" is starting exactly where we see the "Loading dm-mirror" in the kernel which does boot. I would call that a smoking gun. What's really interesting (to me) is if you look at the 2.6.26 kernels under F9 (and I'm testing this 2.6.27 kernel under F9 as well as F10). You find ALL of the "Loading dm-*" messages AFTER the scsi drivers have settled. Something has changed here in the 2.6.27 kernels where the scsi drivers either are not settling as fast (debugging messages and checks perhaps) or the insmod is returning sooner (before the drivers have settled) creating this race condition which did not exist at all in 2.6.26. I think this is a bug in mkinitrd and it's not emitting a wait when it's needed in this case. Down in mkinitrd around line 1483 is a check for conditions under which it wants to issue a wait for the scsi to setting. I think that either needs to be made unconditional or at least expanded to include other scsi devices like the VMware ones. I cheated. Up at line 1411 I changed "wait_for_scsi="no"" to "wait_for_scsi="yes"" and rebuild the initrd. Problem goes away and lvm starts AFTER the scsi devices have settled. Another way to do this might be to force the including of usb-storage (mkinitrd --with-usb) which has it's own delay both in loading and settling. > > -- > > BeGe > > > > -- > Try uninstalling the kernel and reinstalling it via yum. > I tried several times and succeeded :) No you didn't. You got lucky and won the race that time. That's the problem with race conditions. Sometimes you win through just plain dumb luck. You may well find it fails on a subsequent reboot and you're screwed. Then again, you may well have changed something else that changes the timing and changes the probability and it then works for you. I'll be filing a bugzilla bug on this later if nobody else gets to it first. > Regards, > Antonio Mike -- Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw@xxxxxxxxxxxx /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/ NIC whois: MHW9 | An optimist believes we live in the best of all PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!
Attachment:
signature.asc
Description: This is a digitally signed message part
-- fedora-test-list mailing list fedora-test-list@xxxxxxxxxx To unsubscribe: https://www.redhat.com/mailman/listinfo/fedora-test-list