I'm not sure what I found, or why it's happening, but I managed to exercise some bug or other in LVM 1.0.5...

We use home-rolled scripts for doing our system backups, and one of the steps creates snapshots of our database filesystems, so that we can dump the snapshots to tape and get a consistent backup image. These scripts were misconfigured, and attempted to create a snapshot of a volume on a volume group that did not exist.

This machine is running Linux 2.4.19, patched with Broadcom Gigabit drivers and LVM 1.0.5 (linux-2.4.19-VFS-lock.patch and lvm-1.0.5-2.4.19-1.burpr.patch, generated by running make in /usr/src/LVM/1.0.5/PATCHES). I then compiled and installed the LVM userland tools from the sources.

This machine has one volume group, vg00, consisting of a single physical volume, /dev/sda4, which is itself a partition of ~100GB on a hardware RAID-10 array.

--->8--[ Cut Here ]--->8--
root@burpr(pts/1):~ 34 # ls -al /dev/vg00
total 47
dr-xr-xr-x    2 root     root          232 Oct  2 02:55 ./
drwxr-xr-x   15 root     root        46926 Oct  2 02:55 ../
brw-rw----    1 root     disk      58,   5 Oct  2 02:55 dat
brw-rw----    1 root     disk      58,   6 Oct  2 02:55 db1
brw-rw----    1 root     disk      58,   7 Oct  2 02:55 db2
crw-r-----    1 root     disk     109,   0 Oct  2 02:55 group
brw-rw----    1 root     disk      58,   3 Oct  2 02:55 home
brw-rw----    1 root     disk      58,   0 Oct  2 02:55 root
brw-rw----    1 root     disk      58,   1 Oct  2 02:55 tmp
brw-rw----    1 root     disk      58,   4 Oct  2 02:55 u
brw-rw----    1 root     disk      58,   8 Oct  2 02:55 unifytmp
brw-rw----    1 root     disk      58,   2 Oct  2 02:55 var
--->8--[ Cut Here ]--->8--

The command which was errantly run was:

--->8--[ Cut Here ]--->8--
lvcreate --size 8G --snapshot --name db1_snap vg01
--->8--[ Cut Here ]--->8--

I got this output:

--->8--[ Cut Here ]--->8--
lvcreate -- "/etc/lvmtab.d/vg01" doesn't exist
lvcreate -- can't create logical volume: volume group "vg01" doesn't exist
--->8--[ Cut Here ]--->8--

That's all well and good, and expected.

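As an aside, a backup script could refuse to call lvcreate at all when the target volume group is unknown. The following is only a sketch in Python, not the actual backup scripts: it assumes that the presence of an entry under /etc/lvmtab.d (the path named in the lvcreate error above) is a good enough test for LVM 1.x, and the names vg_is_known and require_vg are made up for illustration.

```python
import os

# Assumption: LVM 1.x keeps one entry per known volume group here,
# as suggested by the lvcreate error message quoted above.
LVMTAB_DIR = "/etc/lvmtab.d"


def vg_is_known(vg_name, lvmtab_dir=LVMTAB_DIR):
    """Return True if an lvmtab entry exists for the named volume group."""
    return os.path.exists(os.path.join(lvmtab_dir, vg_name))


def require_vg(vg_name, lvmtab_dir=LVMTAB_DIR):
    """Raise ValueError before any LVM tool is run against an unknown VG."""
    if not vg_is_known(vg_name, lvmtab_dir):
        raise ValueError("volume group %r not found in %s" % (vg_name, lvmtab_dir))
    return vg_name
```

Calling require_vg("vg01") ahead of the snapshot step would have turned the misconfiguration into a script-level error without touching /dev/lvm. Whether that would also have avoided the hang described below is not something this sketch can tell you.
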
Well, I saw the backup scripts trying to do this, so I killed them off as cleanly as possible, fixed the configuration, and restarted them. Only now, they got stuck on the first vgscan they tried to run.

Running vgdisplay by hand now, I seem to have "lost" 8GB from my VG: vgdisplay shows 8GB less free space than there should be if you add up the allocations of all the existing LVs. lvscan segfaults, and vgscan hangs while trying to open /dev/lvm. lvcreate hangs as well. Running strace:

--->8--[ Cut Here ]--->8--
root@burpr(pts/1):~ 51 # strace lvcreate --size 256M --snapshot --name unifytmp_snap /dev/vg00/unifytmp vg00
--->8--[ Cut Here ]--->8--

ends up with a hang, and these are the last few lines of the trace:

--->8--[ Cut Here ]--->8--
open("/dev/vg00/group", O_RDONLY) = 3
ioctl(3, 0xc004fe05, 0x80a40b8) = 0
close(3) = 0
stat64("/dev/lvm", {st_mode=S_IFCHR|0640, st_rdev=makedev(109, 0), ...}) = 0
open("/dev/lvm", O_RDONLY) = 3
ioctl(3, 0x8004fe98, 0xbfffec22) = 0
close(3) = 0
stat64("/dev/lvm", {st_mode=S_IFCHR|0640, st_rdev=makedev(109, 0), ...}) = 0
open("/dev/lvm", O_RDONLY) = 3
ioctl(3, 0xff00 <unfinished ...>
--->8--[ Cut Here ]--->8--

The <unfinished ...> is where I gave up after 5 minutes and hit <control>-c.

I have complete straces available of vgscan, lvscan, and lvcreate, as well as the output of lvdisplay for each of the LVs I've got. I also have a core file for lvscan, if that would help, too.

We are going to reboot the server over lunch today; hopefully that will clear out whatever kernel structures are gorked, but I'm really not happy that this happened in the first place, and hope someone here can point me to an answer.

The hardware is a Dell PowerEdge 6600 with a PERC3/DC RAID controller (LSI MegaRAID), six 15krpm 36GB disks in a RAID-10, 8GB memory, and four 1.6GHz Xeon CPUs. It is running SuSE Linux Enterprise Server 7 (essentially a stripped-down SuSE 7.2), kernel.org's 2.4.19 + Broadcom and LVM patches, and LVM 1.0.5.
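
Going back to the strace output above: the raw ioctl request numbers can be split into their fields, which may help whoever matches them against the LVM headers. This is a sketch in Python that assumes the usual Linux x86 _IOC layout (8 bits of command number, 8 bits of type, 14 bits of argument size, 2 bits of direction); decode_ioctl is a made-up helper name, and nothing here says which LVM command each number corresponds to.

```python
# Direction values under the assumed x86 _IOC layout.
DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read|write"}


def decode_ioctl(request):
    """Split an ioctl request number into direction, size, type and nr."""
    return {
        "dir": DIRECTIONS[(request >> 30) & 0x3],  # top 2 bits
        "size": (request >> 16) & 0x3FFF,          # argument size in bytes
        "type": (request >> 8) & 0xFF,             # driver "magic" byte
        "nr": request & 0xFF,                      # command number
    }


if __name__ == "__main__":
    # The three request numbers seen in the trace above.
    for req in (0xC004FE05, 0x8004FE98, 0xFF00):
        print(hex(req), decode_ioctl(req))
```

Under that layout, the two calls that returned both carry type 0xfe and a 4-byte argument, while the call that never returns (0xff00) has type 0xff, command number 0, and no argument.
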
I haven't had any problems yet on another server (PowerEdge 2450, 2x P-III 1GHz, 2GB RAM, same kernel & LVM, different RAID controller).

I've tried to be thorough in my data collection; let me know if there's something more needed to debug this.

TIA

-- 
Gregory K. Ade <gkade@bigbrother.net>
http://bigbrother.net/~gkade
OpenPGP Key ID: EAF4844B  keyserver: pgpkeys.mit.edu