Hi Ilya,

It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe does. It is used intensively by ceph-disk and inevitably leads to races where a device temporarily disappears.

The same command (sgdisk 0.8.8) on Ubuntu 14.04 with a 3.13.0-62-generic kernel only generates two udev change events and does not remove / add partitions.

The source code did not change in a significant way between sgdisk 0.8.6 and sgdisk 0.8.8, and the output of strace -e ioctl sgdisk -i 2 /dev/vdb is identical in both environments:

ioctl(3, BLKGETSIZE, 20971520) = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, BLKGETSIZE, 20971520) = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKGETSIZE, 20971520) = 0
ioctl(3, BLKGETSIZE64, 10737418240) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0
ioctl(3, BLKSSZGET, 512) = 0

This leads me to the conclusion that the difference is in how the kernel reacts to these ioctls. What do you think?
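A minimal sketch of how two such strace outputs could be compared mechanically rather than by eye. It assumes each trace was saved to a file (the file names on the command line are hypothetical) and it only compares the ioctl request names in order, ignoring arguments and return values. The function names are illustrative, not part of ceph-disk or sgdisk.

```python
import re
import sys
from collections import Counter

# Matches the start of lines such as: ioctl(3, BLKSSZGET, 512) = 0
# Group 1 is the file descriptor, group 2 the request name.
IOCTL_RE = re.compile(r"ioctl\((\d+),\s*([^,)\s]+)")


def ioctl_requests(trace_text):
    """Return the ordered list of ioctl request names found in strace output."""
    return [match.group(2) for match in IOCTL_RE.finditer(trace_text)]


def compare_traces(trace_a, trace_b):
    """Return (same_sequence, counts_a, counts_b) for two strace outputs."""
    requests_a = ioctl_requests(trace_a)
    requests_b = ioctl_requests(trace_b)
    return requests_a == requests_b, Counter(requests_a), Counter(requests_b)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumption: the two arguments are files holding the saved strace output
    # from each environment.
    with open(sys.argv[1]) as file_a, open(sys.argv[2]) as file_b:
        same, counts_a, counts_b = compare_traces(file_a.read(), file_b.read())
    print("identical request sequence:", same)
    print("first trace: ", dict(counts_a))
    print("second trace:", dict(counts_b))
```

If the request sequences match, as reported above, the userspace side issues the same calls in both environments and any difference in udev events has to come from elsewhere.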
Cheers

On 17/12/2015 17:26, Ilya Dryomov wrote:
> On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
>> Hi Sage,
>>
>> On 17/12/2015 14:31, Sage Weil wrote:
>>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>>> Hi Ilya,
>>>>
>>>> This is another puzzling behavior (the log of all commands is at
>>>> http://tracker.ceph.com/issues/14094#note-4). in a nutshell, after a
>>>> series of sgdisk -i commands to examine various devices including
>>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will showup
>>>> again although I don't have a definitive proof of this).
>>>>
>>>> It looks like a side effect of a previous partprobe command, the only
>>>> command I can think of that removes / re-adds devices. I thought calling
>>>> udevadm settle after running partprobe would be enough to ensure
>>>> partprobe completed (and since it takes as much as 2mn30 to return, I
>>>> would be shocked if it does not ;-).
>
> Yeah, IIRC partprobe goes through every slot in the partition table,
> trying to first remove and then add the partition back. But, I don't
> see any mention of partprobe in the log you referred to.
>
> Should udevadm settle for a few vd* devices be taking that much time?
> I'd investigate that regardless of the issue at hand.
>
>>>>
>>>> Any idea ? I desperately try to find a consistent behavior, something
>>>> reliable that we could use to say : "wait for the partition table to be
>>>> up to date in the kernel and all udev events generated by the partition
>>>> table update to complete".
>>>
>>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>>> settle from something running from udev. Instead, of a udev-triggered
>>> run of ceph-disk does something that changes the partitions, it
>>> should just exit and let udevadm run ceph-disk again on the new
>>> devices...?
>
>>
>> Unless I missed something this is on CentOS 7 and ceph-disk is only
>> called from udev as ceph-disk trigger which does nothing else but
>> asynchronously delegate the work to systemd. Therefore there is no
>> udevadm settle from within udev (which would deadlock and timeout
>> every time... I hope ;-).
>
> That's a sure lockup, until one of them times out.
>
> How are you delegating to systemd? Is it to avoid long-running udev
> events? I'm probably missing something - udevadm settle wouldn't block
> on anything other than udev, so if you are shipping work off to
> somewhere else, udev can't be relied upon for waiting.
>
> Thanks,
>
>                 Ilya
>

--
Loïc Dachary, Artisan Logiciel Libre