Zombie OSD filesystems rise from the grave during bluestore conversion

While converting a luminous cluster from filestore to bluestore, we
are running into a weird race condition on a fairly regular basis.

We have a master script that writes upgrade scripts for each OSD
server.  The script for an OSD looks like this:

# Drain the OSD and wait until it is safe to remove.
ceph osd out 68
while ! ceph osd safe-to-destroy 68 ; do sleep 10 ; done
# Stop the daemon, and kill it if the stop didn't take.
systemctl stop ceph-osd@68
sleep 10
systemctl kill ceph-osd@68
sleep 10
# Tear down the filestore OSD and recreate it as bluestore.
umount /var/lib/ceph/osd/ceph-68
ceph osd destroy 68 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sda --destroy
ceph-volume lvm create --bluestore --data /dev/sda --osd-id 68
sleep 10
# Wait for the cluster to settle before moving on.
while [ "$(ceph health)" != "HEALTH_OK" ] ; do ceph health ; sleep 10 ; done

(It's run with sh -e so any error will cause an abort.)
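
For concreteness, each generated script is run along these lines (the
filename here is made up for illustration):

sh -e upgrade-osd-68.sh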

The problem we run into is that, in about 1 out of 10 runs, the script
gets to the "lvm zap" stage and fails:

--> Zapping: /dev/sda
Running command: wipefs --all /dev/sda2
Running command: dd if=/dev/zero of=/dev/sda2 bs=1M count=10
 stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.00667608 s, 1.6 GB/s
--> Destroying partition since --destroy was used: /dev/sda2
Running command: parted /dev/sda --script -- rm 2
--> Unmounting /dev/sda1
Running command: umount -v /dev/sda1
 stderr: umount: /var/lib/ceph/tmp/mnt.9k0GDx (/dev/sda1) unmounted
Running command: wipefs --all /dev/sda1
 stderr: wipefs: error: /dev/sda1: probing initialization failed:
 stderr: Device or resource busy
-->  RuntimeError: command returned non-zero exit status: 1

And, lo and behold, it's right: /dev/sda1 has been remounted as
/var/lib/ceph/osd/ceph-68.
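
One way to catch the culprit in the act (a diagnostic sketch we have
not actually run) would be to watch udev block-device events while the
zap executes, then check whether and where the filesystem came back:

# Log udev block events while the zap races whatever is remounting us.
udevadm monitor --udev --subsystem-match=block &
MONPID=$!
ceph-volume lvm zap /dev/sda --destroy || true
findmnt /var/lib/ceph/osd/ceph-68 || echo "not mounted"
kill $MONPID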

That's after the OSD has been stopped, killed, and destroyed; there
*is no* osd.68.  It happens after the filesystem has been unmounted
twice (once by our explicit umount and once by "lvm zap").  The "lvm
zap" umount shown here, with the path /var/lib/ceph/tmp/mnt.9k0GDx,
suggests that the remount is happening in the background somewhere
while the lvm zap is running.
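
Our best guess (unconfirmed) is that udev is doing it: ceph-disk ships
a rule, 95-ceph-osd.rules, that activates OSD partitions on
block-device events, and parted rewriting the partition table would
generate exactly such events.  If that theory is right, parking the
rule for the duration of the conversion should be a workaround; a
sketch, assuming the rule lives in /lib/udev/rules.d as it does on
Debian:

# Assumption: ceph-disk's udev rule is re-activating the partition.
# Park it during the conversion, then put it back.
mv /lib/udev/rules.d/95-ceph-osd.rules /root/95-ceph-osd.rules.off
udevadm control --reload
ceph osd destroy 68 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sda --destroy
ceph-volume lvm create --bluestore --data /dev/sda --osd-id 68
mv /root/95-ceph-osd.rules.off /lib/udev/rules.d/95-ceph-osd.rules
udevadm control --reload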

If we do the zap before the osd destroy, the same thing happens, but
the (still-existing) OSD does not actually restart.  So it's just the
filesystem that won't stay unmounted long enough for us to destroy it,
not the whole OSD.
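
As a stopgap (not a root-cause fix), the script could just retry the
zap until it wins the race, something like:

# Stopgap sketch: re-unmount and retry the zap until it succeeds.
# The until-condition is not subject to sh -e, so the loop survives
# failed attempts.
until ceph-volume lvm zap /dev/sda --destroy ; do
    umount /var/lib/ceph/osd/ceph-68 2>/dev/null || true
    sleep 5
done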

What's causing this?  How do we keep the filesystem from lurching out
of the grave in mid-conversion like this?

This is on Debian Stretch with systemd, if that matters.

Thanks!