Re: Zombie OSD filesystems rise from the grave during bluestore conversion

That's probably the ceph-disk udev script being triggered by
something, somewhere (and a lot of things can trigger that script...)
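
If you want to confirm it's udev, watch the events while the zap runs
and look for the ceph-disk rule firing. Rough sketch only (rule file
names depend on packaging; on luminous it's typically 95-ceph-osd.rules):

# print udev events with their properties while the conversion runs
udevadm monitor --udev --property

# list the udev rules on the host that mention ceph
grep -l ceph /lib/udev/rules.d/* /etc/udev/rules.d/* 2>/dev/null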

Work-around: convert everything to ceph-volume simple first by running
"ceph-volume simple scan" and "ceph-volume simple activate"; that
disables the ceph-disk udev handling in the intended way.
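
As a rough sketch (exact arguments vary by release; on luminous you
point scan at the mounted OSD directory):

# record the filestore OSD's metadata in /etc/ceph/osd/<id>-<fsid>.json
ceph-volume simple scan /var/lib/ceph/osd/ceph-68

# hand activation over to ceph-volume; this is the step that disables
# the ceph-disk systemd/udev units so nothing remounts the OSD later
ceph-volume simple activate --all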

BTW: you can run destroy before stopping the OSD; in that case you
won't need --yes-i-really-mean-it, as long as the OSD is drained.
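
Something like this at the top of your per-OSD script (sketch only,
using the same OSD id as your example):

ceph osd out 68
while ! ceph osd safe-to-destroy 68 ; do sleep 10 ; done
# OSD is drained but still running, so no --yes-i-really-mean-it needed
ceph osd destroy 68
systemctl stop ceph-osd@68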

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Nov 4, 2019 at 6:33 PM J David <j.david.lists@xxxxxxxxx> wrote:
>
> While converting a luminous cluster from filestore to bluestore, we
> are running into a weird race condition on a fairly regular basis.
>
> We have a master script that writes upgrade scripts for each OSD
> server.  The script for an OSD looks like this:
>
> # mark the OSD out and wait until all data has drained off it
> ceph osd out 68
> while ! ceph osd safe-to-destroy 68 ; do sleep 10 ; done
> # stop the daemon and unmount its filestore filesystem
> systemctl stop ceph-osd@68
> sleep 10
> systemctl kill ceph-osd@68
> sleep 10
> umount /var/lib/ceph/osd/ceph-68
> # destroy the OSD, wipe the disk, recreate it as bluestore with the same id
> ceph osd destroy 68 --yes-i-really-mean-it
> ceph-volume lvm zap /dev/sda --destroy
> ceph-volume lvm create --bluestore --data /dev/sda --osd-id 68
> # wait for the cluster to backfill to HEALTH_OK before moving on
> sleep 10
> while [ "`ceph health`" != "HEALTH_OK" ] ; do ceph health; sleep 10 ; done
>
> (It's run with sh -e so any error will cause an abort.)
>
> The problem we run into is that in about 1 out of 10 runs, this gets
> to the "lvm zap" stage and fails:
>
> --> Zapping: /dev/sda
> Running command: wipefs --all /dev/sda2
> Running command: dd if=/dev/zero of=/dev/sda2 bs=1M count=10
>  stderr: 10+0 records in
> 10+0 records out
> 10485760 bytes (10 MB, 10 MiB) copied, 0.00667608 s, 1.6 GB/s
> --> Destroying partition since --destroy was used: /dev/sda2
> Running command: parted /dev/sda --script -- rm 2
> --> Unmounting /dev/sda1
> Running command: umount -v /dev/sda1
>  stderr: umount: /var/lib/ceph/tmp/mnt.9k0GDx (/dev/sda1) unmounted
> Running command: wipefs --all /dev/sda1
>  stderr: wipefs: error: /dev/sda1: probing initialization failed:
>  stderr: Device or resource busy
> -->  RuntimeError: command returned non-zero exit status: 1
>
> And, lo and behold, it's right: /dev/sda1 has been remounted as
> /var/lib/ceph/osd/ceph-68.
>
> That's after the OSD has been stopped, killed, and destroyed; there
> *is no* osd.68.  It happens after the filesystem has been unmounted
> twice (once by an explicit umount and once by "lvm zap").  The "lvm
> zap" umount shown here with the path /var/lib/ceph/tmp/mnt.9k0GDx
> suggests that the remount is happening in the background somewhere
> while the lvm zap is running.
>
> If we do the zap before the osd destroy, the same thing happens but
> the (still-existing) OSD does not actually restart.  So it's just the
> filesystem that won't stay unmounted long enough to destroy it, not
> the whole OSD.
>
> What's causing this?  How do we keep the filesystem from lurching out
> of the grave in mid-conversion like this?
>
> This is on Debian Stretch with systemd, if that matters.
>
> Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



