Hello cephers,
So I am having trouble with a new hardware system that shows strange OSD behavior, and I want to replace a disk with a brand new one to test a theory.
I run all daemons in containers, and on one of the nodes I have a mon, a mgr, and 6 OSDs. So, following https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd,
I stopped the container running osd.23, waited until it was down and out, ran the safe-to-destroy loop, and then destroyed the OSD, all from the mon container on this node. All good.
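For reference, the safe-to-destroy/destroy part looked roughly like this, all through the mon container (osd.23 being the example ID, the sleep interval is arbitrary):

while ! podman exec ceph-mon-storage2n2-la ceph osd safe-to-destroy osd.23; do sleep 60; done
podman exec ceph-mon-storage2n2-la ceph osd destroy 23 --yes-i-really-mean-it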
Then I swapped the SSDs and started running the remaining steps (from step 3) using the same mon container. I have no Ceph packages installed on the bare-metal box. It looks like the mon container doesn't
see the disk:
podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
stderr: lsblk: /dev/sdh: not a block device
stderr: error: /dev/sdh: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
[--osd-fsid OSD_FSID]
[DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: /dev/sdh
Error: exit status 2
root@storage2n2-la:~# ls -l /dev/sd
sda sdc sdd sde sdf sdg sdg1 sdg2 sdg5 sdh
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap sdh
stderr: lsblk: sdh: not a block device
stderr: error: sdh: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
[--osd-fsid OSD_FSID]
[DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: sdh
Error: exit status 2
When I execute lsblk inside the mon container, it does see device sdh:
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
lsblk: dm-1: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-1: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-3: failed to get device path
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdf 8:80 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sdg 8:96 0 223.5G 0 disk
|-sdg5 8:101 0 223G 0 part
|-sdg1 8:97 487M 0 part
`-sdg2 8:98 1K 0 part
sde 8:64 0 1.8T 0 disk
sdc 8:32 0 3.5T 0 disk
sda 8:0 0 3.5T 0 disk
sdh 8:112 0 3.5T 0 disk
So I used another OSD container (osd.5) on the same node and ran all of the operations (zap and prepare) successfully.
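Roughly like this (the OSD container name here is illustrative, and osd-id 23 is the ID I am re-using):

podman exec -it ceph-osd-5 ceph-volume lvm zap /dev/sdh
podman exec -it ceph-osd-5 ceph-volume lvm prepare --bluestore --data /dev/sdh --osd-id 23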
I suspect that the mon and mgr containers have no access to /dev or /var/lib/ceph, while the OSD containers do. The cluster was originally configured by ceph-ansible (Nautilus 14.2.2).
The question: if I want to replace all the disks on a single node, and I have 6 nodes with pools at replication 3, is it safe to restart the mgr container with the /dev and /var/lib/ceph volumes mounted (they are not mounted right now)?
I cannot use the other OSD containers on the same box, because my controller reverts from RAID to non-RAID mode with all disks lost, not just a single one. So I need to replace all 6 OSDs to get them running back
in containers, and the only things that will remain operational on the node are the mon and mgr containers.
I would prefer not to install the full Ceph server or client packages on the bare-metal node if possible.
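Concretely, I would just add the missing bind mounts to the podman run command in the ceph-ansible-generated systemd unit for the mgr (everything else in the unit staying as ceph-ansible wrote it), something like:

  -v /dev:/dev \
  -v /var/lib/ceph:/var/lib/ceph \

and then do a systemctl daemon-reload and restart the mgr container.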
Thank you for your help,