Hi all, looks like I might have pooched something. Between the two nodes I have, I moved all the PGs to one machine, reformatted the other machine, rebuilt that machine, and moved the PGs back. In both cases I did this by marking the OSDs on the machine being emptied "out", waiting for health to be restored, and then taking them down. This worked great up to the point where the mon/manager/rgw were still on the machine they started on, and all the OSDs/PGs were on the other machine, the one that had just been rebuilt.

The next step was to rebuild the master machine, copy /etc/ceph and /var/lib/ceph over with cpio, and then re-add new OSDs on the master machine, as it were. This didn't work so well. The master has come up just fine, but it's not connecting to the OSDs. Of the four OSDs, only two came up; the other two (IDs 1 and 3) did not. For its part, the OSD machine is reporting lines like the following in its logs:

> [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries left: 2
> [2019-01-20 16:22:15,111][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> [2019-01-20 16:22:15,271][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with fsid e3bfc69e-a145-4e19-aac2-5f888e1ed2ce

I see this for the volumes:

> [root@gw02 ceph]# ceph-volume lvm list
>
> ====== osd.1 =======
>
>   [block]    /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
>
>       type                  block
>       osd id                1
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              4672bb90-8cea-4580-85f2-1e692811a05a
>       encrypted             0
>       cephx lockbox secret
>       block uuid            3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
>       block device          /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
>       vdo                   0
>       crush device class    None
>       devices               /dev/sda3
>
> ====== osd.3 =======
>
>   [block]    /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
>
>       type                  block
>       osd id                3
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              084cf33d-8a38-4c82-884a-7c88e3161479
>       encrypted             0
>       cephx lockbox secret
>       block uuid            PSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
>       block device          /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
>       vdo                   0
>       crush device class    None
>       devices               /dev/sdb3
>
> ====== osd.5 =======
>
>   [block]    /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
>
>       type                  block
>       osd id                5
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              e854930d-1617-4fe7-b3cd-98ef284643fd
>       encrypted             0
>       cephx lockbox secret
>       block uuid            F5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
>       block device          /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
>       vdo                   0
>       crush device class    None
>       devices               /dev/sdc3
>
> ====== osd.7 =======
>
>   [block]    /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
>
>       type                  block
>       osd id                7
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              5c0d0404-390e-4801-94a9-da52c104206f
>       encrypted             0
>       cephx lockbox secret
>       block uuid            wgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
>       block device          /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
>       vdo                   0
>       crush device class    None
>       devices               /dev/sdd3
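Lining the two outputs up, the fsid the trigger is asking for on osd.1 (e3bfc69e-a145-4e19-aac2-5f888e1ed2ce) doesn't appear anywhere in the lvm list; the LV for osd.1 carries osd fsid 4672bb90-8cea-4580-85f2-1e692811a05a. If I understand ceph-volume correctly, what "lvm list" prints comes from tags on the LVs themselves, so my first idea for a cross-check is something like the following sketch (the ceph.osd_id/ceph.osd_fsid tag names, and multi-user.target.wants as the place the enablement symlinks live on this CentOS box, are my guesses about where this state is kept):

    # Show the tags ceph-volume stamped on each LV; ceph.osd_id and
    # ceph.osd_fsid should match the "osd id"/"osd fsid" lines above.
    lvs -o lv_name,lv_tags --noheadings

    # Show which ceph-volume unit instances are enabled; the instance
    # name encodes <osd id>-<osd fsid>, so a stale fsid here would
    # explain the "could not find osd.1 with fsid ..." retries.
    ls -l /etc/systemd/system/multi-user.target.wants/ceph-volume@*

If the tags agree with the lvm list output and only the unit names carry the odd fsid, then at least the data side looks intact and it's just the activation metadata that's stale.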
What I am also wondering is whether device mapper has lost something with a kernel or library change:

> [root@gw02 ceph]# ls -l /dev/dm*
> brw-rw----. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> brw-rw----. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> brw-rw----. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> brw-rw----. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> brw-rw----. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
>
> [root@gw02 ~]# dmsetup ls
> ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f (253:1)
> ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479 (253:4)
> ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd (253:2)
> hndc1.centos02-root (253:0)
> ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a (253:3)

How can I debug this? I suspect this is just some kind of UUID swap that happened somewhere, but I don't know what the chain of truth is through the database files to connect the two together and make sure I have the correct OSD blocks where the mon expects to find them.

Thanks! Brian
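P.S. If the LVM tags do turn out to be the good copy, my tentative plan (untested, so corrections welcome) is to drop the unit instance that carries the stale fsid and let ceph-volume re-create it from the tags, roughly:

    # Remove the enablement symlink for the instance with the old fsid
    # (unit names follow ceph-volume@lvm-<osd id>-<osd fsid>).
    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service

    # Re-activate osd.1 using the fsid the LVM tags actually carry;
    # this should re-enable a unit with the matching name.
    ceph-volume lvm activate 1 4672bb90-8cea-4580-85f2-1e692811a05a

    # Or re-activate everything ceph-volume can discover in one go:
    ceph-volume lvm activate --all

But I'd rather understand how the names drifted apart before I start disabling things.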