Hi all, looks like I might have pooched something. Between the two nodes I have, I moved all the PGs to one machine, reformatted the other machine, rebuilt that machine, and moved the PGs back. In both cases I did this by marking the OSDs on the machine being emptied "out", waiting for health to be restored, and then taking them down. This worked great up to the point where the mon/manager/rgw were still on the machine they started on, and all the OSDs/PGs were on the other machine, the one that had just been rebuilt.

The next step was to rebuild the master machine, copy /etc/ceph and /var/lib/ceph over with cpio, and then re-add new OSDs on the master machine, as it were. This didn't work so well. The master has come up just fine, but it's not connecting to the OSDs. Of the four OSDs, only two came up; the other two (IDs 1 and 3) did not. For its part, the OSD machine is reporting lines like the following in its logs:

> [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries left: 2
> [2019-01-20 16:22:15,111][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> [2019-01-20 16:22:15,271][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with fsid e3bfc69e-a145-4e19-aac2-5f888e1ed2ce

I see this for the volumes:

> [root@gw02 ceph]# ceph-volume lvm list
>
> ====== osd.1 =======
>
>   [block]    /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
>
>       type                  block
>       osd id                1
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              4672bb90-8cea-4580-85f2-1e692811a05a
>       encrypted             0
>       cephx lockbox secret
>       block uuid            3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
>       block device          /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
>       vdo                   0
>       crush device class    None
>       devices               /dev/sda3
>
> ====== osd.3 =======
>
>   [block]    /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
>
>       type                  block
>       osd id                3
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              084cf33d-8a38-4c82-884a-7c88e3161479
>       encrypted             0
>       cephx lockbox secret
>       block uuid            PSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
>       block device          /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
>       vdo                   0
>       crush device class    None
>       devices               /dev/sdb3
>
> ====== osd.5 =======
>
>   [block]    /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
>
>       type                  block
>       osd id                5
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              e854930d-1617-4fe7-b3cd-98ef284643fd
>       encrypted             0
>       cephx lockbox secret
>       block uuid            F5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
>       block device          /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
>       vdo                   0
>       crush device class    None
>       devices               /dev/sdc3
>
> ====== osd.7 =======
>
>   [block]    /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
>
>       type                  block
>       osd id                7
>       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>       cluster name          ceph
>       osd fsid              5c0d0404-390e-4801-94a9-da52c104206f
>       encrypted             0
>       cephx lockbox secret
>       block uuid            wgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
>       block device          /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
>       vdo                   0
>       crush device class    None
>       devices               /dev/sdd3
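Lining the two outputs up, the fsid the trigger is asking for on osd.1 (e3bfc69e-a145-4e19-aac2-5f888e1ed2ce) doesn't appear anywhere in the lvm list; the LV for osd.1 carries osd fsid 4672bb90-8cea-4580-85f2-1e692811a05a. If I understand ceph-volume correctly, what "lvm list" prints comes from tags on the LVs themselves, so my first idea for a cross-check is something like the following sketch (the ceph.osd_id/ceph.osd_fsid tag names, and multi-user.target.wants as the place the enablement symlinks live on this CentOS box, are my guesses about where this state is kept):

    # Show the tags ceph-volume stamped on each LV; ceph.osd_id and
    # ceph.osd_fsid should match the "osd id"/"osd fsid" lines above.
    lvs -o lv_name,lv_tags --noheadings

    # Show which ceph-volume unit instances are enabled; the instance
    # name encodes <osd id>-<osd fsid>, so a stale fsid here would
    # explain the "could not find osd.1 with fsid ..." retries.
    ls -l /etc/systemd/system/multi-user.target.wants/ceph-volume@*

If the tags agree with the lvm list output and only the unit names carry the odd fsid, then at least the data side looks intact and it's just the activation metadata that's stale.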
What I am also wondering is whether device mapper has lost something with a kernel or library change:

> [root@gw02 ceph]# ls -l /dev/dm*
> brw-rw----. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> brw-rw----. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> brw-rw----. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> brw-rw----. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> brw-rw----. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
>
> [root@gw02 ~]# dmsetup ls
> ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f (253:1)
> ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479 (253:4)
> ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd (253:2)
> hndc1.centos02-root (253:0)
> ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a (253:3)

How can I debug this? I suspect this is just some kind of UUID swap that happened somewhere, but I don't know what the chain of truth is through the database files to connect the two together and make sure I have the correct OSD blocks where the mon expects to find them.

Thanks! Brian
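P.S. If the LVM tags do turn out to be the good copy, my tentative plan (untested, so corrections welcome) is to drop the unit instance that carries the stale fsid and let ceph-volume re-create it from the tags, roughly:

    # Remove the enablement symlink for the instance with the old fsid
    # (unit names follow ceph-volume@lvm-<osd id>-<osd fsid>).
    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service

    # Re-activate osd.1 using the fsid the LVM tags actually carry;
    # this should re-enable a unit with the matching name.
    ceph-volume lvm activate 1 4672bb90-8cea-4580-85f2-1e692811a05a

    # Or re-activate everything ceph-volume can discover in one go:
    ceph-volume lvm activate --all

But I'd rather understand how the names drifted apart before I start disabling things.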