On Sun, Jan 20, 2019 at 11:30 PM Brian Topping <brian.topping@xxxxxxxxx> wrote:
>
> Hi all, looks like I might have pooched something. Between the two nodes I have, I moved all the PGs to one machine, reformatted the other machine, rebuilt that machine, and moved the PGs back. In both cases, I did this by marking the OSDs on the machine being moved "out", waiting for health to be restored, and then taking them down.
>
> This worked great up to the point where I had the mon/manager/rgw on the machine where they started and all the OSDs/PGs on the other machine that had been rebuilt. The next step was to rebuild the master machine, copy /etc/ceph and /var/lib/ceph with cpio, then re-add new OSDs on the master machine as it were.
>
> This didn't work so well. The master has come up just fine, but it's not connecting to the OSDs. Of the four OSDs, only two came up; the other two (IDs 1 and 3) did not. For its part, the OSD machine is reporting lines like the following in its logs:
>
> > [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries left: 2
> > [2019-01-20 16:22:15,111][ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> > [2019-01-20 16:22:15,271][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.1 with fsid e3bfc69e-a145-4e19-aac2-5f888e1ed2ce

When creating an OSD, ceph-volume captures the ID and the FSID and uses them
to create a systemd unit. When the system boots, that unit queries LVM for
devices matching the ID/FSID pair. Is it possible you attempted to create an
OSD, it failed, and you tried again? That would explain a systemd unit whose
FSID doesn't match anything. From the output it does look like you have an
osd.1, just with a different FSID (467... instead of e3b...).
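If you want to confirm that before changing anything, compare the units that
got enabled against what LVM actually has tagged on that node. A rough sketch
(the wants/ path assumes a stock systemd layout; adjust if yours differs):

    # units enabled by ceph-volume at create time; the name encodes the ID/FSID pair
    ls /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume@lvm

    # the ID/FSID pairs ceph-volume stored as LVM tags, which is what
    # "ceph-volume lvm trigger" tries to match at boot
    lvs -o lv_name,lv_tags | grep ceph.osd_fsid

Any unit whose FSID doesn't appear in the lv_tags output is a leftover from a
failed attempt and is safe to disable.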
You could try to disable the failing systemd unit with:

    systemctl disable ceph-volume@lvm-1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce.service

(and do the same for OSD 3's unit), then run:

    ceph-volume lvm activate --all

Hopefully that gets you back to activated OSDs.

> I see this for the volumes:
>
> > [root@gw02 ceph]# ceph-volume lvm list
> >
> > ====== osd.1 =======
> >
> >   [block]    /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> >
> >       type                  block
> >       osd id                1
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              4672bb90-8cea-4580-85f2-1e692811a05a
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
> >       block device          /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sda3
> >
> > ====== osd.3 =======
> >
> >   [block]    /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> >
> >       type                  block
> >       osd id                3
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              084cf33d-8a38-4c82-884a-7c88e3161479
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            PSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
> >       block device          /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sdb3
> >
> > ====== osd.5 =======
> >
> >   [block]    /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> >
> >       type                  block
> >       osd id                5
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              e854930d-1617-4fe7-b3cd-98ef284643fd
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            F5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
> >       block device          /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sdc3
> >
> > ====== osd.7 =======
> >
> >   [block]    /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> >
> >       type                  block
> >       osd id                7
> >       cluster fsid          1cf94ce9-1323-4c43-865f-68f4ae9e6af3
> >       cluster name          ceph
> >       osd fsid              5c0d0404-390e-4801-94a9-da52c104206f
> >       encrypted             0
> >       cephx lockbox secret
> >       block uuid            wgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
> >       block device          /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> >       vdo                   0
> >       crush device class    None
> >       devices               /dev/sdd3
>
> What I am wondering is whether device mapper has lost something with a kernel or library change:
>
> > [root@gw02 ceph]# ls -l /dev/dm*
> > brw-rw----. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> > brw-rw----. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> > brw-rw----. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> > brw-rw----. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> > brw-rw----. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> > [root@gw02 ~]# dmsetup ls
> > ceph--1f3d4406--af86--4813--8d06--a001c57408fa-osd--block--5c0d0404--390e--4801--94a9--da52c104206f  (253:1)
> > ceph--f5f453df--1d41--4883--b0f8--d662c6ba8bea-osd--block--084cf33d--8a38--4c82--884a--7c88e3161479  (253:4)
> > ceph--033e2bbe--5005--45d9--9ecd--4b541fe010bd-osd--block--e854930d--1617--4fe7--b3cd--98ef284643fd  (253:2)
> > hndc1.centos02-root  (253:0)
> > ceph--c7640f3e--0bf5--4d75--8dd4--00b6434c84d9-osd--block--4672bb90--8cea--4580--85f2--1e692811a05a  (253:3)
>
> How can I debug this?
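As for how to debug the mapping itself: the "osd fsid" values that ceph-volume
lvm list prints are what is stored on the devices, and the monitors keep their
own copy of each OSD's uuid in the OSDMap, so comparing the two tells you
whether the blocks are where the mon expects them. A rough sketch
(ceph-bluestore-tool ships with the ceph-osd package; the first command needs
a node with an admin keyring):

    # the uuid at the end of each "osd.N" line is the fsid the cluster expects
    ceph osd dump | grep '^osd\.'

    # the ID/fsid actually recorded in each block device's label
    for lv in /dev/ceph-*/osd-block-*; do
        ceph-bluestore-tool show-label --dev "$lv" | grep -E '"whoami"|"osd_uuid"'
    done

If those agree, the data side is fine and it's only the stale systemd units
that need cleaning up.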
> I suspect this is just some kind of a UID swap that happened somewhere, but I don't know what the chain of truth is through the database files to connect the two together and make sure I have the correct OSD blocks where the mon expects to find them.
>
> Thanks! Brian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com