Tim,

Thank you for your guidance. Your points are completely understood. It was
more that I couldn't figure out why the Dashboard was telling me that the
destroyed OSD was still using /dev/sdi when the physical disk with that
serial number was at /dev/sdc, and when another OSD was also reporting
/dev/sdi. I figured that there must be some information buried somewhere.
I don't know where this metadata comes from or how it gets updated when
things like 'drive letters' change, but the metadata matched what the
Dashboard showed, so now I know something new.

Regarding the process for bringing the OSD back online with a new HDD, I am
still having some difficulties. I used the steps in the Adding/Removing
OSDs document under Removing the OSD, and the OSD mostly appears to be
gone. However, attempts to use 'ceph-volume lvm prepare' to build the
replacement OSD are failing, and the same is true of 'ceph orch daemon add
osd'. I think the problem might be that the NVMe LV that was the WAL/DB for
the failed OSD did not get cleaned up, but on my systems four OSDs share
the same NVMe drive for WAL/DB, so I'm not sure how to proceed.

Any suggestions would be welcome.

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
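On the WAL/DB cleanup question above, a minimal sketch of one way to
proceed, assuming a cephadm/Reef deployment, that the destroyed OSD is
osd.12 and the replacement disk is /dev/sdc, and that the VG/LV names and
the LV size below (ceph-db-vg, osd12-db, 115G) are hypothetical
placeholders for whatever 'ceph-volume lvm list' actually reports:

    # Enter the cluster's container context so ceph-volume can see the
    # cluster keyrings and LVM tags.
    cephadm shell

    # List all ceph-volume-managed LVs. Each entry carries LVM tags with
    # its osd id and role (block vs. db), so the DB LV still tagged for
    # osd.12 can be told apart from the DB LVs of the healthy OSDs
    # sharing the same NVMe. Double-check this before zapping anything.
    ceph-volume lvm list

    # Zap only the devices tagged for osd.12. --destroy removes the
    # matching LVs but leaves the VG and the other OSDs' DB LVs alone.
    # (If this complains because the old data disk is gone, an lvremove
    # of the orphaned DB LV by name should achieve the same result.)
    ceph-volume lvm zap --osd-id 12 --destroy

    # Recreate a DB LV in the freed space, sized like its siblings, then
    # prepare the new OSD. Since osd.12 was destroyed rather than purged,
    # --osd-id 12 should let the new OSD take over the old id.
    lvcreate -L 115G -n osd12-db ceph-db-vg
    ceph-volume lvm prepare --data /dev/sdc --block.db ceph-db-vg/osd12-db --osd-id 12

If the orchestrator route is preferred instead, 'ceph orch daemon add osd
ceph09:/dev/sdc' can create the OSD in one step once the stale LVM metadata
is zapped, though it will choose its own DB placement unless a service spec
says otherwise.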
On Tue, Oct 29, 2024 at 3:13 PM Tim Holloway <timh@xxxxxxxxxxxxx> wrote:

> Take care when reading the output of "ceph osd metadata". When you are
> running the OSD as an administered service, it's running in a container,
> and a container is a miniature VM. So, for example, it may report your
> OS as "CentOS Stream 8" even if your actual machine is running Ubuntu.
>
> The biggest pitfall is in paths, because in certain cases - definitely
> for OSDs - internally the path for the OSD metadata and data store will
> be /var/lib/ceph/osd, but the actual path in the machine's OS will be
> /var/lib/ceph/<fsid>/osd, where the container simply mounts that for its
> internal path.
>
> In other words, "ceph osd metadata" formulates its reports by having the
> containers assemble the report data, so the output is the OSD's internal
> view, not your server's view.
>
> Tim
>
> On 10/28/24 14:01, Dave Hall wrote:
> > Hello.
> >
> > Thanks to Robert's reply to 'Influencing the osd.id' I've learned two
> > new commands today. I can now see that 'ceph osd metadata' confirms
> > that I have two OSDs pointing to the same physical disk name:
> >
> > root@ceph09:/# ceph osd metadata 12 | grep sdi
> >     "bluestore_bdev_devices": "sdi",
> >     "device_ids": "nvme0n1=SAMSUNG_MZPLL1T6HEHP-00003_S3HBNA0KA03264,sdi=SEAGATE_ST12000NM0027_*ZJV5TX47*0000C9470ZWA",
> >     "device_paths": "nvme0n1=/dev/disk/by-path/pci-0000:83:00.0-nvme-1,sdi=/dev/disk/by-path/pci-0000:41:00.0-sas-phy18-lun-0",
> >     "devices": "nvme0n1,sdi",
> >     "objectstore_numa_unknown_devices": "nvme0n1,sdi",
> > root@ceph09:/# ceph osd metadata 9 | grep sdi
> >     "bluestore_bdev_devices": "sdi",
> >     "device_ids": "nvme1n1=Samsung_SSD_983_DCT_M.2_1.92TB_S48DNC0N701016D,sdi=SEAGATE_ST12000NM0027_*ZJV5SMTQ*0000C9128FE0",
> >     "device_paths": "nvme1n1=/dev/disk/by-path/pci-0000:01:00.0-nvme-1,sdi=/dev/disk/by-path/pci-0000:41:00.0-sas-phy6-lun-0",
> >     "devices": "nvme1n1,sdi",
> >     "objectstore_numa_unknown_devices": "sdi",
> >
> > Even though OSD 12 is reporting sdi, at least it is pointing to the
> > serial number of the failed disk. However, the disk with that serial
> > number currently resides at /dev/sdc.
> >
> > Is there a way to force the record for the destroyed OSD to point to
> > /dev/sdc?
> >
> > Thanks.
> >
> > -Dave
> >
> > --
> > Dave Hall
> > Binghamton University
> > kdhall@xxxxxxxxxxxxxx
> >
> > On Mon, Oct 28, 2024 at 11:47 AM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
> >
> >     Hello.
> >
> >     The following is on a Reef Podman installation:
> >
> >     In attempting to deal over the weekend with a failed OSD disk, I
> >     have somehow managed to have two OSDs pointing to the same HDD, as
> >     shown below.
> >
> >     image.png
> >
> >     To be sure, the failure occurred on OSD.12, which was pointing to
> >     /dev/sdi.
> >
> >     I disabled the systemd unit for OSD.12 because it kept restarting.
> >     I then destroyed it.
> >
> >     When I physically removed the failed disk and rebooted the system,
> >     the disk enumeration changed. So, before the reboot, OSD.12 was
> >     using /dev/sdi. After the reboot, OSD.9 moved to /dev/sdi.
> >
> >     I didn't know that I had an issue until 'ceph-volume lvm prepare'
> >     failed. It was in the process of investigating this that I found
> >     the above. Right now I have reinserted the failed disk and
> >     rebooted, hoping that OSD.12 would find its old disk by some other
> >     means, but no joy.
> >
> >     My concern is that if I run 'ceph osd rm' I could take out OSD.9.
> >     I could take the precaution of marking OSD.9 out and letting it
> >     drain, but I'd rather not. I am, perhaps, more inclined to
> >     manually clear the lingering configuration associated with OSD.12
> >     if someone could send me the list of commands. Otherwise, I'm open
> >     to suggestions.
> >
> >     Thanks.
> >
> >     -Dave
> >
> >     --
> >     Dave Hall
> >     Binghamton University
> >     kdhall@xxxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
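A footnote on the device-name confusion discussed above: kernel names like
/dev/sdi are assigned at enumeration time and can move between reboots,
while the serial-based links under /dev/disk/by-id/ and ceph-volume's LVM
tags stay stable. A minimal sketch of cross-checking them, assuming it is
run on the OSD host and using the ZJV5TX47 serial taken from the 'ceph osd
metadata 12' output quoted above:

    # Map the failed disk's serial to its current kernel name; the
    # by-id symlinks survive re-enumeration.
    ls -l /dev/disk/by-id/ | grep ZJV5TX47

    # Show serials next to the current kernel names for every disk.
    lsblk -o NAME,SERIAL,SIZE,TYPE

    # ceph-volume identifies OSDs by LVM tags (osd id and osd fsid),
    # not by kernel name, so this is the safer reference for deciding
    # which LVs belong to osd.9 versus osd.12 before removing anything.
    ceph-volume lvm list

Note also that 'ceph osd rm' only removes the OSD's entry from the cluster
map; it does not touch any disk. It is the LVM cleanup that has to be
aimed at the right LVs.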