Dave,

If there's one bitter lesson I learned from IBM's OS/2, it was that one
should never store critical information in two different repositories.
There Should Be Only One. You may replicate it, but at the end of the
day, if you don't have a single point of authority, you'll suffer.

Regrettably, Ceph has issues there. Very frequently, data displayed in
the Dashboard does not match data from the Ceph command line, which to
me indicates that the information isn't always coming from the same
place. To be clear, I'm not talking about the old /etc/ceph stuff
versus the more modern configuration database; I'm talking about cases
where info apparently comes sometimes from components (such as directly
from an OSD) and sometimes from somewhere else, and they're not staying
in sync.

I feel your pain. For certain versions of Ceph, it is possible to have
the same OSD defined as both administered and legacy. The administered
stuff tends to have dynamically-defined systemd units, which means you
can't simply delete the offending service file, or even find it, unless
you know where such things live.

Go back through this list's history to about June and you'll see a lot
of wailing from me about that sort of thing, and about the "phantom
host" issue, where a non-Ceph host managed to insinuate itself into the
mix and took forever to expunge. I'm very grateful to Eugen for the
help there. It's possible you might find some insights if you wade
through it.

To the best of my knowledge, everything relating to an OSD resides in
one of three places:

1. The /etc/ceph directory (mostly deprecated, except for maybe the
   keyring). And of course, the FSID!

2. The Ceph configuration repository (possibly the keyring; not sure
   if much else).

3. The Ceph OSD directory under /var/lib/ceph. Whether legacy or
   administered, the exact path may differ, but the overall layout is
   the same: one directory per OSD. Everything important relating to
   the OSD is there, or at least linked from there.

You haven't fully purged a defective OSD until it no longer has a
presence in any of the "ceph osd tree" output, the "ceph orch ps"
output, or the OSD host's systemctl list as an "osd" service. That's
easier said than done, but setting the unwanted OSD's weights to 0 is
a major help (first sketch below).

In one particular case where I had a doubly-defined OSD, I think I
ultimately cured it by turning off the OSD, deleting the service file
for the legacy OSD definition from /etc/systemd/system, then drawing a
deep breath and doing an "rm -rf" on the legacy /var/lib/ceph/osd.xx
directory, leaving /var/lib/ceph/<fsid>/osd.xxx alone, followed by an
OSD restart (second sketch below). But do check my previously-mentioned
messages to make sure there aren't some "gotchas" that I forgot.

If you have issues with the raw data store under the OSD, it would take
someone wiser and braver than me to repair it without first deleting
all OSD definitions that reference it, zapping the raw data store to
remove all Ceph admin and LVM info that might offend Ceph, and then
re-defining the OSD on the cleaned data store (third sketch below).
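For what it's worth, a rough sketch of those purge checks, with osd.12
standing in for whatever ID you're chasing (substitute your own):

    # stop new data landing on it, then take it out so it drains
    ceph osd crush reweight osd.12 0
    ceph osd out 12
    # it isn't really gone until all three of these come up empty
    ceph osd tree | grep 'osd\.12'
    ceph orch ps | grep osd.12
    systemctl list-units 'ceph*' | grep -i osd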
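And a sketch of the cure for the doubly-defined case. The unit and
directory names below follow the usual legacy (ceph-osd@<id>) and
administered (ceph-<fsid>@osd.<id>) conventions, but these vary by
release, so verify every name and path on your own host before
deleting anything:

    # stop the OSD and remove the legacy unit definition
    systemctl stop ceph-osd@12.service
    rm /etc/systemd/system/ceph-osd@12.service
    systemctl daemon-reload
    # deep breath: remove the legacy data directory ONLY,
    # leaving /var/lib/ceph/<fsid>/osd.12 strictly alone
    rm -rf /var/lib/ceph/osd/ceph-12
    # then restart the administered OSD
    systemctl start ceph-<fsid>@osd.12.service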
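The zap-and-redefine route, assuming /dev/sdX is the disk in question
and "myhost" is the OSD host (both placeholders), might look like:

    # wipe Ceph and LVM traces from the raw device
    ceph-volume lvm zap --destroy /dev/sdX
    # if a leftover WAL/DB LV on a shared NVMe is what's offending
    # 'ceph-volume lvm prepare', zap can also take just that one LV,
    # leaving the other OSDs' DB volumes on the same drive alone:
    #     ceph-volume lvm zap --destroy /dev/<vg>/<db-lv>
    # then hand the cleaned device back to the orchestrator
    ceph orch daemon add osd myhost:/dev/sdX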
While Ceph can be a bit crotchety, I'll give it credit for one thing:
even broken, it's robust enough that I've never lost or corrupted the
actual data, despite the fact that I've done an uncomfortable amount of
stuff where I'm just randomly banging on things with a hammer. I still
do backups, though. :)

Now if I could just persuade the auto-tuner to actually adjust the pg
sizes the way I told it to.

Tim

On Tue, 2024-10-29 at 22:37 -0400, Dave Hall wrote:
> Tim,
>
> Thank you for your guidance. Your points are completely understood.
> It was more that I couldn't figure out why the Dashboard was telling
> me that the destroyed OSD was still using /dev/sdi when the physical
> disk with that serial number was at /dev/sdc, and when another OSD
> was also reporting /dev/sdi. I figured that there must be some
> information buried somewhere. I don't know where this metadata comes
> from or how it gets updated when things like 'drive letters' change,
> but the metadata matched what the dashboard showed, so now I know
> something new.
>
> Regarding the process for bringing the OSD back online with a new
> HDD, I am still having some difficulties. I used the steps in the
> Adding/Removing OSDs document under Removing the OSD, and the OSD
> mostly appears to be gone. However, attempts to use 'ceph-volume lvm
> prepare' to build the replacement OSD are failing. Same thing with
> 'ceph orch daemon add osd'.
>
> I think the problem might be that the NVMe LV that was the WAL/DB
> for the failed OSD did not get cleaned up, but on my systems 4 OSDs
> use the same NVMe drive for WAL/DB, so I'm not sure how to proceed.
>
> Any suggestions would be welcome.
>
> Thanks.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdhall@xxxxxxxxxxxxxx
>
>
> On Tue, Oct 29, 2024 at 3:13 PM Tim Holloway <timh@xxxxxxxxxxxxx>
> wrote:
>
> > Take care when reading the output of "ceph osd metadata". When you
> > are running the OSD as an administered service, it's running in a
> > container, and a container is a miniature VM. So, for example, it
> > may report your OS as "CentOS Stream 8" even if your actual
> > machine is running Ubuntu.
> >
> > The biggest pitfall is in paths, because in certain cases -
> > definitely for OSDs - internally the path for the OSD metadata and
> > data store will be /var/lib/ceph/osd, but the actual path in the
> > machine's OS will be /var/lib/ceph/<fsid>/osd, where the container
> > simply mounts that for its internal path.
> >
> > In other words, "ceph osd metadata" formulates its reports by
> > having the containers assemble the report data, and the output is
> > thus the OSD's internal view, not your server's view.
> >
> > Tim
> >
> >
> > On 10/28/24 14:01, Dave Hall wrote:
> > > Hello.
> > >
> > > Thanks to Robert's reply to 'Influencing the osd.id' I've
> > > learned two new commands today.
> > > I can now see that 'ceph osd metadata' confirms that I have two
> > > OSDs pointing to the same physical disk name:
> > >
> > > root@ceph09:/# ceph osd metadata 12 | grep sdi
> > >     "bluestore_bdev_devices": "sdi",
> > >     "device_ids": "nvme0n1=SAMSUNG_MZPLL1T6HEHP-00003_S3HBNA0KA03264,sdi=SEAGATE_ST12000NM0027_*ZJV5TX47*0000C9470ZWA",
> > >     "device_paths": "nvme0n1=/dev/disk/by-path/pci-0000:83:00.0-nvme-1,sdi=/dev/disk/by-path/pci-0000:41:00.0-sas-phy18-lun-0",
> > >     "devices": "nvme0n1,sdi",
> > >     "objectstore_numa_unknown_devices": "nvme0n1,sdi",
> > > root@ceph09:/# ceph osd metadata 9 | grep sdi
> > >     "bluestore_bdev_devices": "sdi",
> > >     "device_ids": "nvme1n1=Samsung_SSD_983_DCT_M.2_1.92TB_S48DNC0N701016D,sdi=SEAGATE_ST12000NM0027_*ZJV5SMTQ*0000C9128FE0",
> > >     "device_paths": "nvme1n1=/dev/disk/by-path/pci-0000:01:00.0-nvme-1,sdi=/dev/disk/by-path/pci-0000:41:00.0-sas-phy6-lun-0",
> > >     "devices": "nvme1n1,sdi",
> > >     "objectstore_numa_unknown_devices": "sdi",
> > >
> > > However, even though OSD 12 is saying sdi, at least it is
> > > pointing to the serial number of the failed disk. The disk with
> > > that serial number, though, is currently residing at /dev/sdc.
> > >
> > > Is there a way to force the record for the destroyed OSD to
> > > point to /dev/sdc?
> > >
> > > Thanks.
> > >
> > > -Dave
> > >
> > > --
> > > Dave Hall
> > > Binghamton University
> > > kdhall@xxxxxxxxxxxxxx
> > >
> > > On Mon, Oct 28, 2024 at 11:47 AM Dave Hall
> > > <kdhall@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hello.
> > >
> > > The following is on a Reef Podman installation:
> > >
> > > In attempting to deal over the weekend with a failed OSD disk, I
> > > have somehow managed to have two OSDs pointing to the same HDD,
> > > as shown below.
> > >
> > > [image.png]
> > >
> > > To be sure, the failure occurred on OSD.12, which was pointing
> > > to /dev/sdi.
> > >
> > > I disabled the systemd unit for OSD.12 because it kept
> > > restarting. I then destroyed it.
> > >
> > > When I physically removed the failed disk and rebooted the
> > > system, the disk enumeration changed. So, before the reboot,
> > > OSD.12 was using /dev/sdi. After the reboot, OSD.9 moved to
> > > /dev/sdi.
> > >
> > > I didn't know that I had an issue until 'ceph-volume lvm
> > > prepare' failed. It was in the process of investigating this
> > > that I found the above. Right now I have reinserted the failed
> > > disk and rebooted, hoping that OSD.12 would find its old disk by
> > > some other means, but no joy.
> > >
> > > My concern is that if I run 'ceph osd rm' I could take out
> > > OSD.9. I could take the precaution of marking OSD.9 out and
> > > letting it drain, but I'd rather not. I am, perhaps, more
> > > inclined to manually clear the lingering configuration
> > > associated with OSD.12 if someone could send me the list of
> > > commands. Otherwise, I'm open to suggestions.
> > >
> > > Thanks.
> > >
> > > -Dave
> > >
> > > --
> > > Dave Hall
> > > Binghamton University
> > > kdhall@xxxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx