Re: [External Email] Re: Recreate Destroyed OSD

Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> · Wed, 6 Nov 2024 17:04:36 +0100 (CET)

----- On 1 Nov 24, at 19:28, Dave Hall kdhall@xxxxxxxxxxxxxx wrote:

> Tim,
> 
> Actually, the links that Eugen shared earlier were sufficient.  I ended up
> with
> 
> service_type: osd
> service_name: osd
> placement:
>  host_pattern: 'ceph01'
> spec:
>  data_devices:
>    rotational: 1
>  db_devices:
>    rotational: 0
> 
> 
> This worked exactly right as far as creating the OSD - it found and reused
> the same OSD number that was previously destroyed, and also recreated the
> WAL/DB LV using the 'blank spot' on the NVMe drive.
> 
> However, I'm a bit concerned that the output of 'ceph orch ls osd' has
> changed in a way that might not be quite good:
> 
> NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
> osd               32  3m ago     52m  ceph01
> 
> 

Hi Dave,

Was this cluster adopted by cephadm, that is, converted from a non-containerized to a containerized cluster?

This would explain why you have these 32 OSDs bound to this unmanaged and actually non-existent (I just figured that out...) 'osd' service.

This service appears in 'ceph orch ls' output as soon as an OSD is converted to a container and restarted with a service_name set to osd.osd in its systemd unit.meta file. But this service doesn't actually exist, in the sense that you cannot remove it (hence the 'ceph orch rm' issue we discussed here a couple of weeks back). You could use this service, but I would avoid doing so, as it doesn't really exist (there may be a bug here) and will disappear as soon as no more OSDs reference it.

> Before all of this started this line used to contain the word 'unmanaged'
> somewhere.  Eugen and I were having a side discussion about how to make all
> of my OSDs managed without destroying them, so I could do things like 'ceph
> orch restart osd' to restart all of the OSDs to ensure that they pick up
> changes to attributes like osd_memory_target and osd_memory_target_autotune.
> 
> So, in applying this spec, did I make all my OSDs managed, or just all of
> the ones on ceph01, or just the one that got created when I applied the
> spec?
> 

By applying this spec, you've asked the orchestrator to create OSDs out of any available disks it finds on host ceph01 **only**.
The orchestrator will not try to create OSDs on any other host, since you have limited the scope of the service to this particular host (via host_pattern).

Note that this reduced scope doesn't prevent the orchestrator from restarting all OSDs should you run 'ceph orch restart osd.osd'.

> When I add my next host, should I change the placement to that host name or
> to '*'?

You could enumerate all hosts one by one or use a pattern like 'ceph0[1-2]'.
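
As a rough sketch, this is the spec you applied with only the placement widened. It assumes the next host is named ceph02 (a hypothetical name inferred from the pattern) and that host_pattern is matched as a shell-style glob; everything else is copied from your spec. Running it with '--dry-run' first, as Eugen suggested, should show which hosts and disks it would actually pick up.

```yaml
service_type: osd
service_name: osd
placement:
  # Glob-style pattern; intended to match ceph01 and ceph02
  # (ceph02 is an assumed hostname, adjust to your naming)
  host_pattern: 'ceph0[1-2]'
spec:
  data_devices:
    rotational: 1   # rotational disks (HDDs) are used as data devices
  db_devices:
    rotational: 0   # non-rotational devices (e.g. NVMe) hold the WAL/DB
```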

You may also use regex patterns, depending on the version of Ceph that you're using. Check [1].
Regex patterns should be available in the next minor Quincy release, 17.2.8.
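
For releases that include the change in [1], my understanding is that host_pattern can be written as a mapping with an explicit pattern type. Treat the field names below as an assumption to verify against the documentation for your release; the pattern itself is only an illustration.

```yaml
placement:
  host_pattern:
    # Assumed regex placement syntax; verify it for your Ceph version
    pattern: 'ceph0[1-2]'
    pattern_type: regex
```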

[1] https://github.com/ceph/ceph/pull/53803
[2] https://github.com/ceph/ceph/pull/56222

> 
> More generally, is there a higher level document that talks about Ceph spec
> files and the orchestrator - something that deals with the general concepts?

I think Eugen got you on track. :-)

Cheers,
Frédéric.

> 
> Thanks.
> 
> -Dave
> 
> --
> Dave Hall
> Binghamton University
> kdhall@xxxxxxxxxxxxxx
> 
> On Fri, Nov 1, 2024 at 1:40 PM Tim Holloway <timh@xxxxxxxxxxxxx> wrote:
> 
>> I can't offer a spec off the cuff, but if the LV still exists and you
>> don't need to change its size, then I'd zap it to remove residual Ceph
>> info because otherwise the operation will complain and fail.
>>
>> Having done that, the requirements should be the same as a first-time
>> construction of an OSD on that LV. Eugen can likely give you the spec
>> info. I'd have to RTFM.
>>
>>     Tim
>>
>>
>> On 11/1/24 11:22, Dave Hall wrote:
>> > Tim, Eugen,
>> >
>> > So what would a spec file look like for a single OSD that uses a specific
>> > HDD (/dev/sdi) and with WAL/DB on an LV that's 25% of a specific NVMe
>> > drive?  Regarding the NVMe, there are 3 other OSDs already using 25% each
>> > of this NVMe for WAL/DB, but I have removed the LV that was used by the
>> > failed OSD.  Do I need to pre-create the LV, or will 'ceph orch' do that
>> > for me?
>> >
>> > Thanks.
>> >
>> > -Dave
>> >
>> > --
>> > Dave Hall
>> > Binghamton University
>> > kdhall@xxxxxxxxxxxxxx
>> >
>> > On Thu, Oct 31, 2024 at 3:52 PM Tim Holloway <timh@xxxxxxxxxxxxx> wrote:
>> >
>> >> I migrated from gluster when I found out it's going unsupported shortly.
>> >> I'm really not big enough for Ceph proper, but there were only so many
>> >> supported distributed filesystems with triple redundancy.
>> >>
>> >> Where I got into trouble was that I started off with Octopus and Octopus
>> >> had some teething pains. Like stalling scheduled operations until the
>> >> system was clean but the only way to get a clean system was to run the
>> >> stalled operations. Pacific cured that for me.
>> >>
>> >> But the docs were and remain somewhat fractured between legacy and
>> >> managed services and I managed to get into a real mess there, especially
>> >> since I was wildly trying anything to get those stalled fixes to take.
>> >>
>> >> Since then, I've pretty much redefined all my OSDs with fewer but larger
>> >> datastores and made them all managed. Now if I could just persuade the
>> >> auto-tuner to fix the PG sizes.
>> >>
>> >> I'm in the process of opening a ticket account right now. The fun part
>> >> of this is that realistically, older docs need a re-write just as much
>> >> as the docs for the current release.
>> >>
>> >>      Tim
>> >>
>> >> On 10/31/24 15:39, Eugen Block wrote:
>> >>> I completely understand your point of view. Our own main cluster is
>> >>> also a bit "wild" in its OSD layout, that's why its OSDs are
>> >>> "unmanaged" as well. When we adopted it via cephadm, I started to
>> >>> create suitable osd specs for all those hosts and OSDs and I gave up.
>> >>> :-D But since we sometimes also tend to experiment a bit, I rather
>> >>> have full control over it. That's why we also have
>> >>> osd_crush_initial_weight = 0, to check the OSD creation before letting
>> >>> Ceph remap any PGs.
>> >>>
>> >>> It definitely couldn't hurt to clarify the docs, you can always report
>> >>> on tracker.ceph.com if you have any improvement ideas.
>> >>>
>> >>> Quoting Tim Holloway <timh@xxxxxxxxxxxxx>:
>> >>>
>> >>>> I have been slowly migrating towards spec files as I prefer
>> >>>> declarative management as a rule.
>> >>>>
>> >>>> However, I think that we may have a dichotomy in the user base.
>> >>>>
>> >>>> On the one hand, users with dozens/hundreds of servers/drives of
>> >>>> basically identical character.
>> >>>>
>> >>>> On the other, I'm one who's running fewer servers and for historical
>> >>>> reasons they tend to be wildly individualistic and often have blocks
>> >>>> of future-use space reserved for non-ceph storage.
>> >>>>
>> >>>> Ceph, left to its own devices (no pun intended) can be quite
>> >>>> enthusiastic about adopting any storage it can find. And that's great
>> >>>> for users in the first category. Which is what the spec information
>> >>>> in the supplied links is emphasizing. But for us lesser creatures who
>> >>>> feel the need to manually control where each OSD goes and how it's
>> >>>> configured, it's not so simple. I'm fairly certain that there's
>> >>>> documentation on the spec file setup for that sort of stuff in the
>> >>>> online docs, but it's located somewhere else and I cannot recall where.
>> >>>>
>> >>>> At any rate I would consider it very important that the different
>> >>>> ways to set up an OSD should explicitly indicate which type of OSD
>> >>>> will be generated in their documentation.
>> >>>>
>> >>>>     Tim
>> >>>>
>> >>>>
>> >>>> On 10/31/24 14:28, Eugen Block wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> the preferred method to deploy OSDs in cephadm-managed clusters is
>> >>>>> spec files, see this part of the docs [0] for more information. I
>> >>>>> would just not use the '--all-available-devices' flag, except in
>> >>>>> test clusters, or if you're really sure that this is what you want.
>> >>>>>
>> >>>>> If you use 'ceph orch daemon add osd ...', you'll end up with one
>> >>>>> (or more) OSD(s), but they will be unmanaged, as you already noted
>> >>>>> in your own cluster. There are a couple of examples with advanced
>> >>>>> specs (e. g. DB/WAL on dedicated devices) in the docs as well [1].
>> >>>>> So my recommendation would be to have a suitable spec file for your
>> >>>>> disk layout. You can always check with the '--dry-run' flag before
>> >>>>> actually applying it:
>> >>>>>
>> >>>>> ceph orch apply -i osd-spec.yaml --dry-run
>> >>>>>
>> >>>>> Regards,
>> >>>>> Eugen
>> >>>>>
>> >>>>> [0] https://docs.ceph.com/en/latest/cephadm/services/osd/#deploy-osds
>> >>>>> [1] https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications
>> >>>>>
>> >>>>> Quoting Tim Holloway <timh@xxxxxxxxxxxxx>:
>> >>>>>
>> >>>>>> As I understand it, the manual OSD setup is only for legacy
>> >>>>>> (non-container) OSDs. Directory locations are wrong for managed
>> >>>>>> (containerized) OSDs, for one.
>> >>>>>>
>> >>>>>> Actually, the whole manual setup docs ought to be moved out of the
>> >>>>>> mainline documentation. In their present arrangement, they make
>> >>>>>> legacy setup sound like the preferred method. And have you noticed
>> >>>>>> that there is no corresponding well-marked section titled
>> >>>>>> "Automated (cephadm) setup"?
>> >>>>>>
>> >>>>>> This is how we end up with OSDs that are simultaneously legacy AND
>> >>>>>> administered for the same OSD, since at last count there are no
>> >>>>>> interlocks within Ceph to prevent such a mess.
>> >>>>>>
>> >>>>>>     Tim
>> >>>>>>
>> >>>>>> On 10/31/24 13:39, Dave Hall wrote:
>> >>>>>>> Hello.
>> >>>>>>>
>> >>>>>>> Sorry if it appears that I am reposting the same issue under a
>> >>>>>>> different topic.  However, I feel that the problem has moved and I
>> >>>>>>> now have different questions.
>> >>>>>>>
>> >>>>>>> At this point I have, I believe, removed all traces of OSD.12 from my
>> >>>>>>> cluster - based on steps in the Reef docs at
>> >>>>>>> https://docs.ceph.com/en/reef/rados/operations/add-or-rm-osds/#.
>> >>>>>>> I have further located and removed the WAL/DB LV on an associated
>> >>>>>>> NVMe drive (shared with 3 other OSDs).
>> >>>>>>>
>> >>>>>>> I don't believe the instructions for replacing an OSD (ceph-volume
>> >>>>>>> lvm prepare) still apply, so I have been trying to work with the
>> >>>>>>> instructions under ADDING AN OSD (MANUAL).
>> >>>>>>>
>> >>>>>>> However, since my installation is containerized (Podman), it is
>> >>>>>>> unclear which steps should be issued on the host and which within
>> >>>>>>> 'cephadm shell'.
>> >>>>>>>
>> >>>>>>> There is also another ambiguity:  In step 3 the instruction is to
>> >>>>>>> 'mkfs -t {fstype}' and then to 'mount -o user_xattr'.  However, which
>> >>>>>>> fs type?
>> >>>>>>>
>> >>>>>>> After this, in step 4, the 'ceph-osd -i {osd-id} --mkfs --mkkey'
>> >>>>>>> throws errors about the keyring file.
>> >>>>>>>
>> >>>>>>> So, are these the right instructions to be using in a containerized
>> >>>>>>> installation?  Are there, in general, alternate documents for
>> >>>>>>> containerized installations?
>> >>>>>>>
>> >>>>>>> Lastly, the above cited instructions don't say anything about the
>> >>>>>>> separate WAL/DB LV.
>> >>>>>>>
>> >>>>>>> Please advise.
>> >>>>>>>
>> >>>>>>> Thanks.
>> >>>>>>>
>> >>>>>>> -Dave
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Dave Hall
>> >>>>>>> Binghamton University
>> >>>>>>> kdhall@xxxxxxxxxxxxxx
>> >>>>>>> _______________________________________________
>> >>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>> >>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx