Re: [External Email] Re: Recreate Destroyed OSD

Eugen Block <eblock@xxxxxx> · Wed, 06 Nov 2024 11:29:13 +0000

Dave,

I noticed that the advanced osd spec docs are missing a link to  
placement-by-pattern-matching docs (thanks to Zac and Adam for picking  
that up):

https://docs.ceph.com/en/latest/cephadm/services/#placement-by-pattern-matching

But according to that, your host_pattern specification should have  
worked as well and target only the one node. But you don't have 32  
OSDs running on ceph01, correct? I wonder if that is a bug. Do you  
still have the cephadm and mgr logs from when you applied that spec?
I tried to reproduce it on a couple of versions from Quincy to Squid,
but I couldn't. In every version the specified pattern applied only to
the one node I had there.
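
To make the matching behaviour concrete, here is a minimal Python sketch.
It assumes that cephadm evaluates host_pattern as an fnmatch-style (shell
glob) pattern against each hostname, which is how the linked docs describe
the default. The host names other than ceph01 are invented for
illustration, and matching_hosts is a hypothetical helper, not a cephadm
function.

```python
from fnmatch import fnmatchcase

# Hypothetical inventory; only ceph01 is taken from the thread.
HOSTS = ["ceph01", "ceph02", "ceph03", "ceph010"]


def matching_hosts(pattern, hosts=HOSTS):
    """Return the hosts selected by a glob-style pattern.

    fnmatchcase matches the whole name, so a literal pattern such as
    'ceph01' selects only the host with exactly that name.
    """
    return [host for host in hosts if fnmatchcase(host, pattern)]


print(matching_hosts("ceph01"))     # ['ceph01']
print(matching_hosts("ceph0[12]"))  # ['ceph01', 'ceph02']
print(matching_hosts("*"))          # every host in the list
```

Under that assumption, a pattern without wildcards behaves like a
single-entry "hosts" list, which is consistent with what I saw in my tests.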

Quoting Eugen Block <eblock@xxxxxx>:

Hi,

if you choose "host_pattern", it will try to match the blob as a
regexp, which is probably why you see what you currently see. If you
want the spec to be only applicable to a single node, you need to  
specify only "hosts" and then add a list (or a single entry), for  
example:

service_type: osd
service_name: osd
placement:
  hosts:
  - ceph01
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
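
If the same layout should later cover more hosts, the "hosts" list can
simply grow. The following Python sketch renders such a spec as YAML text.
render_osd_spec is a hypothetical helper and ceph02 is an invented host
name; the keys and values are copied from the spec above.

```python
def render_osd_spec(hosts):
    """Render an OSD service spec (YAML text) for an explicit host list."""
    if not hosts:
        raise ValueError("at least one host is required")
    lines = [
        "service_type: osd",
        "service_name: osd",
        "placement:",
        "  hosts:",
    ]
    # One list entry per host, indented like the spec in the thread.
    lines += [f"  - {host}" for host in hosts]
    lines += [
        "spec:",
        "  data_devices:",
        "    rotational: 1",
        "  db_devices:",
        "    rotational: 0",
    ]
    return "\n".join(lines) + "\n"


# ceph02 is a made-up second host for illustration.
print(render_osd_spec(["ceph01", "ceph02"]), end="")
```

The output can be saved to a file and checked with the '--dry-run' flag
before it is actually applied.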

Quoting Dave Hall <kdhall@xxxxxxxxxxxxxx>:

Tim,

Actually, the links that Eugen shared earlier were sufficient.  I ended up
with

service_type: osd
service_name: osd
placement:
 host_pattern: 'ceph01'
spec:
 data_devices:
   rotational: 1
 db_devices:
   rotational: 0

This worked exactly right as far as creating the OSD - it found and reused
the same OSD number that was previously destroyed, and also recreated the
WAL/DB LV using the 'blank spot' on the NVMe drive.

However, I'm a bit concerned that the output of 'ceph orch ls osd' has
changed in a way that might not be quite good:

NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd               32  3m ago     52m  ceph01

Before all of this started this line used to contain the word 'unmanaged'
somewhere.  Eugen and I were having a side discussion about how to make all
of my OSDs managed without destroying them, so I could do things like 'ceph
orch restart osd' to restart all of the OSDs to ensure that they pick up
changes to attributes like osd_memory_target and osd_memory_target_autotune.

So, in applying this spec, did I make all my OSDs managed, or just all of
the ones on ceph01, or just the one that got created when I applied the
spec?

When I add my next host, should I change the placement to that host name or
to '*'?

More generally, is there a higher level document that talks about Ceph spec
files and the orchestrator - something that deals with the general concepts?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx

On Fri, Nov 1, 2024 at 1:40 PM Tim Holloway <timh@xxxxxxxxxxxxx> wrote:

I can't offer a spec off the cuff, but if the LV still exists and you
don't need to change its size, then I'd zap it to remove residual Ceph
info because otherwise the operation will complain and fail.

Having done that, the requirements should be the same as a first-time
construction of an OSD on that LV. Eugen can likely give you the spec
info. I'd have to RTFM.

   Tim

On 11/1/24 11:22, Dave Hall wrote:
Tim, Eugen,

So what would a spec file look like for a single OSD that uses a specific
HDD (/dev/sdi) and with WAL/DB on an LV that's 25% of a specific NVMe
drive?  Regarding the NVMe, there are 3 other OSDs already using 25% each
of this NVMe for WAL/DB, but I have removed the LV that was used by the
failed OSD.  Do I need to pre-create the LV, or will 'ceph orch' do that
for me?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx

On Thu, Oct 31, 2024 at 3:52 PM Tim Holloway <timh@xxxxxxxxxxxxx> wrote:

I migrated from gluster when I found out it's going unsupported shortly.
I'm really not big enough for Ceph proper, but there were only so many
supported distributed filesystems with triple redundancy.

Where I got into trouble was that I started off with Octopus and Octopus
had some teething pains. Like stalling scheduled operations until the
system was clean, but the only way to get a clean system was to run the
stalled operations. Pacific cured that for me.

But the docs were and remain somewhat fractured between legacy and
managed services and I managed to get into a real mess there, especially
since I was wildly trying anything to get those stalled fixes to take.

Since then, I've pretty much redefined all my OSDs with fewer but larger
datastores and made them all managed. Now if I could just persuade the
auto-tuner to fix the PG sizes...

I'm in the process of opening a ticket account right now. The fun part
of this is that realistically, older docs need a re-write just as much
as the docs for the current release.

     Tim

On 10/31/24 15:39, Eugen Block wrote:
I completely understand your point of view. Our own main cluster is
also a bit "wild" in its OSD layout, that's why its OSDs are
"unmanaged" as well. When we adopted it via cephadm, I started to
create suitable osd specs for all those hosts and OSDs and I gave up.
:-D But since we sometimes also tend to experiment a bit, I rather
have full control over it. That's why we also have
osd_crush_initial_weight = 0, to check the OSD creation before letting
Ceph remap any PGs.

It definitely couldn't hurt to clarify the docs, you can always report
on tracker.ceph.com if you have any improvement ideas.

Quoting Tim Holloway <timh@xxxxxxxxxxxxx>:

I have been slowly migrating towards spec files as I prefer
declarative management as a rule.

However, I think that we may have a dichotomy in the user base.

On the one hand, users with dozens/hundreds of servers/drives of
basically identical character.

On the other, I'm one who's running fewer servers and for historical
reasons they tend to be wildly individualistic and often have blocks
of future-use space reserved for non-ceph storage.

Ceph, left to its own devices (no pun intended) can be quite
enthusiastic about adopting any storage it can find. And that's great
for users in the first category. Which is what the spec information
in the supplied links is emphasizing. But for us lesser creatures who
feel the need to manually control where each OSD goes and how it's
configured, it's not so simple. I'm fairly certain that there's
documentation on the spec file setup for that sort of stuff in the
online docs, but it's located somewhere else and I cannot recall
where.

At any rate I would consider it very important that the different
ways to set up an OSD should explicitly indicate which type of OSD
will be generated in their documentation.

    Tim

On 10/31/24 14:28, Eugen Block wrote:
Hi,

the preferred method to deploy OSDs in cephadm-managed clusters is
spec files; see this part of the docs [0] for more information. I
would just not use the '--all-available-devices' flag, except in
test clusters, or if you're really sure that this is what you want.

If you use 'ceph orch daemon add osd ...', you'll end up with one
(or more) OSD(s), but they will be unmanaged, as you already noted
in your own cluster. There are a couple of examples with advanced
specs (e. g. DB/WAL on dedicated devices) in the docs as well [1].
So my recommendation would be to have a suitable spec file for your
disk layout. You can always check with the '--dry-run' flag before
actually applying it:

ceph orch apply -i osd-spec.yaml --dry-run

Regards,
Eugen

[0] https://docs.ceph.com/en/latest/cephadm/services/osd/#deploy-osds
[1] https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications

Quoting Tim Holloway <timh@xxxxxxxxxxxxx>:

As I understand it, the manual OSD setup is only for legacy
(non-container) OSDs. Directory locations are wrong for managed
(containerized) OSDs, for one.

Actually, the whole manual setup docs ought to be moved out of the
mainline documentation. In their present arrangement, they make
legacy setup sound like the preferred method. And have you noticed
that there is no corresponding well-marked section titled
"Automated (cephadm) setup"?

This is how we end up with OSDs that are simultaneously legacy AND
administered for the same OSD, since at last count there are no
interlocks within Ceph to prevent such a mess.

    Tim

On 10/31/24 13:39, Dave Hall wrote:
Hello.

Sorry if it appears that I am reposting the same issue under a different
topic.  However, I feel that the problem has moved and I now have different
questions.

At this point I have, I believe, removed all traces of OSD.12 from my
cluster - based on steps in the Reef docs at
https://docs.ceph.com/en/reef/rados/operations/add-or-rm-osds/#.  I have
further located and removed the WAL/DB LV on an associated NVMe drive
(shared with 3 other OSDs).

I don't believe the instructions for replacing an OSD (ceph-volume lvm
prepare) still apply, so I have been trying to work with the instructions
under ADDING AN OSD (MANUAL).

However, since my installation is containerized (Podman), it is unclear
which steps should be issued on the host and which within 'cephadm shell'.

There is also another ambiguity:  In step 3 the instruction is to 'mkfs -t
{fstype}' and then to 'mount -o user_xattr'.  However, which fs type?

After this, in step 4, the 'ceph-osd -i {osd-id} --mkfs --mkkey' throws
errors about the keyring file.

So, are these the right instructions to be using in a containerized
installation?  Are there, in general, alternate documents for containerized
installations?

Lastly, the above cited instructions don't say anything about the separate
WAL/DB LV.

Please advise.

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdhall@xxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx