On 3/19/21 2:20 AM, Philip Brown wrote:
> Yup, cephadm and orch were used to set all this up. Current state of things: ceph osd tree shows
>
>   33   hdd  1.84698   osd.33   destroyed   0   1.00000
^^ Destroyed, ehh, this doesn't look good to me. Ceph thinks this OSD is destroyed. Do you know what might have happened to osd.33? Did you perform a "kill an OSD" while testing?
AFAIK you can't fix that anymore. You will have to remove it and redeploy it; it might even come back with a new osd id.
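For reference, to double-check how the cluster currently sees osd.33 before tearing it down (just a quick sketch, the id is from your paste):

    ceph osd tree | grep -w osd.33     # the "destroyed" state you pasted
    ceph osd dump | grep '^osd.33'     # the destroyed flag is recorded in the OSD map, not on the disk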
> cephadm logs --name osd.33 --fsid xx-xx-xx-xx, along with the systemctl stuff I already saw, showed me new things such as:
>
>   ceph-osd[1645438]: did not load config file, using default settings.
>   ceph-osd[1645438]: 2021-03-18T14:31:32.990-0700 7f8bf14e3bc0 -1 parse_file: filesystem error: cannot get file size: No such file or directory
>
> This suggested to me that I needed to copy over /etc/ceph/ceph.conf to the OSD node, which I did. I then also copied over the admin key and, just for good measure, generated a fresh bootstrap-osd key with it:
>
>   ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
>
> I had saved the previous output of ceph-volume lvm list, and on the OSD node ran
>
>   ceph-volume lvm prepare --data xxxx --block.db xxxx
>
> but it says the OSD is already prepared. I tried an activate... it tells me
>
>   --> ceph-volume lvm activate successful for osd ID: 33
>
> but now the cephadm logs output shows me
>
>   ceph-osd[1677135]: 2021-03-18T17:57:47.982-0700 7ff64593f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
>
> Not the best error message :-}
Indeed, it would be nice to have a reference to [2]. But I think you get this because of the destroyed OSD. I would use the cephadm docs on how to replace an OSD. Does that exist? We had a large thread about this "container" topic (see "ceph-ansible in Pacific and beyond?").
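For the orchestrator route it would be roughly something like this (only a sketch, based on how I understand the cephadm docs for Octopus; host and device names are placeholders):

    # schedule removal, but keep the id reserved for the replacement
    ceph orch osd rm 33 --replace
    ceph orch osd rm status                      # wait until it is finished

    # wipe the old LVs so the device shows up as available again
    ceph orch device zap <host> /dev/sdX --force

    # then let your OSD spec / drivegroup pick it up, or add it explicitly
    ceph orch daemon add osd <host>:/dev/sdX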
> Now what do I need to do?
I would remove osd.33, even manually editing the crushmap if need be (should not be necessary), and then redeploy this OSD and wait for recovery.
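Manually that boils down to something like this (again just a sketch; purge removes the crush, auth and OSD map entries in one go, so hand-editing the crushmap should not be needed):

    ceph osd purge 33 --yes-i-really-mean-it     # removes it from crush, auth and the OSD map

    # redeploy (via the orchestrator or ceph-volume), then watch recovery
    ceph -s
    ceph osd tree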
If you have not manually "destroyed" this OSD, then either things work differently in Octopus than what I have seen so far, my memory is failing me, or some really weird stuff is happening and I would really like to know what that is.
What version are you running? Do note that 15.2.10 has been released.

Gr. Stefan
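P.S. a quick way to check and, if you want, upgrade through cephadm (a sketch; assumes the orchestrator is managing the daemons):

    ceph versions                                     # which release every daemon is actually running
    ceph orch upgrade start --ceph-version 15.2.10    # rolling upgrade driven by cephadm
    ceph orch upgrade status                          # progress / blockers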