Re: ceph octopus mysterious OSD crash

Yup, cephadm and orch were used to set all this up.
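
(The OSD spec quoted below was applied with something along the lines of the following; the filename is just an assumption for whatever I saved the spec as.)

  ceph orch apply -i osd_spec_default.yml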

Current state of things:

ceph osd tree shows

 33    hdd    1.84698              osd.33       destroyed         0  1.00000


cephadm logs --name osd.33 --fsid xx-xx-xx-xx

along with the systemctl output I had already seen, showed me new things, such as

ceph-osd[1645438]: did not load config file, using default settings.

ceph-osd[1645438]: 2021-03-18T14:31:32.990-0700 7f8bf14e3bc0 -1 parse_file: filesystem error: cannot get file size: No such file or directory

This suggested to me that I needed to copy /etc/ceph/ceph.conf over to the OSD node,
which I did.
I then also copied over the admin key and generated a fresh bootstrap-osd key with it, just for good measure, with
  ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
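
(For completeness, the copy steps were roughly the following; the mon hostname is a placeholder, and the paths are just the defaults.)

  scp mon-host:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
  scp mon-host:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring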



I had saved the earlier output of ceph-volume lvm list,
and on the OSD node I ran

ceph-volume lvm prepare --data xxxx --block.db xxxx

But it says the OSD is already prepared.


I tried an activate... it tells me

--> ceph-volume lvm activate successful for osd ID: 33
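
(For reference, the activate invocation was along these lines; the fsid placeholder stands in for the OSD fsid from my saved ceph-volume lvm list output.)

  ceph-volume lvm activate 33 <osd-fsid>
  # or, to pick up everything ceph-volume knows about on the node:
  ceph-volume lvm activate --all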



but now the cephadm logs output shows me


ceph-osd[1677135]: 2021-03-18T17:57:47.982-0700 7ff64593f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]



Not the best error message :-}
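
My guess, and it is only a guess, is that because osd.33 was destroyed, the cephx key the mons hold for it no longer matches the keyring on disk. So the next thing I plan to compare is:

  # on the OSD node (assuming the cephadm layout; <fsid> is the cluster fsid)
  cat /var/lib/ceph/<fsid>/osd.33/keyring
  # from a node with the admin keyring
  ceph auth get osd.33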

Now what do I need to do?





----- Original Message -----
From: "Stefan Kooman" <stefan@xxxxxx>
To: "Philip Brown" <pbrown@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxx>
Sent: Thursday, March 18, 2021 2:04:09 PM
Subject: Re:  ceph octopus mysterious OSD crash

On 3/18/21 9:28 PM, Philip Brown wrote:
> I've been banging on my ceph octopus test cluster for a few days now.
> 8 nodes. each node has 2 SSDs and 8 HDDs.
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as a db partition.
> 
> service_type: osd
> service_id: osd_spec_default
> placement:
>    host_pattern: '*'
> data_devices:
>    rotational: 1
> db_devices:
>    rotational: 0
> 
> 
> things were going pretty well, until... yesterday... I noticed TWO of the OSDs were "down".
> 
> I went to check the logs, with
> journalctl -u ceph-xxxx@xxxxxxx
> 
> all it showed were a bunch of generic debug info, and the fact that it stopped.
> and various automatic attempts to restart.
> but no indication of what was wrong, and why the restarts KEEP failing.
> 

Is this a deployment made with cephadm? It looks like it, as I see podman
messages. Are these all the log messages you can find for those OSDs?
I.e., have you tried to gather logs with cephadm logs [1]?

Gr. Stefan

[1]: 
https://docs.ceph.com/en/latest/cephadm/troubleshooting/#gathering-log-files


