Re: ceph octopus mysterious OSD crash

We also ran into a scenario in which I did exactly this, and it did
_not_ work. It created the OSD, but did not put the DB/WAL on the NVMe
(it didn't even create an LV). I'm wondering if there's some constraint
applied (I haven't looked at the code yet) such that when the NVMe
already holds all but one of the DBs, the remaining free space falls
below some minimum required size, even though it's plenty according to
the specification.
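
When this happens to us, the first thing I check is whether the
orchestrator still sees the NVMe at all and how much space is actually
left in its VG. These are just generic checks (substitute your own
hostname), nothing specific to our setup:

  ceph orch device ls <hostname>
  # and on the OSD host itself (via 'cephadm shell' if needed):
  vgs
  lvs -o lv_name,vg_name,lv_size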

Our service specification looks like this:

service_type: osd
service_id: osd_spec_test
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
db_slots: 12
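
In case it helps anyone reproduce this: the way I've been previewing
what cephadm would do with that spec is roughly the following (the file
name is just what I call it locally, and the --dry-run preview may
depend on your Octopus point release):

  ceph orch apply osd -i osd_spec_test.yml --dry-run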

It works fine when fed an empty machine, but I've yet to get it to
work after an OSD failure, where I wipe out both the OSD and its DB LV.
I get a new OSD, but no DB. On one of our clusters, given the NVMe
sizing (800GB, 745.2G usable) and 24 OSDs, the DBs (12 per NVMe, two
NVMes per server) end up being ~62.1G each, so there's about 62.1G free
once we clear out the LV. I'm not sure why it doesn't 'do the right
thing' and use that space when spinning up the replacement OSD.
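
For reference, the replacement sequence I've been using looks roughly
like this (the OSD id and VG/LV names below are placeholders from our
setup, not anything cephadm prints):

  ceph orch osd rm 17 --replace
  # once the removal finishes, on the OSD host, clear the old DB LV:
  ceph-volume lvm zap --destroy /dev/ceph-db-vg/db-17
  # at this point the spec above should, in theory, redeploy the OSD
  # with a new ~62.1G DB LV on the NVMe - but it only recreates the OSD.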

I'm also curious what would happen if two OSDs failed, you deleted
both DBs, and then added one OSD back. Would Ceph be smart enough to
see the 12 slots per non-rotational device in the OSD specification and
not allocate a 124.2G DB/WAL to that single OSD, preserving enough
space for a second DB (for when the second OSD is added back later) -
assuming this whole process worked as designed?
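
Spelling out the arithmetic I'm assuming there (our numbers, worth
double-checking):

  745.2G usable / 12 db_slots  ~= 62.1G per DB
  2 slots freed                -> ~124.2G free on the NVMe
  hoped-for allocation          = 62.1G for the one new OSD,
                                  leaving 62.1G for the second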

David

On Fri, Mar 19, 2021 at 4:20 PM Eugen Block <eblock@xxxxxx> wrote:
>
> I am quite sure that this case is covered by cephadm already. A few
> months ago I tested it after a major rework of ceph-volume. I don’t
> have any links right now. But I had a lab environment with multiple
> OSDs per node with rocksDB on SSD and after wiping both HDD and DB LV
> cephadm automatically redeployed the OSD according to my drive group
> file.
>
>
> Zitat von Stefan Kooman <stefan@xxxxxx>:
>
> > On 3/19/21 7:47 PM, Philip Brown wrote:
> >
> > I see.
> >
> >>
> >> I dont think it works when 7/8 devices are already configured, and
> >> the SSD is already mostly sliced.
> >
> > OK. If it is a test cluster you might just blow it all away. By
> > doing this you are simulating a "SSD" failure taking down all HDDs
> > with it. It sure isn't pretty. I would say the situation you ended
> > up with is not a corner case by any means. I am afraid I would
> > really need to set up a test cluster with cephadm to help you
> > further at this point, besides the suggestion above.
> >
> > Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



