As we wanted to verify this behavior with 15.2.10, we went ahead and
tested with a failed OSD. The drive was replaced, and we followed the
steps below (comments added for clarity on our process) - this assumes
you have a service specification that will perform deployment once it
is matched:

# capture the "db device" associated with the OSD
ceph-volume lvm list | less

# drain the drive if possible; do this when planning a replacement,
# otherwise do it once the failure has occurred
ceph orch osd rm 391 --replace

# once drained (or once the failure has occurred), remove the DB LV,
# using the "db device" path from the ceph-volume output above
lvremove /dev/ceph-blah/osd-db-blah

# monitor ceph for the replacement
ceph -W cephadm

# once the daemon has been deployed ("TIMESTAMP mgr.cephXX.XXXXX [INF]
# Deploying daemon osd.391 on cephXX"), watch for the rebalance to
# complete
ceph -s

--------------------
### consider increasing max_backfills if it's just a single drive
### replacement:
ceph config set osd osd_max_backfills 10
### if you do, remove the override after backfilling is complete:
ceph config rm osd osd_max_backfills

Following these steps, as soon as we completed the lvremove of the DB
device in question, the OSD was rebuilt, and we verified that a new
NVMe-based DB LV was created as per our specification:

service_type: osd
service_id: osd_spec_XXXXX
service_name: osd.osd_spec_XXXX
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  db_slots: 12
  filter_logic: AND
  objectstore: bluestore

Hope this helps out others in the future who need to deal with drive
replacements on cephadm/containerized deployments,
David
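P.S. For anyone who wants to script this, below is a rough sketch of the
same flow. Treat it as a sketch rather than something we run as-is: the
OSD ID and DB LV path are the placeholders from the example above, the
"ceph orch osd rm status" output format can differ between releases, and
you should still watch "ceph -W cephadm" / "ceph -s" yourself rather than
trust the loop blindly.

#!/bin/bash
# Sketch of the replacement flow above - placeholder values, adjust before use.
set -euo pipefail

OSD_ID=391                            # OSD being replaced (placeholder)
DB_LV=/dev/ceph-blah/osd-db-blah      # DB LV from 'ceph-volume lvm list' (placeholder)

# schedule removal, preserving the OSD ID for the replacement
ceph orch osd rm "$OSD_ID" --replace

# wait until the OSD leaves the removal queue (i.e. it has drained);
# the status output format may vary by release, so verify the grep
while ceph orch osd rm status | grep -qw "^${OSD_ID}"; do
    sleep 60
done

# free the old DB LV so the service spec can recreate it on the NVMe
lvremove -y "$DB_LV"

# from here, watch cephadm redeploy the OSD and the rebalance finish
ceph -W cephadm     # Ctrl-C once you see "Deploying daemon osd.$OSD_ID ..."
ceph -s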
On Fri, Mar 19, 2021 at 4:57 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
> We also ran into a scenario in which I did exactly this, and it did
> _not_ work. It created the OSD, but did not put the DB/WAL on the NVMe
> (it didn't even create an LV). I'm wondering if there's some constraint
> applied (I haven't looked at the code yet) such that when the NVMe
> already has all but the one DB on it, it may not have the minimum space
> required (even though it's plenty based on the specification).
>
> Our service specification looks like this:
>
> service_type: osd
> service_id: osd_spec_test
> placement:
>   host_pattern: '*'
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
> db_slots: 12
>
> It works fine when fed an empty machine, but I've yet to get it to
> work when I've had an OSD fail and I wipe out the LVs for the DB and
> the OSD. I'll get a new OSD, but no DB. On one of our clusters, due to
> the NVMe sizing (800GB / 745.2G usable) and 24 OSDs, the DBs (12 per
> NVMe, two NVMes per server) end up being ~62.1G, so there's about
> 62.1G free when we clear out the LV. I'm not sure why it doesn't 'do
> the right thing' and use that when spinning up the replaced OSD.
>
> I'm also curious what happens if two OSDs were to fail, you deleted
> two DBs, then added one OSD back. Would Ceph be smart enough to see
> the 12 slots per non-rotational device in the OSD specification and
> not allocate a 124.2G DB/WAL to that single OSD, preserving enough
> space for a second (for adding the second OSD back later) - assuming
> this entire process worked as designed?
>
> David
>
> On Fri, Mar 19, 2021 at 4:20 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> > I am quite sure that this case is covered by cephadm already. A few
> > months ago I tested it after a major rework of ceph-volume. I don't
> > have any links right now. But I had a lab environment with multiple
> > OSDs per node with rocksDB on SSD, and after wiping both the HDD and
> > the DB LV, cephadm automatically redeployed the OSD according to my
> > drive group file.
> >
> >
> > Zitat von Stefan Kooman <stefan@xxxxxx>:
> >
> > > On 3/19/21 7:47 PM, Philip Brown wrote:
> > >
> > > I see.
> > >
> > >> I don't think it works when 7/8 devices are already configured,
> > >> and the SSD is already mostly sliced.
> > >
> > > OK. If it is a test cluster you might just blow it all away. By
> > > doing this you are simulating an "SSD" failure taking down all the
> > > HDDs with it. It sure isn't pretty. I would say the situation you
> > > ended up with is not a corner case by any means. I am afraid I
> > > would really need to set up a test cluster with cephadm to help
> > > you further at this point, besides the suggestion above.
> > >
> > > Gr. Stefan
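(One addendum to the sizing question in my earlier message quoted above:
before removing a DB LV, you can sanity-check that the space you expect
to free really matches one db_slots slice. This is plain LVM, nothing
Ceph-specific, and "ceph-blah" is the placeholder VG name from the
lvremove example above - substitute the real VG from your ceph-volume
output.)

# total and free space on the DB device's volume group
vgs --units g -o vg_name,vg_size,vg_free ceph-blah

# sizes of the existing DB LVs; with db_slots: 12 on a ~745.2G NVMe,
# each slot works out to roughly 745.2 / 12 = ~62.1G
lvs --units g -o lv_name,lv_size ceph-blah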