As we wanted to verify this behavior with 15.2.10, we went ahead and
tested with a failed OSD. The drive was replaced, and we followed the
steps below (comments added for clarity on our process) - this assumes
you have a service specification that will perform deployment once it
is matched:

# capture the "db device" associated with the OSD
ceph-volume lvm list | less

# drain the drive if possible; do this when planning a replacement,
# otherwise do it once the failure has occurred
ceph orch osd rm 391 --replace

# once drained (or once the failure has occurred), remove the DB LV,
# using the "db device" path from the ceph-volume output above
lvremove /dev/ceph-blah/osd-db-blah

# monitor ceph for the replacement
ceph -W cephadm

# once the daemon has been deployed ("TIMESTAMP mgr.cephXX.XXXXX [INF]
# Deploying daemon osd.391 on cephXX"), watch for the rebalance to
# complete
ceph -s

--------------------
### consider increasing max_backfills if it's just a single drive
### replacement:
ceph config set osd osd_max_backfills 10
### if you do, remove the override after backfilling is complete:
ceph config rm osd osd_max_backfills

Following these steps, as soon as we completed the lvremove of the DB
device in question, the OSD was rebuilt, and we verified that a new
NVMe-based DB LV was created as per our specification:

service_type: osd
service_id: osd_spec_XXXXX
service_name: osd.osd_spec_XXXX
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  db_slots: 12
  filter_logic: AND
  objectstore: bluestore

Hope this helps out others in the future who need to deal with drive
replacements on cephadm/containerized deployments,
David
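P.S. For anyone who wants to script this, below is a rough sketch of the
same flow. Treat it as a sketch rather than something we run as-is: the
OSD ID and DB LV path are the placeholders from the example above, the
"ceph orch osd rm status" output format can differ between releases, and
you should still watch "ceph -W cephadm" / "ceph -s" yourself rather than
trust the loop blindly.

#!/bin/bash
# Sketch of the replacement flow above - placeholder values, adjust before use.
set -euo pipefail

OSD_ID=391                            # OSD being replaced (placeholder)
DB_LV=/dev/ceph-blah/osd-db-blah      # DB LV from 'ceph-volume lvm list' (placeholder)

# schedule removal, preserving the OSD ID for the replacement
ceph orch osd rm "$OSD_ID" --replace

# wait until the OSD leaves the removal queue (i.e. it has drained);
# the status output format may vary by release, so verify the grep
while ceph orch osd rm status | grep -qw "^${OSD_ID}"; do
    sleep 60
done

# free the old DB LV so the service spec can recreate it on the NVMe
lvremove -y "$DB_LV"

# from here, watch cephadm redeploy the OSD and the rebalance finish
ceph -W cephadm     # Ctrl-C once you see "Deploying daemon osd.$OSD_ID ..."
ceph -s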
On Fri, Mar 19, 2021 at 4:57 PM David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
> We also ran into a scenario in which I did exactly this, and it did
> _not_ work. It created the OSD, but did not put the DB/WAL on the NVMe
> (it didn't even create an LV). I'm wondering if there's some constraint
> applied (I haven't looked at the code yet) such that when the NVMe
> already has all but the one DB on it, it may not have the minimum space
> required (even though it's plenty based on the specification).
>
> Our service specification looks like this:
>
> service_type: osd
> service_id: osd_spec_test
> placement:
>   host_pattern: '*'
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
> db_slots: 12
>
> It works fine when fed an empty machine, but I've yet to get it to
> work when I've had an OSD fail and I wipe out the LVs for the DB and
> the OSD. I'll get a new OSD, but no DB. On one of our clusters, due to
> the NVMe sizing (800GB / 745.2G usable) and 24 OSDs, the DBs (12 per
> NVMe, two NVMes per server) end up being ~62.1G, so there's about
> 62.1G free when we clear out the LV. I'm not sure why it doesn't 'do
> the right thing' and use that when spinning up the replaced OSD.
>
> I'm also curious what happens if two OSDs were to fail, you deleted
> two DBs, then added one OSD back. Would Ceph be smart enough to see
> the 12 slots per non-rotational device in the OSD specification and
> not allocate a 124.2G DB/WAL to that single OSD, preserving enough
> space for a second (for adding the second OSD back later) - assuming
> this entire process worked as designed?
>
> David
>
> On Fri, Mar 19, 2021 at 4:20 PM Eugen Block <eblock@xxxxxx> wrote:
> >
> > I am quite sure that this case is covered by cephadm already. A few
> > months ago I tested it after a major rework of ceph-volume. I don't
> > have any links right now. But I had a lab environment with multiple
> > OSDs per node with rocksDB on SSD, and after wiping both the HDD and
> > the DB LV, cephadm automatically redeployed the OSD according to my
> > drive group file.
> >
> >
> > Zitat von Stefan Kooman <stefan@xxxxxx>:
> >
> > > On 3/19/21 7:47 PM, Philip Brown wrote:
> > >
> > > I see.
> > >
> > >> I don't think it works when 7/8 devices are already configured,
> > >> and the SSD is already mostly sliced.
> > >
> > > OK. If it is a test cluster you might just blow it all away. By
> > > doing this you are simulating an "SSD" failure taking down all the
> > > HDDs with it. It sure isn't pretty. I would say the situation you
> > > ended up with is not a corner case by any means. I am afraid I
> > > would really need to set up a test cluster with cephadm to help
> > > you further at this point, besides the suggestion above.
> > >
> > > Gr. Stefan
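(One addendum to the sizing question in my earlier message quoted above:
before removing a DB LV, you can sanity-check that the space you expect
to free really matches one db_slots slice. This is plain LVM, nothing
Ceph-specific, and "ceph-blah" is the placeholder VG name from the
lvremove example above - substitute the real VG from your ceph-volume
output.)

# total and free space on the DB device's volume group
vgs --units g -o vg_name,vg_size,vg_free ceph-blah

# sizes of the existing DB LVs; with db_slots: 12 on a ~745.2G NVMe,
# each slot works out to roughly 745.2 / 12 = ~62.1G
lvs --units g -o lv_name,lv_size ceph-blah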