Re: replace failed disk in Luminous v12.2.2

Hi,

Can someone comment on/confirm my planned OSD replacement procedure?

It would be very helpful for me.

Dietmar

On 11 January 2018 at 17:47:50 CET, Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx> wrote:
Hi Alfredo,

thanks for your comments, see my answers inline.

On 01/11/2018 01:47 PM, Alfredo Deza wrote:
On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
<dietmar.rieder@xxxxxxxxxxx> wrote:
Hello,

we have a failed OSD disk in our Luminous v12.2.2 cluster that needs to be replaced.

The cluster was initially deployed using ceph-deploy on Luminous
v12.2.0. The OSDs were created using

ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
--block-wal /dev/nvme0n1 --block-db /dev/nvme0n1

Note we separated the bluestore data, wal and db.

We updated to Luminous v12.2.1 and further to Luminous v12.2.2.

With the last update we also let ceph-volume take over the OSDs using
"ceph-volume simple scan /var/lib/ceph/osd/$osd" and "ceph-volume
simple activate ${osd} ${id}". All of this went smoothly.
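For one OSD this was roughly the following (the OSD id 33 and the fsid shown here are just made-up placeholders):

# scan the existing ceph-disk OSD and store its metadata as JSON under /etc/ceph/osd/
ceph-volume simple scan /var/lib/ceph/osd/ceph-33
# activate it; the two arguments are the OSD id and the OSD fsid
ceph-volume simple activate 33 0b9f2a3e-5a60-4c6a-9e1d-0123456789ab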

That is good to hear!


Now I wonder: what is the correct way to replace a failed OSD block disk?

The docs for luminous [1] say:

REPLACING AN OSD

1. Destroy the OSD first:

ceph osd destroy {id} --yes-i-really-mean-it

2. Zap a disk for the new OSD, if the disk was used before for other
purposes. It’s not necessary for a new disk:

ceph-disk zap /dev/sdX


3. Prepare the disk for replacement by using the previously destroyed
OSD id:

ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen`


4. And activate the OSD:

ceph-disk activate /dev/sdX1
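Filled in with my concrete values (failed OSD id 33, replacement disk /dev/sdo), I assume that sequence would read roughly:

# 1. mark the failed OSD as destroyed, keeping its id and CRUSH entry
ceph osd destroy 33 --yes-i-really-mean-it
# 2. only needed if the replacement disk had been used before
ceph-disk zap /dev/sdo
# 3. prepare the new disk, reusing the destroyed id
ceph-disk prepare --bluestore /dev/sdo --osd-id 33 --osd-uuid $(uuidgen)
# 4. activate the data partition created in step 3
ceph-disk activate /dev/sdo1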


Initially this seems to be straightforward, but...

1. I'm not sure whether anything needs to be done about the still existing bluefs db and wal partitions on the NVMe device for the failed OSD. Do they have to be zapped? If yes, what is the best way? There is nothing mentioned in the docs.

What is your concern here if the activation seems to work?

I guess the NVMe partitions for the bluefs db and bluefs wal still hold data related to the failed OSD block device. I was thinking that this data might "interfere" with the new replacement OSD block device, which is empty.

So you are saying that this is no concern, right?
Are they automatically reused and assigned to the replacement OSD block device, or do I have to specify them when running ceph-disk prepare? If I need to specify the wal and db partitions, how is this done?
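If they do need to be specified, I imagine it would be something roughly like this (the NVMe partition names here are just my guess):

# reuse the still existing db/wal partitions on the NVMe for the replacement OSD
ceph-disk prepare --bluestore /dev/sdo --osd-id 33 \
    --block.db /dev/nvme0n1p2 --block.wal /dev/nvme0n1p3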

I'm asking this since from the logs of the initial cluster deployment I
got the following warning:

[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.db is not the same device as the osd data
[...]
[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.wal is not the same device as the osd data



2. Since we already let "ceph-volume simple" take over our OSDs, I'm not sure whether we should now use ceph-volume or again ceph-disk (followed by a "ceph-volume simple" takeover) to prepare and activate the OSD.

The `simple` sub-command is meant to help with the activation of OSDs at boot time, supporting ceph-disk (or manually) created OSDs.

OK, got this...


There is no requirement to use `ceph-volume lvm`, which is intended for new OSDs using LVM as devices.
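For a brand new OSD that path would look roughly like this (device names are placeholders; depending on the ceph-volume version, --data may need to be a pre-created logical volume):

# create (prepare + activate) a new bluestore OSD with db/wal on separate NVMe partitions
ceph-volume lvm create --bluestore --data /dev/sdo \
    --block.db /dev/nvme0n1p2 --block.wal /dev/nvme0n1p3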

Fine...


3. If we should use ceph-volume, then, looking at the Luminous ceph-volume docs [2], I find for both

ceph-volume lvm prepare
ceph-volume lvm activate

that the bluestore option is either NOT implemented or NOT supported

activate: [--bluestore] filestore (IS THIS A TYPO???) objectstore (not yet implemented)
prepare: [--bluestore] Use the bluestore objectstore (not currently supported)

Those might be typos on the man page, we will get that addressed. Ticket opened at http://tracker.ceph.com/issues/22663

Thanks

bluestore as of 12.2.2 is fully supported and is the default. The --help output in ceph-volume does have the flags updated and correctly shows this.

OK



So now I'm completely lost. How does all of this fit together in order to replace a failed OSD?

You would need to keep using ceph-disk, unless you want ceph-volume to take over, in which case you would need to follow the steps to deploy a new OSD with ceph-volume.

OK

Note that although --osd-id is supported, there is an issue with it on 12.2.2 that would prevent you from correctly deploying it: http://tracker.ceph.com/issues/22642

The recommendation, if you want to use ceph-volume, would be to omit
--osd-id and let the cluster give you the ID.
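The id that gets assigned shows up in the ceph-volume output, or can be checked afterwards, e.g.:

# the newly created OSD appears under its host with the next free id
ceph osd tree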


4. More... after reading some recent threads on this list, additional questions come up:

According to the OSD replacement doc [1] :

"When disks fail, [...], OSDs need to be replaced. Unlike Removing the
OSD, replaced OSD’s id and CRUSH map entry need to be keep [TYPO HERE?
keep -> kept] intact after the OSD is destroyed for replacement."

but
http://tracker.ceph.com/issues/22642 seems to say that it is not possible to reuse an OSD's id

That is a ceph-volume specific issue, unrelated to how replacement in
Ceph works.
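For what it is worth, a destroyed OSD keeps its id and CRUSH entry; `ceph osd tree` shows it marked as destroyed, roughly like this (illustrative output only):

ceph osd tree
# ID CLASS WEIGHT  TYPE NAME        STATUS    REWEIGHT PRI-AFF
# ...
# 33   hdd 7.27739         osd.33   destroyed        0 1.00000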

OK



So I'm quite lost with an essential, very basic, and seemingly simple storage management task.

You have two choices:

1) keep using ceph-disk as always, even though you have "ported" your
OSDs with `ceph-volume simple`
2) Deploy new OSDs with ceph-volume

For #1 you will want to keep running `simple` on newly deployed OSDs so that they can come up after a reboot, since `simple` disables the udev rules that used to trigger activation with ceph-disk.

OK, thanks so much for clarifying these things. I'll go for the ceph-disk option then.

Just to be sure, these would be the steps I would do:

1.
ceph osd destroy osd.33 --yes-i-really-mean-it

2.
remove the failed HDD and replace it with a new HDD

3.
ceph-disk prepare --bluestore /dev/sdo --osd-id osd.33

OR

do I need to specify the wal and db partitions on the NVMe here, as Konstantin suggested in his answer to my question:

3.1. Find the NVMe partitions for this OSD using ceph-disk, which gives me:

/dev/nvme1n1p2 ceph block.db
/dev/nvme1n1p3 ceph block.wal
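(I got this listing with ceph-disk, roughly:)

# list the NVMe device and the ceph roles of its partitions (output trimmed above)
ceph-disk list /dev/nvme1n1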

3.2. Delete the partitions via parted or fdisk.

fdisk -u /dev/nvme1n1
d (delete partitions)
enter partition number of block.db: 2
d
enter partition number of block.wal: 3
w (write partition table)

3.3. run ceph-disk prepare

ceph-disk -v prepare --block.wal /dev/nvme1n1 --block.db /dev/nvme1n1 \
--bluestore /dev/sdo --osd-id osd.33
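and then, I suppose, verify what was created, e.g.:

# show the new data partition and the ceph roles on the NVMe
ceph-disk list /dev/sdo /dev/nvme1n1
# once the OSD is activated/mounted, block.db and block.wal should point at the NVMe partitions
ls -l /var/lib/ceph/osd/ceph-33/block*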

4.
Do I need to run "ceph-disk activate"?

ceph-disk activate /dev/sdo1

Or any of the "ceph-volume simple" commands now?

Or just start the OSD with systemctl?
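In other words, I guess the final part would be something like this (just my guess; the id and fsid are placeholders):

# either let ceph-disk mount and start the OSD
ceph-disk activate /dev/sdo1
# then, to stay consistent with the `simple` takeover so the OSD comes up after a reboot
ceph-volume simple scan /var/lib/ceph/osd/ceph-33
ceph-volume simple activate 33 <osd-fsid>
# and/or start it directly via systemd
systemctl start ceph-osd@33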

Thanks so much, and sorry for my ignorance ;-)

~Best
Dietmar

--
This message was sent from my Android mobile phone with K-9 Mail.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
