Hi Alfredo,

thanks for your comments, see my answers inline.

On 01/11/2018 01:47 PM, Alfredo Deza wrote:
> On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
> <dietmar.rieder@xxxxxxxxxxx> wrote:
>> Hello,
>>
>> we have a failed OSD disk in our Luminous v12.2.2 cluster that needs
>> to get replaced.
>>
>> The cluster was initially deployed using ceph-deploy on Luminous
>> v12.2.0. The OSDs were created using
>>
>>   ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk} \
>>       --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1
>>
>> Note we separated the bluestore data, wal and db.
>>
>> We updated to Luminous v12.2.1 and further to Luminous v12.2.2.
>>
>> With the last update we also let ceph-volume take over the OSDs using
>> "ceph-volume simple scan /var/lib/ceph/osd/$osd" and "ceph-volume
>> simple activate ${osd} ${id}". All of this went smoothly.
>
> That is good to hear!
>
>>
>> Now I wonder: what is the correct way to replace a failed OSD block disk?
>>
>> The docs for luminous [1] say:
>>
>> REPLACING AN OSD
>>
>> 1. Destroy the OSD first:
>>
>>    ceph osd destroy {id} --yes-i-really-mean-it
>>
>> 2. Zap a disk for the new OSD, if the disk was used before for other
>>    purposes. It's not necessary for a new disk:
>>
>>    ceph-disk zap /dev/sdX
>>
>> 3. Prepare the disk for replacement by using the previously destroyed
>>    OSD id:
>>
>>    ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen`
>>
>> 4. And activate the OSD:
>>
>>    ceph-disk activate /dev/sdX1
>>
>> Initially this seems to be straightforward, but...
>>
>> 1. I'm not sure if there is something to do with the still existing
>> bluefs db and wal partitions on the nvme device for the failed OSD.
>> Do they have to be zapped? If yes, what is the best way? There is
>> nothing mentioned in the docs.
>
> What is your concern here if the activation seems to work?

I guess on the nvme partitions for bluefs db and bluefs wal there is
still data related to the failed OSD block device. I was thinking that
this data might "interfere" with the new replacement OSD block device,
which is empty. So you are saying that this is no concern, right?

Are they automatically reused and assigned to the replacement OSD block
device, or do I have to specify them when running ceph-disk prepare?
If I need to specify the wal and db partition, how is this done?

I'm asking this since from the logs of the initial cluster deployment I
got the following warning:

[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if block.db is not the same device as the osd data
[...]
[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if block.wal is not the same device as the osd data
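To make my question more concrete, this is roughly what I have in mind
for our failed osd.33 (new data disk /dev/sdo, existing db/wal
partitions on /dev/nvme1n1). It is only a sketch, untested, and I am
assuming that ceph-disk prepare accepts an existing partition for
--block.db/--block.wal instead of carving new ones out of the whole
nvme device:

  # the block.db/block.wal symlinks of the failed OSD show which nvme
  # partitions it used (assuming the small data partition is still
  # mountable; otherwise "ceph-disk list" shows the same mapping)
  ls -l /var/lib/ceph/osd/ceph-33/block.db /var/lib/ceph/osd/ceph-33/block.wal

  # reuse those partitions for the replacement disk
  ceph-disk prepare --bluestore /dev/sdo --osd-id 33 --osd-uuid `uuidgen` \
      --block.db /dev/nvme1n1p2 --block.wal /dev/nvme1n1p3

Would something like that work, or does prepare insist on creating new
partitions on the nvme device?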
>>
>> 2. Since we already let "ceph-volume simple" take over our OSDs I'm
>> not sure if we should now use ceph-volume or again ceph-disk
>> (followed by "ceph-volume simple" takeover) to prepare and activate
>> the OSD?
>
> The `simple` sub-command is meant to help with the activation of OSDs
> at boot time, supporting ceph-disk (or manual) created OSDs.

OK, got this...

>
> There is no requirement to use `ceph-volume lvm` which is intended for
> new OSDs using LVM as devices.

Fine...

>>
>> 3. If we should use ceph-volume, then by looking at the luminous
>> ceph-volume docs [2] I find for both,
>>
>>   ceph-volume lvm prepare
>>   ceph-volume lvm activate
>>
>> that the bluestore option is either NOT implemented or NOT supported:
>>
>>   activate: [--bluestore] filestore (IS THIS A TYPO???) objectstore (not yet implemented)
>>   prepare:  [--bluestore] Use the bluestore objectstore (not currently supported)
>
> These might be a typo on the man page, will get that addressed. Ticket
> opened at http://tracker.ceph.com/issues/22663

Thanks

> bluestore as of 12.2.2 is fully supported and it is the default. The
> --help output in ceph-volume does have the flags updated and correctly
> showing this.

OK

>>
>> So, now I'm completely lost. How is all of this fitting together in
>> order to replace a failed OSD?
>
> You would need to keep using ceph-disk. Unless you want ceph-volume to
> take over, in which case you would need to follow the steps to deploy
> a new OSD with ceph-volume.

OK

> Note that although --osd-id is supported, there is an issue with that
> on 12.2.2 that would prevent you from correctly deploying it
> http://tracker.ceph.com/issues/22642
>
> The recommendation, if you want to use ceph-volume, would be to omit
> --osd-id and let the cluster give you the ID.
>
>>
>> 4. More... after reading some recent threads on this list additional
>> questions are coming up:
>>
>> According to the OSD replacement doc [1]:
>>
>> "When disks fail, [...], OSDs need to be replaced. Unlike Removing the
>> OSD, replaced OSD's id and CRUSH map entry need to be keep [TYPO HERE?
>> keep -> kept] intact after the OSD is destroyed for replacement."
>>
>> but http://tracker.ceph.com/issues/22642 seems to say that it is not
>> possible to reuse an OSD's id
>
> That is a ceph-volume specific issue, unrelated to how replacement in
> Ceph works.

OK

>>
>> So I'm quite lost with an essential and seemingly simple storage
>> management task.
>
> You have two choices:
>
> 1) keep using ceph-disk as always, even though you have "ported" your
> OSDs with `ceph-volume simple`
> 2) Deploy new OSDs with ceph-volume
>
> For #1 you will want to keep running `simple` on newly deployed OSDs
> so that they can come up after a reboot, since `simple` disables the
> udev rules that caused activation with ceph-disk

OK, thanks so much for clarifying these things. I'll go for the
ceph-disk option then. Just to be sure, these would be the steps I
would do:

1. Destroy the failed OSD:

   ceph osd destroy osd.33 --yes-i-really-mean-it

2. Remove the failed HDD and replace it with a new HDD.

3. Prepare the new disk:

   ceph-disk prepare --bluestore /dev/sdo --osd-id osd.33

   OR do I need to specify the wal and db partitions on the nvme here,
   like Konstantin was suggesting in his answer to my question:

   3.1. Find the nvme partitions for this OSD using ceph-disk, which
        gives me:

        /dev/nvme1n1p2 ceph block.db
        /dev/nvme1n1p3 ceph block.wal

   3.2. Delete the partitions via parted or fdisk:

        fdisk -u /dev/nvme1n1
        d (delete partitions)
        enter partition number of block.db: 2
        d
        enter partition number of block.wal: 3
        w (write partition table)

   3.3. Run ceph-disk prepare:

        ceph-disk -v prepare --block.wal /dev/nvme1n1 --block.db /dev/nvme1n1 \
            --bluestore /dev/sdo --osd-id osd.33

4. Do I need to run "ceph-disk activate"?

   ceph-disk activate /dev/sdo1

   Or any of the "ceph-volume simple" commands now? Or just start the
   OSD with systemctl?

Thanks so much, and sorry for my ignorance ;-)

~Best
  Dietmar

--
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rieder@xxxxxxxxxxx
Web:   http://www.icbi.at
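P.S.: Out of curiosity, if I went with option #2 (ceph-volume) for the
replacement instead, would the equivalent be roughly the following?
This is only my guess from the --help output, nothing I have tried; in
particular I am not sure whether prepare on 12.2.2 accepts a raw device
for --data (or whether I would have to create the VG/LV myself first),
nor whether it will take the existing nvme partitions for
--block.db/--block.wal:

  # prepare the new data disk, reusing the existing nvme partitions;
  # --osd-id omitted because of http://tracker.ceph.com/issues/22642
  ceph-volume lvm prepare --bluestore --data /dev/sdo \
      --block.db /dev/nvme1n1p2 --block.wal /dev/nvme1n1p3

  # then activate with the osd id and fsid that prepare reports
  ceph-volume lvm activate {osd id} {osd fsid}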