Re: replace failed disk in Luminous v12.2.2

Hi,

Can someone comment on/confirm my planned OSD replacement procedure?

It would be very helpful for me.

Dietmar

On 11 January 2018 at 17:47:50 CET, Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx> wrote:
Hi Alfredo,

thanks for your comments, see my answers inline.

On 01/11/2018 01:47 PM, Alfredo Deza wrote:
On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
<dietmar.rieder@xxxxxxxxxxx> wrote:
Hello,

we have a failed OSD disk in our Luminous v12.2.2 cluster that needs to be replaced.

The cluster was initially deployed using ceph-deploy on Luminous
v12.2.0. The OSDs were created using

ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
--block-wal /dev/nvme0n1 --block-db /dev/nvme0n1

Note we separated the bluestore data, wal and db.

We updated to Luminous v12.2.1 and further to Luminous v12.2.2.

With the last update we also let ceph-volume take over the OSDs using
"ceph-volume simple scan /var/lib/ceph/osd/$osd" and "ceph-volume
simple activate ${osd} ${id}". All of this went smoothly.
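For one OSD this was roughly the following (the OSD id 33 and the fsid shown here are just made-up placeholders):

# scan the existing ceph-disk OSD and store its metadata as JSON under /etc/ceph/osd/
ceph-volume simple scan /var/lib/ceph/osd/ceph-33
# activate it; the two arguments are the OSD id and the OSD fsid
ceph-volume simple activate 33 0b9f2a3e-5a60-4c6a-9e1d-0123456789ab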

That is good to hear!


Now I wonder: what is the correct way to replace a failed OSD block disk?

The docs for luminous [1] say:

REPLACING AN OSD

1. Destroy the OSD first:

ceph osd destroy {id} --yes-i-really-mean-it

2. Zap a disk for the new OSD, if the disk was used before for other
purposes. It’s not necessary for a new disk:

ceph-disk zap /dev/sdX


3. Prepare the disk for replacement by using the previously destroyed
OSD id:

ceph-disk prepare --bluestore /dev/sdX --osd-id {id} --osd-uuid `uuidgen`


4. And activate the OSD:

ceph-disk activate /dev/sdX1
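Filled in with my concrete values (failed OSD id 33, replacement disk /dev/sdo), I assume that sequence would read roughly:

# 1. mark the failed OSD as destroyed, keeping its id and CRUSH entry
ceph osd destroy 33 --yes-i-really-mean-it
# 2. only needed if the replacement disk had been used before
ceph-disk zap /dev/sdo
# 3. prepare the new disk, reusing the destroyed id
ceph-disk prepare --bluestore /dev/sdo --osd-id 33 --osd-uuid $(uuidgen)
# 4. activate the data partition created in step 3
ceph-disk activate /dev/sdo1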


Initially this seems to be straightforward, but...

1. I'm not sure whether anything needs to be done about the still existing bluefs db and wal partitions on the NVMe device for the failed OSD. Do they have to be zapped? If yes, what is the best way? There is nothing mentioned in the docs.

What is your concern here if the activation seems to work?

I guess the NVMe partitions for the bluefs db and bluefs wal still hold data related to the failed OSD block device. I was thinking that this data might "interfere" with the new replacement OSD block device, which is empty.

So you are saying that this is no concern, right?
Are they automatically reused and assigned to the replacement OSD block device, or do I have to specify them when running ceph-disk prepare? If I need to specify the wal and db partitions, how is this done?
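If they do need to be specified, I imagine it would be something roughly like this (the NVMe partition names here are just my guess):

# reuse the still existing db/wal partitions on the NVMe for the replacement OSD
ceph-disk prepare --bluestore /dev/sdo --osd-id 33 \
    --block.db /dev/nvme0n1p2 --block.wal /dev/nvme0n1p3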

I'm asking this since from the logs of the initial cluster deployment I
got the following warning:

[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.db is not the same device as the osd data
[...]
[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.wal is not the same device as the osd data



2. Since we already let "ceph-volume simple" take over our OSDs, I'm not sure whether we should now use ceph-volume or again ceph-disk (followed by a "ceph-volume simple" takeover) to prepare and activate the OSD.

The `simple` sub-command is meant to help with the activation of OSDs at boot time, supporting ceph-disk (or manually) created OSDs.

OK, got this...


There is no requirement to use `ceph-volume lvm`, which is intended for new OSDs using LVM as devices.
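For a brand new OSD that path would look roughly like this (device names are placeholders; depending on the ceph-volume version, --data may need to be a pre-created logical volume):

# create (prepare + activate) a new bluestore OSD with db/wal on separate NVMe partitions
ceph-volume lvm create --bluestore --data /dev/sdo \
    --block.db /dev/nvme0n1p2 --block.wal /dev/nvme0n1p3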

Fine...


3. If we should use ceph-volume, then, looking at the Luminous ceph-volume docs [2], I find for both

ceph-volume lvm prepare
ceph-volume lvm activate

that the bluestore option is either NOT implemented or NOT supported

activate: [--bluestore] filestore (IS THIS A TYPO???) objectstore (not yet implemented)
prepare: [--bluestore] Use the bluestore objectstore (not currently supported)

Those might be typos on the man page, we will get that addressed. Ticket opened at http://tracker.ceph.com/issues/22663

Thanks

bluestore as of 12.2.2 is fully supported and is the default. The --help output in ceph-volume does have the flags updated and correctly shows this.

OK



So now I'm completely lost. How does all of this fit together in order to replace a failed OSD?

You would need to keep using ceph-disk, unless you want ceph-volume to take over, in which case you would need to follow the steps to deploy a new OSD with ceph-volume.

OK

Note that although --osd-id is supported, there is an issue with it on 12.2.2 that would prevent you from correctly deploying it: http://tracker.ceph.com/issues/22642

The recommendation, if you want to use ceph-volume, would be to omit
--osd-id and let the cluster give you the ID.
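The id that gets assigned shows up in the ceph-volume output, or can be checked afterwards, e.g.:

# the newly created OSD appears under its host with the next free id
ceph osd tree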


4. More... after reading some recent threads on this list, additional questions come up:

According to the OSD replacement doc [1] :

"When disks fail, [...], OSDs need to be replaced. Unlike Removing the
OSD, replaced OSD’s id and CRUSH map entry need to be keep [TYPO HERE?
keep -> kept] intact after the OSD is destroyed for replacement."

but
http://tracker.ceph.com/issues/22642 seems to say that it is not possible to reuse an OSD's id

That is a ceph-volume specific issue, unrelated to how replacement in
Ceph works.
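For what it is worth, a destroyed OSD keeps its id and CRUSH entry; `ceph osd tree` shows it marked as destroyed, roughly like this (illustrative output only):

ceph osd tree
# ID CLASS WEIGHT  TYPE NAME        STATUS    REWEIGHT PRI-AFF
# ...
# 33   hdd 7.27739         osd.33   destroyed        0 1.00000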

OK



So I'm quite lost with an essential, very basic, and seemingly simple storage management task.

You have two choices:

1) keep using ceph-disk as always, even though you have "ported" your
OSDs with `ceph-volume simple`
2) Deploy new OSDs with ceph-volume

For #1 you will want to keep running `simple` on newly deployed OSDs so that they can come up after a reboot, since `simple` disables the udev rules that used to trigger activation with ceph-disk.

OK, thanks so much for clarifying these things. I'll go for the ceph-disk option then.

Just to be sure, these would be the steps I would do:

1.
ceph osd destroy osd.33 --yes-i-really-mean-it

2.
remove the failed HDD and replace it with a new HDD

3.
ceph-disk prepare --bluestore /dev/sdo --osd-id osd.33

OR

do I need to specify the wal and db partitions on the NVMe here, as Konstantin suggested in his answer to my question:

3.1. Find the NVMe partitions for this OSD using ceph-disk, which gives me:

/dev/nvme1n1p2 ceph block.db
/dev/nvme1n1p3 ceph block.wal
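(I got this listing with ceph-disk, roughly:)

# list the NVMe device and the ceph roles of its partitions (output trimmed above)
ceph-disk list /dev/nvme1n1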

3.2. Delete the partitions via parted or fdisk.

fdisk -u /dev/nvme1n1
d (delete partitions)
enter partition number of block.db: 2
d
enter partition number of block.wal: 3
w (write partition table)

3.3. run ceph-disk prepare

ceph-disk -v prepare --block.wal /dev/nvme1n1 --block.db /dev/nvme1n1 \
--bluestore /dev/sdo --osd-id osd.33
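and then, I suppose, verify what was created, e.g.:

# show the new data partition and the ceph roles on the NVMe
ceph-disk list /dev/sdo /dev/nvme1n1
# once the OSD is activated/mounted, block.db and block.wal should point at the NVMe partitions
ls -l /var/lib/ceph/osd/ceph-33/block*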

4.
Do I need to run "ceph-disk activate"?

ceph-disk activate /dev/sdo1

Or any of the "ceph-volume simple" commands now?

Or just start the OSD with systemctl?
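In other words, I guess the final part would be something like this (just my guess; the id and fsid are placeholders):

# either let ceph-disk mount and start the OSD
ceph-disk activate /dev/sdo1
# then, to stay consistent with the `simple` takeover so the OSD comes up after a reboot
ceph-volume simple scan /var/lib/ceph/osd/ceph-33
ceph-volume simple activate 33 <osd-fsid>
# and/or start it directly via systemd
systemctl start ceph-osd@33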

Thanks so much, and sorry for my ignorance ;-)

~Best
Dietmar

--
This message was sent from my Android mobile phone with K-9 Mail.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
