On 16-10-2018 13:58, Jan Fajerski wrote:
On Tue, Oct 16, 2018 at 01:10:02PM +0200, Willem Jan Withagen wrote:
On 16/10/2018 12:02, Jan Fajerski wrote:
On Mon, Oct 15, 2018 at 06:56:09AM -0500, Alfredo Deza wrote:
On Mon, Oct 15, 2018 at 6:48 AM Jan Fajerski <jfajerski@xxxxxxxx>
wrote:
Hi list,
while playing with ceph-volume I noticed that it adds the tag
ceph.data_device
to an lv with the name of the lv (at the time of calling prepare).
I was wondering what this specific tag is used for. From looking at
ceph-volume's code it seems it's only ever set.
Using vgrename or lvrename one can easily create an inconsistency
in this
self-reference. Restarting the OSD (or rebooting the node) still
works as
expected but I'm certainly not thinking of all cases here.
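For illustration, the stale tag is easy to produce with plain LVM
commands; roughly like this (vg/lv names here are made up):

  # show the tags ceph-volume wrote on this LV
  lvs -o lv_name,vg_name,lv_tags ceph-block-vg/osd-block-lv
  # rename the LV; ceph.data_device still carries the old name
  lvrename ceph-block-vg osd-block-lv osd-block-lv-renamed
  lvs -o lv_name,vg_name,lv_tags ceph-block-vg/osd-block-lv-renamed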
The tags are used as a key/value store in the device, and we try to
add as much info there as possible. I think you are right that
we only set it (for now), but I can see how this could get us into
trouble if we ever depended on it.
A similar issue happens with the ephemeral names of other non-lv
devices, in which case we do update them.
If this doesn't serve a specific purpose I think we shouldn't set
the tag (happy
to push a PR).
I think the right thing to do would be to make sure that we have the
right LV and update it if that changes. This would help commands like
`ceph-volume lvm list` which
displays that information.
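Updating it would boil down to something like this (vg/lv names are
made up):

  # drop the stale value and re-add one matching the current LV name
  lvchange --deltag ceph.data_device=osd-block-lv \
      ceph-block-vg/osd-block-lv-renamed
  lvchange --addtag ceph.data_device=osd-block-lv-renamed \
      ceph-block-vg/osd-block-lv-renamed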
Would it make sense to change the implementation to simply return the
lv name on the fly instead of duplicating the information in an lvm
tag and trying to keep it consistent?
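Roughly, instead of reading the name back from a tag we could ask LVM
directly; something like this (osd id just as an example):

  # the lv/vg name comes straight from LVM, so it can never go stale;
  # the remaining ceph.* tags identify which LV belongs to which OSD
  lvs --noheadings -o lv_name,vg_name,lv_tags | grep 'ceph.osd_id=10'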
As a starter:
I would consider it ill-advised to start changing these kinds of names
in the underlying storage....
Just because you can is not a reason to do so.
If it's possible, someone will do it. And I'm wondering if we can make it
so that this kind of operation can be done without any issue... what's
wrong with that?
That is why I have my BOFH hat on...
On a system as complex as Ceph, you just don't go around and change
things without a very good reason, AND knowing what you are doing.
It is just plain foot-shooting.
If you can rename an lv/vg, you can also mess up other things. And I'm
more with Alfredo that for the moment it perhaps does not serve a
purpose. But it might be useful just to restore the location after you
have renamed things.
You can also throw away PGs and/or shards that are reported corrupt?
Probably not a smart thing to do.
This is what I register with ZFS:
osd.10/osd ceph:cluster_fsid ef485af8-9c2b-11e8-a98b-0025903744dc
osd.10/osd ceph:cluster_name ceph
osd.10/osd ceph:data_device osd.10 local
osd.10/osd ceph:osd_fsid 5f490ea4-96b0-4c9e-ae92-5f8893cf6a60
osd.10/osd ceph:crush_device_class None
osd.10/osd ceph:osd_id 10
osd.10/osd ceph:type data
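Listed with something along these lines (plain ZFS user properties on
the OSD dataset):

  zfs get -o name,property,value,source all osd.10/osd | grep ceph: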
Now I could start moving the data around, but then I do not have some of
the device information: I cannot destroy the object easily and have to
start toying with zpool/zfs/gpart/geom to reconstruct the whole
dependency chain.
And note that ZFS makes it possible to replace a disk without Ceph ever
detecting it. Just attach a disk as a mirror, scrub the vdev, and then
remove the old disk from the mirror. That will get you a fresh copy of
the old disk.
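Roughly (pool name as in the listing above, disk names made up):

  # temporarily mirror the old disk onto a new one, verify, then drop it
  zpool attach osd.10 da1 da2   # da1 = old disk, da2 = new disk
  zpool scrub osd.10            # let the scrub/resilver finish
  zpool detach osd.10 da1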
And I do not know enough about LVM, but could this information be used
to restore the correct lv/vg naming after a serious loss of information
about the LVM layout?
And if you go around "just renaming" you should be knowledgeable enough
to understand that attributes need to be changed as well.
"It just works" is something I would consider a poor argument for this
case.
I'd argue the other way around. Why are we duplicating information (that
easily becomes inconsistent) that we then don't use? I'm arguing for not
writing this lvm tag, because we can just get that info on the fly for
e.g. 'ceph-volume lvm list'. We can identify the LVs that are used by
Ceph and just output the LV's name instead of relying on the tag content.
But as soon as you start moving out of the layout that was carefully
planned by ceph-volume, all bets are off.
In the ZFS case `data_device` is the linking pin between the partition
data and the physical disk, which you would need to update if you
relabel a partition in gpart. If you don't, zapping a disk becomes a
dangerous manual process.
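On FreeBSD that would be something along these lines, assuming the
property indeed holds the gpt label (label and partition index made up):

  # relabel the GPT partition, then update the property so zapping
  # can still find the right disk
  gpart modify -l osd.10.new -i 1 da1
  zfs set ceph:data_device=osd.10.new osd.10/osd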
And I can imagine something like this on LVM as well.
--WjW