On 16-10-2018 13:58, Jan Fajerski wrote:
On Tue, Oct 16, 2018 at 01:10:02PM +0200, Willem Jan Withagen wrote:
On 16/10/2018 12:02, Jan Fajerski wrote:
On Mon, Oct 15, 2018 at 06:56:09AM -0500, Alfredo Deza wrote:
On Mon, Oct 15, 2018 at 6:48 AM Jan Fajerski <jfajerski@xxxxxxxx>
wrote:
Hi list,
while playing with ceph-volume I noticed that it adds the tag
ceph.data_device
to an lv with the name of the lv (at the time of calling prepare).
I was wondering what this specific tag is used for. From looking at
ceph-volume's code it seems it's only ever set.
Using vgrename or lvrename one can easily create an inconsistency
in this
self-reference. Restarting the OSD (or rebooting the node) still
works as
expected but I'm certainly not thinking of all cases here.
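For illustration, the stale tag is easy to produce with plain LVM
commands; roughly like this (vg/lv names here are made up):

  # show the tags ceph-volume wrote on this LV
  lvs -o lv_name,vg_name,lv_tags ceph-block-vg/osd-block-lv
  # rename the LV; ceph.data_device still carries the old name
  lvrename ceph-block-vg osd-block-lv osd-block-lv-renamed
  lvs -o lv_name,vg_name,lv_tags ceph-block-vg/osd-block-lv-renamed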
The tags are used as a key/value store in the device, and we try to
add as much info there as possible. I think you are right that
we only set it (for now), but I can see how this could get us into
trouble if we ever depended on it.
A similar issue happens with the ephemeral names of other non-lv
devices, in which case we do update them.
If this doesn't serve a specific purpose I think we shouldn't set
the tag (happy
to push a PR).
I think the right thing to do would be to make sure that we have the
right LV and update it if that changes. This would help commands like
`ceph-volume lvm list` which
displays that information.
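Updating it would boil down to something like this (vg/lv names are
made up):

  # drop the stale value and re-add one matching the current LV name
  lvchange --deltag ceph.data_device=osd-block-lv \
      ceph-block-vg/osd-block-lv-renamed
  lvchange --addtag ceph.data_device=osd-block-lv-renamed \
      ceph-block-vg/osd-block-lv-renamed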
Would it make sense to change the implementation to simply return the
lv name on the fly instead of duplicating the information in an lvm
tag and trying to keep it consistent?
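Roughly, instead of reading the name back from a tag we could ask LVM
directly; something like this (osd id just as an example):

  # the lv/vg name comes straight from LVM, so it can never go stale;
  # the remaining ceph.* tags identify which LV belongs to which OSD
  lvs --noheadings -o lv_name,vg_name,lv_tags | grep 'ceph.osd_id=10'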
As a starter:
I would consider it ill-advised to start changing these kinds of names
in the underlying storage....
Just because you can is not a reason to do so.
If it's possible, someone will do it. And I'm wondering if we can make it
so that this kind of operation can be done without any issue... what's
wrong with that?
That is why I have my BOFH hat on...
On a system as complex as Ceph, you just don't go around and change
things without a very good reason, AND knowing what you are doing.
It is just plain foot-shooting.
If you can rename an lv/vg, you can also mess up other things. And I'm
more with Alfredo that for the moment it perhaps does not serve a
purpose. But it might be useful just to restore the location after you
have renamed things.
You can also throw away PGs and/or shards that are reported corrupt?
Probably not a smart thing to do.
This is what I register with ZFS:
osd.10/osd ceph:cluster_fsid ef485af8-9c2b-11e8-a98b-0025903744dc
osd.10/osd ceph:cluster_name ceph
osd.10/osd ceph:data_device osd.10 local
osd.10/osd ceph:osd_fsid 5f490ea4-96b0-4c9e-ae92-5f8893cf6a60
osd.10/osd ceph:crush_device_class None
osd.10/osd ceph:osd_id 10
osd.10/osd ceph:type data
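Listed with something along these lines (plain ZFS user properties on
the OSD dataset):

  zfs get -o name,property,value,source all osd.10/osd | grep ceph: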
Now I could start moving the data around, but then I do not have some of
the device information: I cannot destroy the object easily and have to
start toying with zpool/zfs/gpart/geom to reconstruct the whole
dependency chain.
And note that ZFS makes it possible to replace a disk without Ceph ever
detecting it. Just attach a disk as a mirror, scrub the vdev, and then
remove the old disk from the mirror. That will get you a fresh copy of
the old disk.
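Roughly (pool name as in the listing above, disk names made up):

  # temporarily mirror the old disk onto a new one, verify, then drop it
  zpool attach osd.10 da1 da2   # da1 = old disk, da2 = new disk
  zpool scrub osd.10            # let the scrub/resilver finish
  zpool detach osd.10 da1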
And I do not know enough about LVM, but could this information be used
to restore the correct lv/vg naming after a serious loss of information
about the LVM layout?
And if you go around "just renaming" you should be knowledgeable enough
to understand that attributes need to be changed as well.
"It just works" is something I would consider a poor argument for this
case.
I'd argue the other way around. Why are we duplicating information (that
easily becomes inconsistent) that we then don't use? I'm arguing for not
writing this lvm tag, because we can just get that info on the fly for
e.g. 'ceph-volume lvm list'. We can identify the LVs that are used by
Ceph and just output the LV's name instead of relying on the tag content.
But as soon as you start moving out of the layout that was carefully
planned by ceph-volume, all bets are off.
In the ZFS case `data_device` is the linking pin between the partition
data and the physical disk, which you would need to update if you
relabel a partition in gpart. If you don't, zapping a disk becomes a
dangerous manual process.
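On FreeBSD that would be something along these lines, assuming the
property indeed holds the gpt label (label and partition index made up):

  # relabel the GPT partition, then update the property so zapping
  # can still find the right disk
  gpart modify -l osd.10.new -i 1 da1
  zfs set ceph:data_device=osd.10.new osd.10/osd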
And I can imagine something like this on LVM as well.
--WjW