On Tue, Feb 20, 2018 at 9:05 PM, Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
> Many thanks for your replies!
>
> On 21.02.2018 at 02:20, Alfredo Deza wrote:
>> On Tue, Feb 20, 2018 at 5:56 PM, Oliver Freyermuth
>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>> Dear Cephalopodians,
>>>
>>> with the release of ceph-deploy we are thinking about migrating our
>>> Bluestore OSDs (currently created with ceph-disk via the old ceph-deploy)
>>> to be created via ceph-volume (with LVM).
>>
>> When you say migrating, do you mean creating them again from scratch,
>> or making ceph-volume take over the previously created OSDs?
>> (ceph-volume can do both.)
>
> I would recreate from scratch to switch to LVM. We have a k=4 m=2 EC pool
> with 6 hosts, so I can just take down a full host and recreate.
> But good to know both would work!
>
>>> I note two major changes:
>>> 1. It seems the block.db partitions have to be created beforehand, manually.
>>>    With ceph-disk, one should not do that - or manually set the correct PARTTYPE ID.
>>>    Will ceph-volume take care of setting the PARTTYPE on existing partitions for block.db now?
>>>    Is it not necessary anymore?
>>>    Is the config option bluestore_block_db_size now also obsoleted?
>>
>> Right, ceph-volume will not create any partitions for you, so no, it
>> will not take care of setting PARTTYPE either. If your setup requires
>> a block.db, then this must be created beforehand and then passed on to
>> ceph-volume. The one requirement, if it is a partition, is to have a
>> PARTUUID. For logical volumes it just works as-is. This is explained
>> in detail at
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#bluestore
>>
>> PARTUUID information for ceph-volume at:
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#partitioning
>
> Ok.
> So do I understand correctly that the PARTTYPE setting (i.e. those magic
> numbers as found e.g. in the ceph-disk sources in PTYPE:
> https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L62 )
> is not needed anymore for the block.db partitions, since it was effectively
> only there to make udev work?

Right, the PARTTYPE was only for udev. We need the PARTUUID to ensure that we
can always find the right device/partition (in the case of partitions only).

> I remember from ceph-disk that if I created the block.db partition
> beforehand without setting the magic PARTTYPE, it would become unhappy.
> ceph-volume and the systemd activation path should not care at all, if I
> understand this correctly.

Right again. This was part of the complex set of things that a partition had
to have in order for ceph-disk to work. A lot of users thought the partition
approach was simple enough, without being aware that a lot of extra things
were needed for those partitions to be recognized by ceph-disk.

> So in short, to create a new OSD, the steps for me would be:
> - Create the block.db partition (and don't care about PARTTYPE).
>   I only have to make sure it has a PARTUUID.
> - ceph-volume lvm create --bluestore --block.db /dev/sdag1 --data /dev/sda
>   (or the same via ceph-deploy)

That would work, yes. When you pass a whole device to --data in that example,
ceph-volume will create a whole volume group and logical volume out of that
device and use it for bluestore. That may or may not be what you want though.
With LVM you can chop that device into many pieces and use what you want.
That "shim" in ceph-volume is there to allow users that don't care about this
to just move forward with a whole device.
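As a rough sketch of the "chop it up yourself" variant (the device names, the
VG/LV names and the 100%FREE sizing below are just made-up examples, adjust to
your setup):

    # verify the pre-created block.db partition has a PARTUUID
    blkid /dev/sdag1

    # carve the data disk up with LVM yourself instead of handing
    # ceph-volume the raw device
    pvcreate /dev/sda
    vgcreate ceph-block-sda /dev/sda
    lvcreate -l 100%FREE -n osd-block-sda ceph-block-sda

    # then point ceph-volume at the LV (vg/lv notation) and the DB partition
    ceph-volume lvm create --bluestore \
        --data ceph-block-sda/osd-block-sda \
        --block.db /dev/sdag1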
Similarly, if you want to use a logical volume for --block.db, you can.
To recap: yes, your example would work, but you have a lot of other options
if you need more flexibility.

>>> 2. Activation does not work via udev anymore, which solves some racy things.
>>>
>>>    This second major change makes me curious: How does activation work now?
>>>    In the past, I could reinstall the full OS, install the ceph packages,
>>>    trigger udev / reboot, and all OSDs would come back, without storing any
>>>    state / activating any services in the OS.
>>
>> Activation works via systemd. This is explained in detail here:
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/activate
>>
>> Nothing with `ceph-volume lvm` requires udev for discovery. If you
>> need to re-install the OS and recover your OSDs, all you need to do is
>> re-activate them. You would need to know what the ID and UUID of the OSDs are.
>>
>> If you don't have that information handy, you can run:
>>
>> ceph-volume lvm list
>>
>> and all the information will be available. This will persist even across
>> system re-installs.
>
> Understood - so indeed the manual step would be to run "list" and then
> activate the OSDs one by one to re-create the service files.
> More cumbersome than letting udev do its thing, but it certainly gives more
> control, so it seems preferable.
>
> Are there plans to have something like
> "ceph-volume discover-and-activate"
> which would effectively do something like:
> ceph-volume list, and activate all OSDs which are re-discovered from the LVM metadata?

This is a good idea. I think ceph-disk had an 'activate all', and it would
make the situation you describe easier with ceph-volume. I've created
http://tracker.ceph.com/issues/23067 to follow up on this and implement it.

> This would largely simplify OS reinstalls (otherwise I'll likely write a
> small shell script to do exactly that), and as far as I understand,
> activating an already activated OSD should be harmless (it should only
> re-enable an already enabled service file).
>
>>> Does this still work?
>>> Or is there a manual step needed to restore the ceph-osd@ID-UUID services,
>>> which at first glance appear to store state (namely ID and UUID)?
>>
>> The manual step would be to call activate as described here:
>> http://docs.ceph.com/docs/master/ceph-volume/lvm/activate/#new-osds
>>
>>> If that's the case:
>>> - What is this magic manual step?
>>
>> Linked above.
>>
>>> - Is it still possible to flip two disks within the same OSD host without issues?
>>
>> What do you mean by "flip"?
>
> Sorry, I was unclear on this. I meant exchanging two hard drives with each
> other within a single OSD host, e.g. /dev/sda => /dev/sdc and
> /dev/sdc => /dev/sda (for controller weirdness or whatever reason).
> If I understand correctly, this should not be a problem at all, since the
> OSD ID and PARTUUID are unaffected by that (as you write, the LVM metadata
> will persist with the device).

We are fully resilient to non-persistent device name changes. LVM has the
ability to ensure that the LV will keep track of the correct device, and we
are capturing the PARTUUID to detect devices and storing it in the LVM
metadata. Even further: when devices have changed and the device name we
stored initially is stale, and a user runs `ceph-volume lvm list`, the scan
will update the results with the new (or updated) device name information,
and update the LVM metadata that stores that piece of information to reflect
that.

> Many thanks again for this very extensive reply!
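Regarding the small shell script idea above: until an 'activate all' exists,
a rough (untested) sketch could look like the following, assuming `jq` is
installed and that your ceph-volume version supports `--format json` for the
list subcommand:

    #!/bin/bash
    # Sketch: discover all OSDs from the LVM metadata and activate each one.
    # Re-activating an already active OSD should be harmless.
    set -euo pipefail

    ceph-volume lvm list --format json |
      jq -r 'to_entries[] | "\(.key) \(.value[0].tags["ceph.osd_fsid"])"' |
      while read -r osd_id osd_fsid; do
          echo "activating osd.${osd_id} (${osd_fsid})"
          ceph-volume lvm activate "${osd_id}" "${osd_fsid}"
      done

This is only a sketch of the idea, not necessarily what the tracker issue
above will end up implementing.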
>>> I would guess so, since the services would detect the disk in the
>>> ceph-volume trigger phase.
>>> - Is it still possible to take a disk from one OSD host, and put it in
>>>   another one, or does this now need a manual interaction?
>>>   With ceph-disk / udev, it did not, since udev triggered disk activation
>>>   and then the service was created at runtime.
>>
>> It is technically possible; the lvm part of it was built with this in
>> mind. The LVM metadata will persist with the device, so this is not a
>> problem. Just manual activation would be needed.
>>
>>> Many thanks for your help and cheers,
>>> Oliver

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com