Converting legacy puppet-ceph-configured OSDs to look like ceph-deployed OSDs

Hi Ceph users,

(sorry for the novel, but perhaps this might be useful for someone)

During our current project to upgrade our cluster from disks-only to SSD journals, we've found it useful to convert our legacy puppet-ceph-deployed cluster (using something like the enovance module) to one that looks like its OSDs were created with ceph-disk prepare. It's been educational for me, and I thought the experience would be worth sharing.

To start, the "old" puppet-ceph configures OSDs explicitly in ceph.conf, like this:

[osd.211]
   host = p05151113489275
   devs = /dev/disk/by-path/pci-0000:02:00.0-sas-...-lun-0-part1

and ceph-disk list says this about the disks:

/dev/sdh :
 /dev/sdh1 other, xfs, mounted on /var/lib/ceph/osd/osd.211

In other words, ceph-disk doesn't know anything about the OSD living on that disk.

Before deploying our SSD journals I was trying to find the best way to map OSDs to SSD journal partitions (in puppet!), but there is basically no good way to do this with the legacy puppet-ceph module. (What we'd have to do is puppetize the partitioning of the SSDs, then manually map OSDs to SSD partitions. This would be tedious, and also error-prone after disk replacements and reboots.)

However, I've found that by using ceph-deploy, i.e. ceph-disk, to prepare and activate OSDs, this becomes very simple, trivial even. Using ceph-disk we keep the OSD/SSD mapping out of puppet; instead the state is stored in the OSD itself. (1.5 years ago, when we deployed this cluster, ceph-deploy was advertised as a quick tool to spin up small clusters, so we didn't dare use it. I realize now that it (or the puppet/chef/... recipes based on it) is _the_only_way_ to build a cluster if you're starting out today.)
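
To illustrate (a minimal sketch; the device names below are just examples, not our actual layout), preparing a new OSD this way is essentially one command, and the journal mapping ends up on the disks rather than in ceph.conf:

   # prepare /dev/sde as an OSD with its journal on the SSD /dev/sda;
   # ceph-disk picks/creates a journal partition and records the mapping
   # on the OSD data partition itself, not in ceph.conf
   ceph-disk prepare --cluster ceph /dev/sde /dev/sda

   # udev normally activates it automatically; this does it by hand
   ceph-disk activate /dev/sde1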

Now our problem was that I couldn't go and re-ceph-deploy the whole cluster, since we've got some precious user data there. Instead, I needed to learn how ceph-disk is labeling and preparing disks, and modify our existing OSDs in place to look like they'd been prepared and activated with ceph-disk.

In the end, I've worked out all the configuration and sgdisk magic and put the recipes into a couple of scripts here [1]. Note that I do not expect these to work for any other cluster unmodified. In fact, that would be dangerous, so don't blame me if you break something. But they might be helpful for understanding how the ceph-disk udev magic works and could be a basis for upgrading other clusters.

The scripts are:

ceph-deployifier/ceph-create-journals.sh:
  - this script partitions SSDs (assuming sda to sdd) with 5 partitions each
  - the only trick is to add the partition name 'ceph journal' and set the typecode to the magic JOURNAL_UUID along with a random partition guid
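
For illustration, the key call is roughly the following (a sketch only: the SSD device, partition count and sizes are assumptions, and the type GUID shown is the 'ceph journal' GPT typecode that the ceph-disk udev rules match on; check the script [1] and ceph-disk itself for the authoritative values):

   # GPT type GUID for 'ceph journal' partitions (the magic JOURNAL_UUID)
   JOURNAL_UUID=45b0969e-9b03-4f30-b4c6-b4b80ceff106

   # carve /dev/sda into 5 journal partitions of 20 GB each (sizes are examples)
   for part in 1 2 3 4 5; do
       sgdisk --new=${part}:0:+20G \
              --change-name=${part}:'ceph journal' \
              --partition-guid=${part}:$(uuidgen) \
              --typecode=${part}:${JOURNAL_UUID} \
              /dev/sda
   done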

ceph-deployifier/ceph-label-disks.sh:
  - this script discovers the next OSD which is not yet prepared with ceph-disk, finds an appropriate unused journal partition, and converts the OSD into a ceph-disk-prepared lookalike.
  - aside from the discovery part, the main magic is to:
    - create the files active, sysvinit and journal_uuid on the OSD
    - rename the partition to 'ceph data', set the typecode to the magic OSD_UUID, and the partition guid to the OSD's uuid.
    - create the journal symlink pointing to the right /dev/disk/by-partuuid/ entry, and make the new journal
  - at the end, udev is triggered and the OSD is started (via the ceph-disk activation magic)
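
A rough sketch of those steps (example values only: osd.211 on /dev/sdh1 getting the journal partition /dev/sda1; the file contents, type GUID and all of the discovery logic are simplified here, the real recipe is in the script [1]):

   OSD_ID=211                                           # example legacy OSD
   OSD_DIR=/var/lib/ceph/osd/osd.${OSD_ID}
   DATA_DISK=/dev/sdh                                   # OSD data is on ${DATA_DISK}1
   OSD_UUID=$(cat ${OSD_DIR}/fsid)                      # the OSD's own uuid
   # partition GUID of the chosen (free) journal partition, here /dev/sda1
   JOURNAL_PARTUUID=$(sgdisk --info=1 /dev/sda | awk '/unique GUID/ {print tolower($NF)}')

   # (stop the OSD and flush its old journal before doing any of this)

   # the marker files that ceph-disk activation looks for
   touch ${OSD_DIR}/active ${OSD_DIR}/sysvinit
   echo ${JOURNAL_PARTUUID} > ${OSD_DIR}/journal_uuid

   # relabel the data partition so the ceph-disk udev rules recognize it;
   # 4fbd7e29-... is the 'ceph data' GPT typecode (the magic OSD_UUID)
   sgdisk --change-name=1:'ceph data' \
          --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d \
          --partition-guid=1:${OSD_UUID} \
          ${DATA_DISK}

   # point the OSD at its new journal and initialize it
   ln -sf /dev/disk/by-partuuid/${JOURNAL_PARTUUID} ${OSD_DIR}/journal
   ceph-osd -i ${OSD_ID} --mkjournal

   # re-read the partition table so udev fires and ceph-disk activates the OSD
   partprobe ${DATA_DISK}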

The complete details are of course in the scripts. (I also have another version of ceph-label-disks.sh that doesn't expect an SSD journal but instead prepares the single-disk, two-partition scheme.)

After running these scripts you'll get a nice shiny ceph-disk list output:

/dev/sda :
 /dev/sda1 ceph journal, for /dev/sde1
 /dev/sda2 ceph journal, for /dev/sdf1
 /dev/sda3 ceph journal, for /dev/sdg1
...
/dev/sde :
 /dev/sde1 ceph data, active, cluster ceph, osd.2, journal /dev/sda1
/dev/sdf :
 /dev/sdf1 ceph data, active, cluster ceph, osd.8, journal /dev/sda2
/dev/sdg :
 /dev/sdg1 ceph data, active, cluster ceph, osd.12, journal /dev/sda3
...

And all of the udev magic is working perfectly. I've tested the reboot, failed-OSD, and failed-SSD scenarios and it all works as it should. And the puppet-ceph manifest for OSDs is now just a very simple wrapper around ceph-disk prepare. (I haven't published ours to GitHub yet, but it is very similar to the stackforge puppet-ceph manifest.)

There you go, sorry that was so long. I hope someone finds this useful :)

Best Regards,
Dan

[1] https://github.com/cernceph/ceph-scripts/tree/master/tools/ceph-deployifier