Re: converting legacy puppet-ceph configured OSDs to look like ceph-deployed OSDs

On 10/15/2014 4:20 PM, Dan van der Ster wrote:
Hi Ceph users,

(sorry for the novel, but perhaps this might be useful for someone)

During our current project to upgrade our cluster from disks-only to
SSD journals, we've found it useful to convert our legacy puppet-ceph
deployed cluster (using something like the enovance module) to one that
looks like its OSDs were created with ceph-disk prepare. It's been
educational for me, and I thought the experience would be worth sharing.

To start, the "old" puppet-ceph configures OSDs explicitly in
ceph.conf, like this:

[osd.211]
    host = p05151113489275
    devs = /dev/disk/by-path/pci-0000:02:00.0-sas-...-lun-0-part1

and ceph-disk list says this about the disks:

/dev/sdh :
  /dev/sdh1 other, xfs, mounted on /var/lib/ceph/osd/osd.211

In other words, ceph-disk doesn't know anything about the OSD living
on that disk.

Before deploying our SSD journals I was trying to find the best way to
map OSDs to SSD journal partitions (in puppet!), but basically there is
no good way to do this with the legacy puppet-ceph module. (What we'd
have to do is puppetize the partitioning of SSDs, then manually map OSDs
to SSD partitions. This would be tedious, and also error prone after
disk replacements and reboots).

However, I've found that by using ceph-deploy, i.e. ceph-disk, to
prepare and activate OSDs, this becomes very simple, trivial even. With
ceph-disk we keep the OSD/SSD mapping out of puppet; instead the state
is stored in the OSD itself. (1.5 years ago, when we deployed this
cluster, ceph-deploy was advertised as a quick tool to spin up small
clusters, so we didn't dare use it. I realize now that it (or the
puppet/chef/... recipes based on it) is _the_only_way_ to build a
cluster if you're starting out today.)

Now, our problem was that I couldn't just re-ceph-deploy the whole
cluster, since we've got some precious user data there. Instead, I
needed to learn how ceph-disk labels and prepares disks, and modify our
existing OSDs in place to look like they'd been prepared and activated
with ceph-disk.

In the end, I've worked out all the configuration and sgdisk magic and
put the recipes into a couple of scripts here [1]. Note that I do not
expect these to work unmodified on any other cluster. In fact, that
would be dangerous, so don't blame me if you break something. But they
might be helpful for understanding how the ceph-disk udev magic works,
and they could be a basis for upgrading other clusters.

The scripts are:

ceph-deployifier/ceph-create-journals.sh:
   - this script partitions the SSDs (assuming sda to sdd) with 5
partitions each
   - the only trick is to set the partition name to 'ceph journal' and
the typecode to the magic JOURNAL_UUID, along with a random partition
guid (see the sketch just below)
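For illustration, that boils down to roughly the following. This is a
sketch under my own assumptions, not the script itself: the 20G journal
size and the use of uuidgen are placeholders, and the type GUID is the
standard ceph-disk journal typecode (check it against your ceph-disk
version before using):

  # create 5 'ceph journal' partitions on one SSD
  JOURNAL_UUID=45b0969e-9b03-4f30-b4c6-b4b80ceff106
  DISK=/dev/sda                     # repeat for sdb..sdd
  for PART in 1 2 3 4 5; do
      sgdisk --new=${PART}:0:+20G \
             --change-name=${PART}:'ceph journal' \
             --partition-guid=${PART}:$(uuidgen) \
             --typecode=${PART}:${JOURNAL_UUID} \
             ${DISK}
  done
  partprobe ${DISK}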

ceph-deployifier/ceph-label-disks.sh:
   - this script discovers the next OSD which is not yet prepared with
ceph-disk, finds an appropriate unused journal partition, and converts
the OSD into a ceph-disk prepared lookalike
   - aside from the discovery part, the main magic (sketched below) is to:
     - create the files active, sysvinit and journal_uuid on the OSD
     - rename the partition to 'ceph data', set the typecode to the
magic OSD_UUID, and set the partition guid to the OSD's uuid
     - symlink the OSD's journal to the /dev/disk/by-partuuid/ entry,
and make the new journal
   - at the end, udev is triggered and the OSD is started (via the
ceph-disk activation magic)
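To make that concrete, here is a sketch of the conversion for a single
OSD, not the script itself. It assumes osd.211 living on /dev/sdh1
(mounted at /var/lib/ceph/osd/osd.211, as in the ceph.conf example
above) and a free journal partition /dev/sda1; the real script
discovers all of that, and the typecode is the standard ceph-disk
'ceph data' GUID:

  OSD_UUID=4fbd7e29-9d25-41b8-afd0-062c0ceff05d  # 'ceph data' typecode
  ID=211
  DATA=/var/lib/ceph/osd/osd.${ID}
  JPART=/dev/sda1                                # chosen journal partition
  JUUID=$(blkid -o value -s PARTUUID ${JPART})

  service ceph stop osd.${ID}

  # files the ceph-disk activation path expects to find on the OSD
  touch ${DATA}/active ${DATA}/sysvinit
  echo ${JUUID} > ${DATA}/journal_uuid

  # relabel the data partition as 'ceph data', using the OSD's own
  # uuid (from the fsid file) as the partition guid
  sgdisk --change-name=1:'ceph data' \
         --partition-guid=1:$(cat ${DATA}/fsid) \
         --typecode=1:${OSD_UUID} /dev/sdh

  # switch to the new journal partition and initialize it
  ceph-osd -i ${ID} --flush-journal
  rm -f ${DATA}/journal
  ln -s /dev/disk/by-partuuid/${JUUID} ${DATA}/journal
  ceph-osd -i ${ID} --mkjournal

  # trigger udev so the ceph-disk rules re-mount and start the OSD
  udevadm trigger --action=add --sysname-match=sdh1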

The complete details are of course in the scripts. (I also have
another version of ceph-label-disks.sh that doesn't expect an SSD
journal, but instead prepares the single-disk, two-partition scheme.)

After running these scripts you'll get a nice shiny ceph-disk list output:

/dev/sda :
  /dev/sda1 ceph journal, for /dev/sde1
  /dev/sda2 ceph journal, for /dev/sdf1
  /dev/sda3 ceph journal, for /dev/sdg1
...
/dev/sde :
  /dev/sde1 ceph data, active, cluster ceph, osd.2, journal /dev/sda1
/dev/sdf :
  /dev/sdf1 ceph data, active, cluster ceph, osd.8, journal /dev/sda2
/dev/sdg :
  /dev/sdg1 ceph data, active, cluster ceph, osd.12, journal /dev/sda3
...

And all of the udev magic is working perfectly. I've tested the
reboot, failed OSD, and failed SSD scenarios, and it all works as it
should. And the puppet-ceph manifest for OSDs is now just a very simple
wrapper around ceph-disk prepare. (I haven't published ours to GitHub
yet, but it is very similar to the stackforge puppet-ceph manifest.)
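For reference, that wrapper essentially just ensures something like the
following runs once per new data disk. The device names here are
placeholders of mine, and the activate step is normally handled by the
ceph-disk udev rules anyway:

  # prepare a new data disk against a pre-made journal partition
  ceph-disk prepare /dev/sdh /dev/sda1
  # (or pass the whole SSD, e.g. /dev/sda, to let ceph-disk carve out
  #  the journal partition itself)
  ceph-disk activate /dev/sdh1   # usually triggered by udev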

There you go, sorry that was so long. I hope someone finds this useful :)

Best Regards,
Dan

[1]
https://github.com/cernceph/ceph-scripts/tree/master/tools/ceph-deployifier


Dan,

Thank you for publishing this! I put some time into this very issue earlier this year, but got pulled in another direction before completing the work. I'd like to bring a production cluster deployed with mkcephfs out of the stone ages, so your work will be very useful to me.

Thanks again,
Mike Dawson



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
