Hello
I've recently upgraded my Hammer Ceph cluster running on Ubuntu 14.04 LTS servers and noticed a few issues during the upgrade. Just wanted to share my experience.
I've installed the latest Jewel release. In my opinion, some of the issues I came across relate to poor upgrade documentation, others to inconsistencies in the Ubuntu packages. Here are the issues I've picked up (I followed the upgrade procedure from the release notes):
1. Ceph journals - After performing the upgrade, the ceph-osd processes would not start. I followed the instructions and chowned /var/lib/ceph (see also point 2 below). The issue relates to the journal partitions, which are not chowned because they sit behind symlinks, so the ceph user had no read/write access to them. IMHO this should be addressed in the documentation unless it can be easily and reliably handled by the installation scripts.
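For anyone hitting the same thing, a rough sketch of the fix (the function name and the dry-run "echo" are my own; paths assume the default /var/lib/ceph/osd layout):

```shell
# Dry run: print a chown for the real partition behind each OSD
# journal symlink. Remove the "echo" to actually apply the change.
fix_journal_owners() {
    osd_root=${1:-/var/lib/ceph/osd}
    for journal in "$osd_root"/ceph-*/journal; do
        [ -e "$journal" ] || continue
        # readlink -f resolves the symlink to the underlying partition,
        # which is what "chown -R /var/lib/ceph" never touches
        echo chown ceph:ceph "$(readlink -f "$journal")"
    done
}
```

Once the printed commands look right for your journal devices, drop the "echo" and re-run it.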
2. Inefficient chown documentation - The documentation states that one should run "chown -R ceph:ceph /var/lib/ceph" if one wants ceph-osd to run as user ceph rather than root. This command chowns the OSDs one at a time. I consider mine a fairly small cluster: just 30 OSDs across 3 OSD servers. The chown takes about 60 minutes per OSD (3TB disks at about 60% usage), so around 10 hours per OSD server, which is just mad in my opinion. I can't imagine this working well at all on servers with 20-30 OSDs! IMHO the docs should instruct users to run the chown in _parallel_ across all OSDs instead of doing it one by one.
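Something along these lines is what I mean by parallel (the function name and parameters are just for illustration; on a real OSD node the owner would be ceph:ceph and the root /var/lib/ceph/osd):

```shell
# Chown every OSD data directory in parallel instead of letting one
# recursive chown walk all of them in sequence.
parallel_chown() {
    owner=$1
    root=$2
    for dir in "$root"/ceph-*; do
        [ -d "$dir" ] || continue
        chown -R "$owner" "$dir" &   # one chown per OSD, in the background
    done
    wait                             # block until every chown has finished
}
# On an OSD server: parallel_chown ceph:ceph /var/lib/ceph/osd
```

With 10 OSDs per server this should bring the wall-clock time down from roughly 10 hours to roughly 1, since the chowns are disk-bound on separate disks.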
In addition, the documentation does not mention the issues with journals at all, which I think is a big miss. In the end, I had to hack up a quick udev rule to address this at boot time, as my journal SSDs were still owned by root:disk after a reboot.
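For reference, the rule I hacked up looked roughly like this (the filename is arbitrary and "sdb1"/"sdb2" are placeholders - match the device names of your own journal partitions):

```
# /etc/udev/rules.d/90-ceph-journal.rules
# Re-own the journal partitions at every boot so ceph-osd can open them.
KERNEL=="sdb[12]", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"
```

Crude, but it survives reboots, which the one-off chown does not.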
3. Radosgw service - After the upgrade, the radosgw service was still starting as user root. Also, the start/stop/restart scripts that came with the package simply do not start the service at all. For example, "start radosgw" or "start radosgw-all-started" does nothing. I had to fall back to the old init script /etc/init.d/radosgw to get the service running, but it starts as user root and not ceph as intended in Jewel.
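A quick way to confirm which user the daemon actually ended up running as after a restart (proc_user is just a helper name I made up):

```shell
# Report the user a given pid is running as.
proc_user() {
    ps -o user= -p "$1" 2>/dev/null
}
# For the gateway: proc_user "$(pgrep -o radosgw)"
# On my boxes this still prints "root" rather than "ceph".
```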
Overall, after sorting out most of the issues, the cluster has been running okay for 2 days now. The radosgw issue still needs looking at, though.
Cheers
Andrei
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com