Jewel to Kraken OSD upgrade issues

Benjeman Meekhof <bmeekhof@xxxxxxxxx> · Tue, 14 Feb 2017 14:38:26 -0500

Hi all,

We ran into an issue upgrading our OSDs from Jewel (10.2.5) to Kraken
(11.2.0).  The OS is a RHEL derivative.  We had already upgraded all
the mons to Kraken before this.

After updating the ceph packages I restarted the 60 OSDs on the box
with 'systemctl restart ceph-osd.target'.  Very soon afterwards CPU
usage flat-lined at 100%, with top showing all of it as system time
from the ceph-osd processes.  Not long after that, OSDs started
flapping because of the load on the system (noout was set before
starting, but was perhaps unset too quickly after the restart).
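
For reference, the restart sequence described above could be scripted
roughly as follows.  This is only a sketch: the function name is made
up for this example, and it deliberately leaves noout set so that it
is unset by hand once the OSDs have settled, rather than right after
the restart.

```shell
#!/bin/sh
# Sketch only: restart all OSDs on this host while noout is set.
# restart_osds_keep_noout is a hypothetical helper name, not a Ceph tool.
restart_osds_keep_noout() {
    # Stop the cluster from marking OSDs out while they are down.
    ceph osd set noout || return 1

    # Restart every ceph-osd unit on this host (same command as above).
    systemctl restart ceph-osd.target || return 1

    # Leave noout set on purpose; unset it by hand once the OSDs are
    # back up and the CPU load has settled.
    echo "noout is still set; run 'ceph osd unset noout' when the OSDs are up"
}
```

Calling restart_osds_keep_noout on the OSD host runs the two commands
in order and stops early if setting the flag fails.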

This was causing problems in the cluster, so we rebooted the box.  The
OSDs did not start up/mount automatically, which is not a new problem
on this setup.  We ran 'ceph-disk activate $disk' for each of the
/dev/dm-X devices listed by 'ceph-disk list'.  Everything activated
and CPU usage gradually climbed back to a solid 100%.  No OSDs had
joined the cluster at that point, so the load was not causing issues
for the cluster.
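
The activation step could be looped over a prepared device list, along
these lines.  This is a sketch under assumptions: the list of /dev/dm-X
paths is written out by hand from the 'ceph-disk list' output (one path
per line), and activate_disks and the CEPH_DISK variable are
hypothetical names that exist only in this example.

```shell
#!/bin/sh
# Sketch only: activate each device named on stdin, one path per line.
# activate_disks and CEPH_DISK are hypothetical names for this example;
# CEPH_DISK only exists so the command can be swapped out for a dry run.
activate_disks() {
    failed=0
    while IFS= read -r disk; do
        # Skip blank lines in the device list.
        [ -n "$disk" ] || continue
        # </dev/null keeps the command from consuming the device list.
        if ! "${CEPH_DISK:-ceph-disk}" activate "$disk" </dev/null; then
            echo "activate failed: $disk" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every activation succeeded.
    [ "$failed" -eq 0 ]
}
```

Usage would be something like 'activate_disks < devices.txt', where
devices.txt is the hand-prepared list.  A failed activation is reported
on stderr and the loop carries on with the remaining devices.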

I left the box overnight.  By the time I left, 1-2 OSDs on this box
were marked up/in.  By morning all were in, CPU usage was back to
normal, and the cluster was still fine.

This is not a show-stopper now that I know what to expect, though it
means each upgrade is a several-hour or overnight affair.  On the next
box I will either mark all the OSDs out before updating and restarting
them, or leave them up but make sure noout stays set to avoid flapping
while they churn.
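
The first option (marking the OSDs out beforehand) might look like the
sketch below.  mark_osds is a hypothetical helper, and the OSD ids are
assumed to be supplied by the operator; nothing here discovers which
OSDs live on the box.

```shell
#!/bin/sh
# Sketch only: mark the given OSD ids out before an upgrade, or back in
# afterwards.  mark_osds is a hypothetical helper name.
mark_osds() {
    action=$1   # "out" or "in"
    shift
    case $action in
        out|in) ;;
        *) echo "usage: mark_osds out|in ID..." >&2; return 2 ;;
    esac
    for id in "$@"; do
        # Stop at the first failure so a partial run is noticed.
        ceph osd "$action" "$id" || return 1
    done
}
```

For example, 'mark_osds out 585 586' before the package update and
'mark_osds in 585 586' once the restarted OSDs are up again.  Note that
marking OSDs out causes data to be remapped away from them, so this
trades startup churn for recovery traffic.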

Here is a log snippet from one OSD that has been spinning in the
startup process since 11am.  This is on the second box we upgraded;
the first went as described above.  Could this have anything to do
with the 'PGs are upgrading' message?

2017-02-14 11:04:07.028311 7fd7a0372940  0 _get_class not permitted to load lua
2017-02-14 11:04:07.077304 7fd7a0372940  0 osd.585 135493 crush map has features 288514119978713088, adjusting msgr requires for clients
2017-02-14 11:04:07.077318 7fd7a0372940  0 osd.585 135493 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
2017-02-14 11:04:07.077324 7fd7a0372940  0 osd.585 135493 crush map has features 288514394856620032, adjusting msgr requires for osds
2017-02-14 11:04:09.446832 7fd7a0372940  0 osd.585 135493 load_pgs
2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
2017-02-14 11:04:10.246166 7fd7a0372940  0 osd.585 135493 load_pgs opened 148 pgs
2017-02-14 11:04:10.246249 7fd7a0372940  0 osd.585 135493 using 1 op queue with priority op cut off at 64.
2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493 log_to_monitors {default=true}
2017-02-14 11:04:12.473450 7fd7a0372940  0 osd.585 135493 done with init, starting boot process
(logs stop here, cpu spinning)
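
To see which OSDs on a box logged the 'PGs are upgrading' message, a
small grep wrapper like the one below would do.  It is only a sketch:
logs_with_pg_upgrade is a hypothetical name, and the log file paths
are whatever the operator passes in.

```shell
#!/bin/sh
# Sketch only: print the OSD log files that contain the upgrade message.
# logs_with_pg_upgrade is a hypothetical helper name.
logs_with_pg_upgrade() {
    # grep -l prints just the names of files with at least one match.
    grep -l 'PGs are upgrading' "$@"
}
```

Assuming the logs are in the usual /var/log/ceph location, this could
be run as 'logs_with_pg_upgrade /var/log/ceph/ceph-osd.*.log'.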

regards,
Ben