On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
> Hi all,
>
> We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
> (11.2.0). OS was RHEL derivative. Prior to this we updated all the
> mons to Kraken.
>
> After updating ceph packages I restarted the 60 OSD on the box with
> 'systemctl restart ceph-osd.target'. Very soon after the system cpu
> load flat-lines at 100% with top showing all of that being system load
> from ceph-osd processes. Not long after we get OSD flapping due to
> the load on the system (noout was set to start this, but perhaps
> too-quickly unset post restart).
>
> This is causing problems in the cluster, and we reboot the box. The
> OSD don't start up/mount automatically - not a new problem on this
> setup. We run 'ceph-disk activate $disk' on a list of all the
> /dev/dm-X devices as output by ceph-disk list. Everything activates
> and the CPU gradually climbs to once again be a solid 100%. No OSD
> have joined cluster so it isn't causing issues.
>
> I leave the box overnight...by the time I leave I see that 1-2 OSD on
> this box are marked up/in. By morning all are in, CPU is fine,
> cluster is still fine.
>
> This is not a show-stopping issue now that I know what happens though
> it means upgrades are a several hour or overnight affair. Next box I
> will just mark all the OSD out before updating and restarting them or
> try leaving them up but being sure to set noout to avoid flapping
> while they churn.
>
> Here's a log snippet from one currently spinning in the startup
> process since 11am. This is the second box we did, the first
> experience being as detailed above. Could this have anything to do
> with the 'PGs are upgrading' message?

It doesn't seem likely; there's a fixed per-PG overhead that doesn't
scale with the object count. I could be missing something but I don't
see anything in the upgrade notes that should be doing this either.

Try running an upgrade with "debug osd = 20" and "debug filestore = 20"
set and see what the log spits out.
-Greg

>
> 2017-02-14 11:04:07.028311 7fd7a0372940 0 _get_class not permitted to load lua
> 2017-02-14 11:04:07.077304 7fd7a0372940 0 osd.585 135493 crush map has features 288514119978713088, adjusting msgr requires for clients
> 2017-02-14 11:04:07.077318 7fd7a0372940 0 osd.585 135493 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
> 2017-02-14 11:04:07.077324 7fd7a0372940 0 osd.585 135493 crush map has features 288514394856620032, adjusting msgr requires for osds
> 2017-02-14 11:04:09.446832 7fd7a0372940 0 osd.585 135493 load_pgs
> 2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
> 2017-02-14 11:04:10.246166 7fd7a0372940 0 osd.585 135493 load_pgs opened 148 pgs
> 2017-02-14 11:04:10.246249 7fd7a0372940 0 osd.585 135493 using 1 op queue with priority op cut off at 64.
> 2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493 log_to_monitors {default=true}
> 2017-02-14 11:04:12.473450 7fd7a0372940 0 osd.585 135493 done with init, starting boot process
> (logs stop here, cpu spinning)
>
>
> regards,
> Ben
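
A sketch of the workflow discussed in the thread, for reference. The
debug keys are the ones Greg names; the noout steps and the awk pattern
for parsing 'ceph-disk list' output are assumptions on my part and may
need adjusting for a dm-crypt/multipath layout like the one Ben
describes.

    # keep the cluster from marking OSDs out while they churn through startup
    ceph osd set noout

    # turn up logging before the restart, e.g. in ceph.conf on the OSD host:
    #   [osd]
    #   debug osd = 20
    #   debug filestore = 20
    # or, for daemons that are already up and running:
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20'

    # upgrade packages, then restart all OSDs on the box
    systemctl restart ceph-osd.target

    # if the OSDs don't mount/start on their own, re-activate the data
    # partitions; the awk field assumes stock 'ceph-disk list' output
    ceph-disk list | awk '/ceph data/ {print $1}' | while read dev; do
        ceph-disk activate "$dev"
    done

    # once everything is back up and in, clear the flag
    ceph osd unset noout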