On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
> Hi all,
>
> We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
> (11.2.0). OS was RHEL derivative. Prior to this we updated all the
> mons to Kraken.
>
> After updating ceph packages I restarted the 60 OSD on the box with
> 'systemctl restart ceph-osd.target'. Very soon after the system cpu
> load flat-lines at 100% with top showing all of that being system load
> from ceph-osd processes. Not long after we get OSD flapping due to
> the load on the system (noout was set to start this, but perhaps
> too-quickly unset post restart).
>
> This is causing problems in the cluster, and we reboot the box. The
> OSD don't start up/mount automatically - not a new problem on this
> setup. We run 'ceph-disk activate $disk' on a list of all the
> /dev/dm-X devices as output by ceph-disk list. Everything activates
> and the CPU gradually climbs to once again be a solid 100%. No OSD
> have joined cluster so it isn't causing issues.
>
> I leave the box overnight...by the time I leave I see that 1-2 OSD on
> this box are marked up/in. By morning all are in, CPU is fine,
> cluster is still fine.
>
> This is not a show-stopping issue now that I know what happens though
> it means upgrades are a several hour or overnight affair. Next box I
> will just mark all the OSD out before updating and restarting them or
> try leaving them up but being sure to set noout to avoid flapping
> while they churn.
>
> Here's a log snippet from one currently spinning in the startup
> process since 11am. This is the second box we did, the first
> experience being as detailed above. Could this have anything to do
> with the 'PGs are upgrading' message?

It doesn't seem likely; there's a fixed per-PG overhead that doesn't
scale with the object count. I could be missing something but I don't
see anything in the upgrade notes that should be doing this either.

Try running an upgrade with "debug osd = 20" and "debug filestore = 20"
set and see what the log spits out.
-Greg

>
> 2017-02-14 11:04:07.028311 7fd7a0372940 0 _get_class not permitted to load lua
> 2017-02-14 11:04:07.077304 7fd7a0372940 0 osd.585 135493 crush map has features 288514119978713088, adjusting msgr requires for clients
> 2017-02-14 11:04:07.077318 7fd7a0372940 0 osd.585 135493 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
> 2017-02-14 11:04:07.077324 7fd7a0372940 0 osd.585 135493 crush map has features 288514394856620032, adjusting msgr requires for osds
> 2017-02-14 11:04:09.446832 7fd7a0372940 0 osd.585 135493 load_pgs
> 2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
> 2017-02-14 11:04:10.246166 7fd7a0372940 0 osd.585 135493 load_pgs opened 148 pgs
> 2017-02-14 11:04:10.246249 7fd7a0372940 0 osd.585 135493 using 1 op queue with priority op cut off at 64.
> 2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493 log_to_monitors {default=true}
> 2017-02-14 11:04:12.473450 7fd7a0372940 0 osd.585 135493 done with init, starting boot process
> (logs stop here, cpu spinning)
>
>
> regards,
> Ben
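
A sketch of the workflow discussed in the thread, for reference. The
debug keys are the ones Greg names; the noout steps and the awk pattern
for parsing 'ceph-disk list' output are assumptions on my part and may
need adjusting for a dm-crypt/multipath layout like the one Ben
describes.

    # keep the cluster from marking OSDs out while they churn through startup
    ceph osd set noout

    # turn up logging before the restart, e.g. in ceph.conf on the OSD host:
    #   [osd]
    #   debug osd = 20
    #   debug filestore = 20
    # or, for daemons that are already up and running:
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20'

    # upgrade packages, then restart all OSDs on the box
    systemctl restart ceph-osd.target

    # if the OSDs don't mount/start on their own, re-activate the data
    # partitions; the awk field assumes stock 'ceph-disk list' output
    ceph-disk list | awk '/ceph data/ {print $1}' | while read dev; do
        ceph-disk activate "$dev"
    done

    # once everything is back up and in, clear the flag
    ceph osd unset noout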