Would you simply do?

 * ceph -s

On Fri, Feb 17, 2017 at 6:26 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
> As I'm looking at logs on the OSD mentioned in my previous email, I mostly
> see this message repeating... is this normal or does it indicate a problem?
> This OSD is marked up in the cluster.
>
> 2017-02-16 16:23:35.550102 7fc66fce3700 20 osd.564 152609 share_map_peer 0x7fc6887a3000 already has epoch 152609
> 2017-02-16 16:23:35.556208 7fc66f4e2700 20 osd.564 152609 share_map_peer 0x7fc689e35000 already has epoch 152609
> 2017-02-16 16:23:35.556233 7fc66f4e2700 20 osd.564 152609 share_map_peer 0x7fc689e35000 already has epoch 152609
> 2017-02-16 16:23:35.577324 7fc66fce3700 20 osd.564 152609 share_map_peer 0x7fc68f4c1000 already has epoch 152609
> 2017-02-16 16:23:35.577356 7fc6704e4700 20 osd.564 152609 share_map_peer 0x7fc68f4c1000 already has epoch 152609
>
> thanks,
> Ben
>
> On Thu, Feb 16, 2017 at 12:19 PM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>> I tried starting up just a couple of OSDs with debug_osd = 20 and
>> debug_filestore = 20.
>>
>> I pasted a sample of the ongoing log here. To my eyes it doesn't look
>> unusual, but maybe someone else sees something in it that is a problem:
>> http://pastebin.com/uy8S7hps
>>
>> While this log rolls on, our OSD has still not been marked up and is
>> occupying 100% of a CPU core. I've done this a couple of times, and after
>> some hours the OSD will be marked up and CPU usage will drop. If more
>> Kraken OSDs on another host are brought up, the existing Kraken OSDs go
>> back to maximum CPU usage while PGs recover. The trend scales upward as
>> OSDs are started, until the system is completely saturated.
>>
>> I was reading the docs on async messenger settings at
>> http://docs.ceph.com/docs/master/rados/configuration/ms-ref/ and saw that
>> under 'ms async max op threads' there is a note about one or more CPUs
>> being constantly at 100% load. As an experiment I set max op threads to
>> 20, and that was the setting in effect during the period of the pasted
>> log. It seems to make no difference.
>>
>> Appreciate any thoughts on troubleshooting this. For the time being I've
>> aborted our Kraken update and will probably re-initialize any
>> already-updated OSDs to revert them to Jewel, except perhaps one host to
>> continue testing with.
>>
>> thanks,
>> Ben
>>
>> On Tue, Feb 14, 2017 at 3:55 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>> On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>>> Hi all,
>>>>
>>>> We encountered an issue updating our OSDs from Jewel (10.2.5) to Kraken
>>>> (11.2.0). The OS is a RHEL derivative. Prior to this we updated all the
>>>> mons to Kraken.
>>>>
>>>> After updating the ceph packages I restarted the 60 OSDs on the box with
>>>> 'systemctl restart ceph-osd.target'. Very soon after, the system CPU
>>>> load flat-lines at 100%, with top showing all of that load coming from
>>>> ceph-osd processes. Not long after, we get OSDs flapping due to the
>>>> load on the system (noout was set at the start of this, but perhaps
>>>> unset too quickly after the restart).
>>>>
>>>> This is causing problems in the cluster, so we reboot the box. The OSDs
>>>> don't start up/mount automatically - not a new problem on this setup.
>>>> We run 'ceph-disk activate $disk' on a list of all the /dev/dm-X
>>>> devices as output by 'ceph-disk list'. Everything activates and the CPU
>>>> gradually climbs until it is once again a solid 100%. No OSDs have
>>>> joined the cluster yet, so it isn't causing issues.
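For reference, the re-activation step described just above might look roughly like the loop below. This is only a sketch: the awk pattern assumes 'ceph-disk list' reports each data partition on a line containing 'ceph data', and it may need adjusting for the /dev/dm-X layout in question.

  # re-activate every data partition that ceph-disk reports;
  # the 'ceph data' match is an assumption about the list output format
  for disk in $(ceph-disk list 2>/dev/null | awk '/ceph data/ {print $1}'); do
      ceph-disk activate "$disk"
  done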
>>>> I leave the box overnight... by the time I leave I see that 1-2 OSDs on
>>>> this box are marked up/in. By morning all are in, CPU is fine, and the
>>>> cluster is still fine.
>>>>
>>>> This is not a show-stopping issue now that I know what happens, though
>>>> it means upgrades are a several-hour or overnight affair. On the next
>>>> box I will either mark all the OSDs out before updating and restarting
>>>> them, or try leaving them up but be sure to set noout to avoid flapping
>>>> while they churn.
>>>>
>>>> Here's a log snippet from one OSD that has been spinning in the startup
>>>> process since 11am. This is the second box we did, the first experience
>>>> being as detailed above. Could this have anything to do with the 'PGs
>>>> are upgrading' message?
>>>
>>> It doesn't seem likely -- there's a fixed per-PG overhead that doesn't
>>> scale with the object count. I could be missing something, but I don't
>>> see anything in the upgrade notes that should be doing this either. Try
>>> running an upgrade with "debug osd = 20" and "debug filestore = 20" set
>>> and see what the log spits out.
>>> -Greg
>>>
>>>> 2017-02-14 11:04:07.028311 7fd7a0372940  0 _get_class not permitted to load lua
>>>> 2017-02-14 11:04:07.077304 7fd7a0372940  0 osd.585 135493 crush map has features 288514119978713088, adjusting msgr requires for clients
>>>> 2017-02-14 11:04:07.077318 7fd7a0372940  0 osd.585 135493 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
>>>> 2017-02-14 11:04:07.077324 7fd7a0372940  0 osd.585 135493 crush map has features 288514394856620032, adjusting msgr requires for osds
>>>> 2017-02-14 11:04:09.446832 7fd7a0372940  0 osd.585 135493 load_pgs
>>>> 2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
>>>> 2017-02-14 11:04:10.246166 7fd7a0372940  0 osd.585 135493 load_pgs opened 148 pgs
>>>> 2017-02-14 11:04:10.246249 7fd7a0372940  0 osd.585 135493 using 1 op queue with priority op cut off at 64.
>>>> 2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493 log_to_monitors {default=true}
>>>> 2017-02-14 11:04:12.473450 7fd7a0372940  0 osd.585 135493 done with init, starting boot process
>>>> (logs stop here, cpu spinning)
>>>>
>>>> regards,
>>>> Ben
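As a concrete illustration of the debug settings discussed in the thread, the levels can either be set in ceph.conf before the OSDs are started or injected into daemons that are already running; osd.585 below is just the example id from the log snippet.

  # in ceph.conf, picked up when the OSDs start
  [osd]
      debug osd = 20
      debug filestore = 20

  # or raise the levels on a running daemon without restarting it
  ceph tell osd.585 injectargs '--debug-osd 20 --debug-filestore 20'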
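The noout approach mentioned for the next box would look something like the sequence below. This is a sketch rather than a tested procedure, and the package command is illustrative; use whatever the distribution provides.

  ceph osd set noout                   # keep restarting OSDs from being marked out
  yum update ceph                      # illustrative; use the distro's package tool
  systemctl restart ceph-osd.target    # restart all OSDs on this host
  ceph -s                              # wait until the host's OSDs are up and PGs settle
  ceph osd unset noout                 # only unset once the OSDs have rejoined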