Sure, looks as follows:

ceph -s
    cluster 24b3bf92-299d-426c-ae56-48a995014f04
     health HEALTH_ERR
            775 pgs are stuck inactive for more than 300 seconds
            1954 pgs backfill_wait
            7764 pgs degraded
            517 pgs down
            3 pgs inconsistent
            504 pgs peering
            31 pgs recovering
            5661 pgs recovery_wait
            6976 pgs stuck degraded
            775 pgs stuck inactive
            8362 pgs stuck unclean
            1851 pgs stuck undersized
            1880 pgs undersized
            110 requests are blocked > 32 sec
            recovery 2788277/17810399 objects degraded (15.655%)
            recovery 1846569/17810399 objects misplaced (10.368%)
            recovery 11442/5635366 unfound (0.203%)
            76 scrub errors
     monmap e4: 3 mons at {msu-mon01=207.73.217.13:6789/0,um-mon01=141.211.169.13:6789/0,wsu-mon01=204.39.195.13:6789/0}
            election epoch 23402, quorum 0,1,2 um-mon01,wsu-mon01,msu-mon01
      fsmap e1074: 1/1/1 up {0=wsu-mds01=up:active}, 1 up:standby-replay, 1 up:standby
        mgr active: um-mon01
     osdmap e152705: 627 osds: 475 up, 475 in; 2434 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v13524534: 20864 pgs, 25 pools, 21246 GB data, 5503 kobjects
            59604 GB used, 3397 TB / 3455 TB avail
            2788277/17810399 objects degraded (15.655%)
            1846569/17810399 objects misplaced (10.368%)
            11442/5635366 unfound (0.203%)
               12501 active+clean
                5638 active+recovery_wait+degraded
                1617 active+undersized+degraded+remapped+backfill_wait
                 436 down+remapped+peering
                 258 undersized+degraded+remapped+backfill_wait+peered
                 189 active+degraded
                  79 active+remapped+backfill_wait
                  68 down+peering
                  31 active+recovering+degraded
                  20 active+recovery_wait+degraded+remapped
                  10 down
                   4 active+degraded+remapped
                   3 down+remapped
                   3 active+undersized+degraded+remapped
                   2 active+recovery_wait+undersized+degraded+remapped
                   2 active+remapped
                   1 active+clean+inconsistent
                   1 active+recovery_wait+degraded+inconsistent
                   1 active+degraded+inconsistent

On Thu, Feb 16, 2017 at 5:08 PM, Shinobu Kinjo <skinjo@xxxxxxxxxx> wrote:
> Would you simply do?
>
>   * ceph -s
>
> On Fri, Feb 17, 2017 at 6:26 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>> As I'm looking at logs on the OSD mentioned in the previous email, at
>> this point I mostly see this message repeating. Is this normal, or
>> does it indicate a problem? This OSD is marked up in the cluster.
>>
>> 2017-02-16 16:23:35.550102 7fc66fce3700 20 osd.564 152609 share_map_peer 0x7fc6887a3000 already has epoch 152609
>> 2017-02-16 16:23:35.556208 7fc66f4e2700 20 osd.564 152609 share_map_peer 0x7fc689e35000 already has epoch 152609
>> 2017-02-16 16:23:35.556233 7fc66f4e2700 20 osd.564 152609 share_map_peer 0x7fc689e35000 already has epoch 152609
>> 2017-02-16 16:23:35.577324 7fc66fce3700 20 osd.564 152609 share_map_peer 0x7fc68f4c1000 already has epoch 152609
>> 2017-02-16 16:23:35.577356 7fc6704e4700 20 osd.564 152609 share_map_peer 0x7fc68f4c1000 already has epoch 152609
>>
>> thanks,
>> Ben
>>
>> On Thu, Feb 16, 2017 at 12:19 PM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>> I tried starting up just a couple of OSDs with debug_osd = 20 and
>>> debug_filestore = 20.
>>>
>>> I pasted a sample of the ongoing log here. To my eyes it doesn't look
>>> unusual, but maybe someone else sees something in here that is a
>>> problem: http://pastebin.com/uy8S7hps
>>>
>>> As this log rolls on, our OSD has still not been marked up and is
>>> occupying 100% of a CPU core. I've done this a couple of times, and
>>> in a matter of some hours it will be marked up and the CPU will drop.
>>> If more Kraken OSDs on another host are brought up, the existing
>>> Kraken OSDs go back to maximum CPU usage while the PGs recover.
>>> The trend scales upward as more OSDs are started, until the system is
>>> completely saturated.
>>>
>>> I was reading the docs on async messenger settings at
>>> http://docs.ceph.com/docs/master/rados/configuration/ms-ref/ and saw
>>> that under 'ms async max op threads' there is a note about one or
>>> more CPUs constantly at 100% load. As an experiment I set max op
>>> threads to 20, and that was the setting during the period of the
>>> pasted log. It seems to make no difference.
>>>
>>> Appreciate any thoughts on troubleshooting this. For the time being
>>> I've aborted our Kraken update and will probably re-initialize any
>>> already-updated OSDs to revert them to Jewel, except perhaps one host
>>> to continue testing.
>>>
>>> thanks,
>>> Ben
>>>
>>> On Tue, Feb 14, 2017 at 3:55 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>> On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>>>> Hi all,
>>>>>
>>>>> We encountered an issue updating our OSDs from Jewel (10.2.5) to
>>>>> Kraken (11.2.0). The OS was a RHEL derivative. Prior to this we had
>>>>> updated all the mons to Kraken.
>>>>>
>>>>> After updating the ceph packages I restarted the 60 OSDs on the box
>>>>> with 'systemctl restart ceph-osd.target'. Very soon after, the
>>>>> system CPU load flat-lines at 100%, with top showing all of it being
>>>>> system load from ceph-osd processes. Not long after, we get OSDs
>>>>> flapping due to the load on the system (noout was set at the start
>>>>> of this, but perhaps unset too quickly after the restart).
>>>>>
>>>>> This causes problems in the cluster, so we reboot the box. The OSDs
>>>>> don't start up/mount automatically - not a new problem on this
>>>>> setup. We run 'ceph-disk activate $disk' on a list of all the
>>>>> /dev/dm-X devices as output by 'ceph-disk list'. Everything
>>>>> activates, and the CPU gradually climbs to once again sit at a solid
>>>>> 100%. No OSDs have joined the cluster, so it isn't causing issues.
>>>>>
>>>>> I leave the box overnight. By the time I leave, I see that 1-2 OSDs
>>>>> on this box are marked up/in. By morning all are in, CPU is fine,
>>>>> and the cluster is still fine.
>>>>>
>>>>> This is not a show-stopping issue now that I know what happens,
>>>>> though it means upgrades are a several-hour or overnight affair. On
>>>>> the next box I will just mark all the OSDs out before updating and
>>>>> restarting them, or try leaving them up while being sure to set
>>>>> noout to avoid flapping while they churn.
>>>>>
>>>>> Here's a log snippet from one currently spinning in the startup
>>>>> process since 11am. This is the second box we did, the first
>>>>> experience being as detailed above. Could this have anything to do
>>>>> with the 'PGs are upgrading' message?
>>>>
>>>> It doesn't seem likely; there's a fixed per-PG overhead that doesn't
>>>> scale with the object count. I could be missing something, but I
>>>> don't see anything in the upgrade notes that should be doing this
>>>> either. Try running an upgrade with "debug osd = 20" and "debug
>>>> filestore = 20" set and see what the log spits out.
>>>> -Greg
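
For reference, the settings named above would land in the [osd] section
of ceph.conf roughly as follows. This is only a sketch of the values
mentioned in this thread (the debug levels Greg suggests for the upgrade
test, plus Ben's experimental 'ms async max op threads' value), not a
tuning recommendation:

    [osd]
        debug osd = 20
        debug filestore = 20
        ms async max op threads = 20

The debug levels can also be raised on an already-running OSD without a
restart, e.g. for the osd.585 in the log below (injectargs is the
standard mechanism; whether a booting, CPU-bound OSD answers the tell in
time is another matter):

    ceph tell osd.585 injectargs '--debug-osd 20 --debug-filestore 20'
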
>>>>
>>>>> 2017-02-14 11:04:07.028311 7fd7a0372940  0 _get_class not permitted to load lua
>>>>> 2017-02-14 11:04:07.077304 7fd7a0372940  0 osd.585 135493 crush map has features 288514119978713088, adjusting msgr requires for clients
>>>>> 2017-02-14 11:04:07.077318 7fd7a0372940  0 osd.585 135493 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
>>>>> 2017-02-14 11:04:07.077324 7fd7a0372940  0 osd.585 135493 crush map has features 288514394856620032, adjusting msgr requires for osds
>>>>> 2017-02-14 11:04:09.446832 7fd7a0372940  0 osd.585 135493 load_pgs
>>>>> 2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
>>>>> 2017-02-14 11:04:10.246166 7fd7a0372940  0 osd.585 135493 load_pgs opened 148 pgs
>>>>> 2017-02-14 11:04:10.246249 7fd7a0372940  0 osd.585 135493 using 1 op queue with priority op cut off at 64.
>>>>> 2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493 log_to_monitors {default=true}
>>>>> 2017-02-14 11:04:12.473450 7fd7a0372940  0 osd.585 135493 done with init, starting boot process
>>>>> (logs stop here, cpu spinning)
>>>>>
>>>>> regards,
>>>>> Ben

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
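
Pieced together from the thread above, the restart sequence Ben
describes looks roughly like the sketch below. It assumes the same
systemd units and ceph-disk layout discussed in the messages; the
commands are the ones named in the thread plus the standard noout
set/unset, not a validated upgrade runbook:

    # keep the mons from marking the restarting OSDs out while they churn
    ceph osd set noout

    # restart every OSD on the box after the package update
    systemctl restart ceph-osd.target

    # if the OSDs do not mount/start on their own after a reboot,
    # activate each /dev/dm-X device reported by 'ceph-disk list'
    ceph-disk activate /dev/dm-X

    # once all OSDs are back up/in and the CPU settles, clear the flag
    ceph osd unset noout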