On Thu, Feb 23, 2017 at 2:34 PM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
> Hi Greg,
>
> Appreciate you looking into it. I'm concerned about CPU power per
> daemon as well...though we never had this issue when restarting our
> dense nodes under Jewel. Is the rapid rate of OSDmap generation a
> one-time condition particular to post-update processing, or to Kraken
> in general?

I'm not aware of anything that would have changed this in Kraken, but
it's possible. Sorry I don't have more detail on this.
-Greg

>
> We did eventually get all the OSD back up, either by doing so in small
> batches or by setting nodown and waiting for the host to churn
> through...a day or so later all the OSD pop up. Now that we're in a
> stable, non-degraded state I have to do more tests to see what happens
> under Kraken when we kill a node or several nodes.
>
> I have to give ceph a lot of credit here. Following my email on the 16th,
> while we were in a marginal state with kraken OSD churning to come up,
> we lost a data center for a minute. Subsequently we had our remaining
> 2 mons refuse to stay in quorum long enough to serve cluster sessions
> (constant back-and-forth elections). I believe the issue was timeouts
> caused by explosive leveldb growth in combination with other activity,
> but eventually we got them to come back by increasing the db lease time
> in the ceph settings. We had some unfound objects at this point, but after
> waiting out all the OSD coming online with nodown/noout set, everything
> was fine. I should have been more careful in applying the update, but
> as one of our team put it, we definitely found out that Ceph is
> resilient to admins as well as other disasters.
>
> thanks,
> Ben
>
> On Thu, Feb 23, 2017 at 5:10 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Thu, Feb 16, 2017 at 9:19 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>> I tried starting up just a couple OSD with debug_osd = 20 and
>>> debug_filestore = 20.
>>>
>>> I pasted a sample of the ongoing log here. To my eyes it doesn't look
>>> unusual, but maybe someone else sees something in here that is a
>>> problem: http://pastebin.com/uy8S7hps
>>>
>>> As this log is rolling on, our OSD has still not been marked up and is
>>> occupying 100% of a CPU core. I've done this a couple of times, and in a
>>> matter of some hours it will be marked up and the CPU will drop. If more
>>> kraken OSD on another host are brought up, the existing kraken OSD go
>>> back into max CPU usage again while PGs recover. The trend scales
>>> upward as OSD are started, until the system is completely saturated.
>>>
>>> I was reading the docs on async messenger settings at
>>> http://docs.ceph.com/docs/master/rados/configuration/ms-ref/ and saw
>>> that under 'ms async max op threads' there is a note about one or more
>>> CPUs constantly at 100% load. As an experiment I set max op threads
>>> to 20, and that is the setting during the period of the pasted log. It
>>> seems to make no difference.
>>>
>>> Appreciate any thoughts on troubleshooting this. For the time being
>>> I've aborted our kraken update and will probably re-initialize any
>>> already-updated OSD to revert to Jewel, except perhaps one host to
>>> continue testing.
>>
>> Ah, that log looks like you're just generating OSDMaps so quickly that
>> rebooting 60 at a time leaves you with a ludicrous number to churn
>> through, and that takes a while. It would have been exacerbated by
>> having 60 daemons fight for the CPU to process them, leading to
>> flapping.
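
As a rough way to confirm an OSD in that state is catching up rather than
hung (a sketch only; osd.585 is just the example ID from the log further
down, and the daemon command has to run on that OSD's host), you can
compare the cluster's current OSDMap epoch against the daemon's own view
from its admin socket:

    # current OSDMap epoch as the monitors see it
    ceph osd dump | head -1

    # the booting OSD's local view; "newest_map" slowly climbing toward
    # the cluster epoch means it is still churning through old maps
    ceph daemon osd.585 status
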
>>
>> You might try restarting daemons sequentially on the node instead of
>> all at once. Depending on your needs, it would be even cheaper if you
>> set the nodown flag, though obviously that will impede IO while it
>> happens.
>>
>> I'd be concerned that this demonstrates you don't have enough CPU
>> power per daemon, though.
>> -Greg
>>
>>>
>>> thanks,
>>> Ben
>>>
>>> On Tue, Feb 14, 2017 at 3:55 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>> On Tue, Feb 14, 2017 at 11:38 AM, Benjeman Meekhof <bmeekhof@xxxxxxxxx> wrote:
>>>>> Hi all,
>>>>>
>>>>> We encountered an issue updating our OSD from Jewel (10.2.5) to Kraken
>>>>> (11.2.0). The OS was a RHEL derivative. Prior to this we updated all the
>>>>> mons to Kraken.
>>>>>
>>>>> After updating ceph packages I restarted the 60 OSD on the box with
>>>>> 'systemctl restart ceph-osd.target'. Very soon after, the system CPU
>>>>> load flat-lines at 100%, with top showing all of that being system load
>>>>> from ceph-osd processes. Not long after, we get OSD flapping due to
>>>>> the load on the system (noout was set at the start, but perhaps
>>>>> unset too quickly post-restart).
>>>>>
>>>>> This is causing problems in the cluster, and we reboot the box. The
>>>>> OSD don't start up/mount automatically - not a new problem on this
>>>>> setup. We run 'ceph-disk activate $disk' on a list of all the
>>>>> /dev/dm-X devices as output by ceph-disk list. Everything activates
>>>>> and the CPU gradually climbs to once again be a solid 100%. No OSD
>>>>> have joined the cluster, so it isn't causing issues.
>>>>>
>>>>> I leave the box overnight...by the time I leave I see that 1-2 OSD on
>>>>> this box are marked up/in. By morning all are in, CPU is fine, and the
>>>>> cluster is still fine.
>>>>>
>>>>> This is not a show-stopping issue now that I know what happens, though
>>>>> it means upgrades are a several-hour or overnight affair. On the next box
>>>>> I will just mark all the OSD out before updating and restarting them, or
>>>>> try leaving them up while being sure to set noout to avoid flapping
>>>>> while they churn.
>>>>>
>>>>> Here's a log snippet from one currently spinning in the startup
>>>>> process since 11am. This is the second box we did, the first
>>>>> experience being as detailed above. Could this have anything to do
>>>>> with the 'PGs are upgrading' message?
>>>>
>>>> It doesn't seem likely — there's a fixed per-PG overhead that doesn't
>>>> scale with the object count. I could be missing something, but I don't
>>>> see anything in the upgrade notes that should be doing this either.
>>>> Try running an upgrade with "debug osd = 20" and "debug filestore =
>>>> 20" set and see what the log spits out.
>>>> -Greg
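
For reference, two common ways to turn those debug levels up on a single
OSD look roughly like this: persistently in ceph.conf before the restart,
or injected at runtime once the daemon is responsive (osd.585 is just the
example ID from the log below):

    # in ceph.conf on the OSD host, then restart the daemon
    [osd]
        debug osd = 20
        debug filestore = 20

    # or injected at runtime, no restart needed (only works once the
    # daemon is up and answering requests)
    ceph tell osd.585 injectargs '--debug-osd 20 --debug-filestore 20'
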
>>>>
>>>>>
>>>>> 2017-02-14 11:04:07.028311 7fd7a0372940 0 _get_class not permitted to load lua
>>>>> 2017-02-14 11:04:07.077304 7fd7a0372940 0 osd.585 135493 crush map
>>>>> has features 288514119978713088, adjusting msgr requires for clients
>>>>> 2017-02-14 11:04:07.077318 7fd7a0372940 0 osd.585 135493 crush map
>>>>> has features 288514394856620032 was 8705, adjusting msgr requires for
>>>>> mons
>>>>> 2017-02-14 11:04:07.077324 7fd7a0372940 0 osd.585 135493 crush map
>>>>> has features 288514394856620032, adjusting msgr requires for osds
>>>>> 2017-02-14 11:04:09.446832 7fd7a0372940 0 osd.585 135493 load_pgs
>>>>> 2017-02-14 11:04:09.522249 7fd7a0372940 -1 osd.585 135493 PGs are upgrading
>>>>> 2017-02-14 11:04:10.246166 7fd7a0372940 0 osd.585 135493 load_pgs
>>>>> opened 148 pgs
>>>>> 2017-02-14 11:04:10.246249 7fd7a0372940 0 osd.585 135493 using 1 op
>>>>> queue with priority op cut off at 64.
>>>>> 2017-02-14 11:04:10.256299 7fd7a0372940 -1 osd.585 135493
>>>>> log_to_monitors {default=true}
>>>>> 2017-02-14 11:04:12.473450 7fd7a0372940 0 osd.585 135493 done with
>>>>> init, starting boot process
>>>>> (logs stop here, cpu spinning)
>>>>>
>>>>>
>>>>> regards,
>>>>> Ben
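
To pull the advice in this thread together, a rough sketch of the more
conservative restart sequence (flags up front, one daemon at a time) might
look like the following. It assumes the default /var/lib/ceph/osd/ceph-<id>
layout and systemd units, and keep in mind Greg's caveat that nodown stalls
IO to the affected OSDs while it is set:

    # keep the cluster from rebalancing or marking OSDs down while they churn
    ceph osd set noout
    ceph osd set nodown    # optional; IO to the affected OSDs stalls while set

    # restart the OSDs on this host one at a time, waiting for each daemon
    # to report itself active again before moving on
    for id in $(ls /var/lib/ceph/osd | sed 's/^ceph-//'); do
        systemctl restart ceph-osd@${id}
        until ceph daemon osd.${id} status 2>/dev/null | grep -q '"active"'; do
            sleep 30
        done
    done

    ceph osd unset nodown
    ceph osd unset noout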