> On Dec 2, 2016, at 10:48, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Fri, 2 Dec 2016, Dan Jakubiec wrote:
>> For what it's worth... this sounds like the condition we hit when we
>> re-enabled scrub on our 16 OSDs (after 6 to 8 weeks of noscrub). They
>> flapped for about 30 minutes as most of the OSDs randomly hit suicide
>> timeouts here and there.
>>
>> This settled down after about an hour and the OSDs stopped dying. We
>> have since left scrub enabled for about 4 days and have only seen three
>> small spurts of OSD flapping since then (which quickly resolved
>> themselves).
>
> Yeah. I think what's happening is that with a cold cache it is slow
> enough to suicide, but with a warm cache it manages to complete (although
> I bet it's still stalling other client IO for perhaps multiple seconds).
> I would leave noscrub set for now.

Ah... thanks for the suggestion! We are indeed working through some jerky
performance issues. Perhaps this is a layer of that onion. Thank you.

-- Dan

> sage
>
>> -- Dan
>>
>>> On Dec 1, 2016, at 14:38, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi Yoann,
>>>
>>> Thank you for your input. I was just told by RH support that it's going
>>> to make it into RHCS 2.0 (10.2.3). Thank you guys for the fix!
>>>
>>> We thought about increasing the number of PGs just after changing the
>>> merge/split threshold values, but this would have led to a _lot_ of data
>>> movement (1.2 billion XFS files) over weeks, without any possibility to
>>> scrub / deep-scrub to ensure data consistency. Still, as soon as we get
>>> the fix, we will increase the number of PGs.
>>>
>>> Regards,
>>>
>>> Frederic.
>>>
>>>> On Dec 1, 2016, at 16:47, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
>>>>
>>>> Hello,
>>>>
>>>>> We're impacted by this bug (case 01725311). Our cluster is running
>>>>> RHCS 2.0 and is no longer able to scrub or deep-scrub.
>>>>>
>>>>> [1] http://tracker.ceph.com/issues/17859
>>>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007
>>>>> [3] https://github.com/ceph/ceph/pull/11898
>>>>>
>>>>> I'm worried we'll have to live with a cluster that can't
>>>>> scrub/deep-scrub until March 2017 (ETA for RHCS 2.2 running Jewel 10.2.4).
>>>>>
>>>>> Can we have this fix any sooner?
>>>>
>>>> As far as I know about that bug, it appears if you have big PGs; a
>>>> workaround could be to increase the pg_num of the pool that has the
>>>> biggest PGs.
>>>>
>>>> --
>>>> Yoann Moulin
>>>> EPFL IC-IT
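
For anyone hitting the same thing: the noscrub/nodeep-scrub flags discussed
above are ordinary cluster-wide OSD flags. A minimal sketch of toggling them
from Python via the ceph CLI (the flag names are the real ones; the script
itself, including its assumption of a default /etc/ceph/ceph.conf and an
admin keyring, is illustrative only):

    # Toggle the cluster-wide scrub flags by shelling out to the ceph CLI.
    # Assumes the ceph client is installed and the default conf/keyring work.
    import subprocess

    def set_osd_flag(flag, enable):
        """Run 'ceph osd set <flag>' or 'ceph osd unset <flag>'."""
        action = "set" if enable else "unset"
        subprocess.check_call(["ceph", "osd", action, flag])

    # Disable scrubbing while the suicide-timeout issue is being worked around...
    set_osd_flag("noscrub", True)
    set_osd_flag("nodeep-scrub", True)

    # ...and re-enable it once the fix has landed:
    # set_osd_flag("noscrub", False)
    # set_osd_flag("nodeep-scrub", False)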
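
Similarly, the workaround Yoann describes comes down to raising pg_num (and
then pgp_num) on the pool whose PGs are largest. A sketch under the same
assumptions; the pool name "rbd" and the target of 2048 are placeholders,
not values from this thread:

    # Split oversized PGs by raising pg_num, then pgp_num, on one pool.
    # "rbd" and 2048 are placeholders -- pick the pool with the biggest PGs
    # and a target that keeps the PG-per-OSD count reasonable.
    import subprocess

    pool, target = "rbd", "2048"
    subprocess.check_call(["ceph", "osd", "pool", "set", pool, "pg_num", target])
    # pgp_num has to follow pg_num before data actually rebalances onto the new PGs.
    subprocess.check_call(["ceph", "osd", "pool", "set", pool, "pgp_num", target])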