On Fri, 2 Dec 2016, Dan Jakubiec wrote:
> For what it's worth... this sounds like the condition we hit when we
> re-enabled scrub on our 16 OSDs (after 6 to 8 weeks of noscrub). They
> flapped for about 30 minutes as most of the OSDs randomly hit suicide
> timeouts here and there.
>
> This settled down after about an hour and the OSDs stopped dying. We
> have since left scrub enabled for about 4 days and have only seen three
> small spurts of OSD flapping since then (which quickly resolved
> themselves).

Yeah. I think what's happening is that with a cold cache it is slow
enough to suicide, but with a warm cache it manages to complete (although
I bet it's still stalling other client IO for perhaps multiple seconds).
I would leave noscrub set for now.

sage

>
> -- Dan
>
> > On Dec 1, 2016, at 14:38, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> >
> > Hi Yoann,
> >
> > Thank you for your input. I was just told by RH support that it’s gonna make it to RHCS 2.0 (10.2.3). Thank you guys for the fix!
> >
> > We thought about increasing the number of PGs just after changing the merge/split threshold values, but this would have led to a _lot_ of data movement (1.2 billion XFS files) over weeks, without any possibility to scrub / deep-scrub to ensure data consistency. Still, as soon as we get the fix, we will increase the number of PGs.
> >
> > Regards,
> >
> > Frederic.
> >
> >> On 1 Dec 2016, at 16:47, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
> >>
> >> Hello,
> >>
> >>> We're impacted by this bug (case 01725311). Our cluster is running RHCS 2.0 and is no longer able to scrub or deep-scrub.
> >>>
> >>> [1] http://tracker.ceph.com/issues/17859
> >>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007
> >>> [3] https://github.com/ceph/ceph/pull/11898
> >>>
> >>> I'm worried we'll have to live with a cluster that can't scrub/deep-scrub until March 2017 (ETA for RHCS 2.2 running Jewel 10.2.4).
> >>>
> >>> Can we have this fix any sooner?
> >>
> >> As far as I know, this bug appears when you have big PGs. A workaround could be to increase the pg_num of the pool that has the biggest PGs.
> >>
> >> --
> >> Yoann Moulin
> >> EPFL IC-IT
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
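
For reference, a minimal sketch of the two steps discussed in this thread: keeping the scrub flags set, and later raising pg_num on the pool with the biggest PGs. It only builds and prints the `ceph` command lines; it does not run them. The pool name `example-pool` and the target of 2048 PGs are hypothetical placeholders, not values from this thread. Setting `nodeep-scrub` alongside `noscrub`, and following `pg_num` with `pgp_num`, are assumptions about how one would apply the advice on a Jewel-era cluster.

```python
import shlex

# Cluster-wide OSD flags that disable scrubbing and deep scrubbing.
SCRUB_FLAGS = ("noscrub", "nodeep-scrub")


def scrub_flag_commands(action="set"):
    """Build `ceph osd set|unset <flag>` argument lists for both scrub flags."""
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in SCRUB_FLAGS]


def pg_num_commands(pool, pg_num):
    """Build the commands that raise pg_num, then pgp_num, on one pool."""
    if pg_num <= 0:
        raise ValueError("pg_num must be positive")
    return [
        ["ceph", "osd", "pool", "set", pool, key, str(pg_num)]
        for key in ("pg_num", "pgp_num")
    ]


if __name__ == "__main__":
    # Hypothetical pool name and PG count, for illustration only.
    commands = scrub_flag_commands("set") + pg_num_commands("example-pool", 2048)
    for argv in commands:
        print(shlex.join(argv))
```

As noted above, raising pg_num triggers a lot of data movement, so the printed commands are something to review by hand rather than run blindly.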