Re: [ceph-users] stalls caused by scrub on jewel

For what it's worth... this sounds like the condition we hit when we re-enabled scrub on our 16 OSDs (after 6 to 8 weeks of noscrub).  They flapped for about 30 minutes as most of the OSDs randomly hit suicide timeouts here and there.

This settled down after about an hour and the OSDs stopped dying.  We have since left scrub enabled for about 4 days and have only seen three small spurts of OSD flapping (which quickly resolved themselves).
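
A minimal sketch of that kind of gradual re-enable, assuming the ceph CLI is available on an admin node and a jewel-era release; the option values are illustrative, not the ones used here:

#!/usr/bin/env python3
# Sketch: throttle scrubbing before clearing the noscrub flag so a backlog of
# overdue PGs does not push every OSD into its suicide timeout at once.
import subprocess

def ceph(*args):
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Keep scrubbing gentle: one scrub per OSD, with a pause between scrub chunks.
ceph("tell", "osd.*", "injectargs", "--osd_max_scrubs 1 --osd_scrub_sleep 0.1")

# Re-enable plain scrub first; deep-scrub can follow once things settle.
ceph("osd", "unset", "noscrub")
# ceph("osd", "unset", "nodeep-scrub")

Since deep-scrub reads all object data, leaving nodeep-scrub set until plain scrub has caught up is one way to stage the load.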

-- Dan

> On Dec 1, 2016, at 14:38, Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> wrote:
> 
> Hi Yoann,
> 
> Thank you for your input. I was just told by RH support that it's going to make it into RHCS 2.0 (10.2.3). Thank you guys for the fix!
> 
> We thought about increasing the number of PGs just after changing the merge/split threshold values, but this would have led to a _lot_ of data movement (1.2 billion XFS files) over weeks, with no possibility of scrubbing / deep-scrubbing to ensure data consistency. Still, as soon as we get the fix, we will increase the number of PGs.
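
A minimal sketch of adjusting those filestore merge/split thresholds, assuming the ceph CLI is available; the values are illustrative, not the ones used here, and such a change may only take full effect after an OSD restart:

#!/usr/bin/env python3
# Sketch: raise the filestore merge/split thresholds so XFS directories are
# split or merged less often as PGs fill up. Values are illustrative only.
import subprocess

def ceph(*args):
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Inject at runtime, and persist the same values in ceph.conf as well, since
# these options may not be fully picked up until the OSDs restart.
ceph("tell", "osd.*", "injectargs",
     "--filestore_merge_threshold 40 --filestore_split_multiple 8")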
> 
> Regards,
> 
> Frederic.
> 
> 
> 
>> On Dec 1, 2016, at 16:47, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
>> 
>> Hello,
>> 
>>> We're impacted by this bug (case 01725311). Our cluster is running RHCS 2.0 and is no longer able to scrub or deep-scrub.
>>> 
>>> [1] http://tracker.ceph.com/issues/17859
>>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007
>>> [3] https://github.com/ceph/ceph/pull/11898
>>> 
>>> I'm worried we'll have to live with a cluster that can't scrub/deep-scrub until March 2017 (ETA for RHCS 2.2 running Jewel 10.2.4).
>>> 
>>> Can we have this fix any sooner?
>> 
>> As far as I know, that bug shows up when you have big PGs; a workaround could be to increase the pg_num of the pool that has the biggest PGs.
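
A minimal sketch of that workaround, raising pg_num/pgp_num in small steps so the resulting backfill stays manageable; the pool name and target below are placeholders, not values from this thread, and the ceph CLI is assumed to be available:

#!/usr/bin/env python3
# Sketch: grow pg_num/pgp_num of one pool in small steps, waiting for the
# cluster to return to HEALTH_OK between steps to limit concurrent backfill.
import subprocess
import time

POOL = "rbd"           # placeholder: the pool with the biggest PGs
TARGET_PG_NUM = 2048   # placeholder target
STEP = 256

def ceph(*args):
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# "ceph osd pool get <pool> pg_num" prints e.g. "pg_num: 512"
current = int(ceph("osd", "pool", "get", POOL, "pg_num").split()[-1])

while current < TARGET_PG_NUM:
    current = min(current + STEP, TARGET_PG_NUM)
    ceph("osd", "pool", "set", POOL, "pg_num", str(current))
    ceph("osd", "pool", "set", POOL, "pgp_num", str(current))
    # Wait for peering/backfill to finish before the next increase.
    while "HEALTH_OK" not in ceph("health"):
        time.sleep(60)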
>> 
>> -- 
>> Yoann Moulin
>> EPFL IC-IT
> 
