Re: auto stop of scrubbing and deep scrubbing while backfilling or recovering

Wido den Hollander <wido@xxxxxxxx> · Wed, 9 Nov 2016 16:19:16 +0100 (CET)

> Op 8 november 2016 om 21:30 schreef Sage Weil <sage@xxxxxxxxxxxx>:
> 
> 
> On Tue, 8 Nov 2016, Wido den Hollander wrote:
> > > Op 8 november 2016 om 15:19 schreef Sage Weil <sage@xxxxxxxxxxxx>:
> > > 
> > > 
> > > On Tue, 8 Nov 2016, Wido den Hollander wrote:
> > > > > Op 8 november 2016 om 9:35 schreef Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx>:
> > > > > 
> > > > > 
> > > > > Hello,
> > > > > 
> > > > > I'm wondering if anybody has already thought about automatically
> > > > > stopping scrub and deep-scrub in case of backfilling or recovering. I've
> > > > > seen several situations where scrubbing massively raises the latency
> > > > > while doing backfilling or recovering.
> > > > > 
> > > > 
> > > > Seems like a sane change to me, but maybe a dev has a better option. I 
> > > > don't think a stop is easy, but a 'noscrub' flag could be set inside the 
> > > > OSD.
> > > > 
> > > > Maybe a config option: osd_scrub_during_recovery
> > > > 
> > > > Defaults to true, but can be set to false by the admin.
> > > > 
> > > > Before a scrub starts the OSD will check if there is recovery / 
> > > > backfilling active on the OSD and if so it will not initiate the scrub.
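A minimal sketch of the check described above. It is written in Python for brevity rather than in the OSD's C++, and it is not Ceph's actual scrub scheduler. The `OSDState` class, its fields and `may_start_scrub` are hypothetical. Only the option name `osd_scrub_during_recovery` and its proposed default of true come from the message.

```python
from dataclasses import dataclass


@dataclass
class OSDState:
    # Hypothetical view of one OSD's local state; not Ceph's real data structures.
    osd_scrub_during_recovery: bool = True  # proposed option, default true
    recovering_pgs: int = 0                 # PGs on this OSD currently recovering
    backfilling_pgs: int = 0                # PGs on this OSD currently backfilling


def may_start_scrub(osd: OSDState) -> bool:
    """Return True if this OSD may initiate a new scrub right now."""
    if osd.osd_scrub_during_recovery:
        return True  # admin allows scrubbing regardless of recovery
    # Only local recovery/backfill blocks *new* scrubs; running scrubs are untouched.
    return osd.recovering_pgs == 0 and osd.backfilling_pgs == 0


print(may_start_scrub(OSDState()))  # True
print(may_start_scrub(OSDState(osd_scrub_during_recovery=False,
                               backfilling_pgs=3)))  # False
```

The check only gates whether a scrub is *started*. This matches the point that stopping a scrub already in progress is harder than not initiating one.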
> > > 
> > > Yeah, it seems reasonable.  I think there are two basic options:
> > > 
> > > - Disable scrubbing locally on each OSD if it has backfilling PGs.  Two 
> > > unrelated OSDs would be free to scrub and backfill at the same time.
> > > 
> > > - Disable scrubbing globally if any pgs are backfilling.  The reasoning 
> > > here is that if backfilling is increasing the latency on some PGs, we 
> > > don't want to increase the latency on others (by scrubbing) too.
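A toy comparison makes the difference between the two options concrete. It uses a hypothetical snapshot of which OSDs currently hold backfilling PGs. The data layout, the function names and the PG id `1.2a` are invented for illustration and do not reflect how the OSD tracks this state.

```python
from typing import Dict, Set

# Hypothetical cluster snapshot: OSD id -> set of PG ids backfilling on that OSD.
Backfills = Dict[int, Set[str]]


def scrub_allowed_local(osd_id: int, backfills: Backfills) -> bool:
    # Option 1: an OSD stops scrubbing only if it has backfilling PGs itself.
    return not backfills.get(osd_id)


def scrub_allowed_global(osd_id: int, backfills: Backfills) -> bool:
    # Option 2: every OSD stops scrubbing if any PG anywhere is backfilling.
    return not any(backfills.values())


# 4-OSD cluster where only osd.0 and osd.1 share one backfilling PG.
backfills = {0: {"1.2a"}, 1: {"1.2a"}, 2: set(), 3: set()}
local_ok = [o for o in backfills if scrub_allowed_local(o, backfills)]
global_ok = [o for o in backfills if scrub_allowed_global(o, backfills)]
print(local_ok)   # [2, 3]
print(global_ok)  # []
```

With the local rule, uninvolved OSDs keep scrubbing. With the global rule, one backfill pauses scrubbing across the whole cluster.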
> > > 
> > > The other consideration is that if backfill is happening it probably 
> > > doesn't mean we want to prevent scrubbing indefinitely.  Instead, I'd 
> > > suggest increasing the scrub intervals by some factor (e.g., 2x).
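The alternative of postponing scrubs rather than blocking them can be sketched as a scaled interval. This minimal sketch assumes a single base interval per PG and uses the 2x factor from the example above. The function names are hypothetical, and the OSD scheduler is not necessarily structured this way.

```python
def effective_scrub_interval(base_interval_s: float, backfilling: bool,
                             factor: float = 2.0) -> float:
    """Stretch the scrub interval while backfill is active instead of blocking scrubs."""
    if factor < 1.0:
        raise ValueError("factor must be >= 1 so backfill never shortens the interval")
    return base_interval_s * factor if backfilling else base_interval_s


def scrub_due(last_scrub_s: float, now_s: float, base_interval_s: float,
              backfilling: bool, factor: float = 2.0) -> bool:
    # A PG is due once the (possibly stretched) interval has elapsed since its last scrub.
    interval = effective_scrub_interval(base_interval_s, backfilling, factor)
    return now_s - last_scrub_s >= interval


DAY = 86400.0
# Last scrubbed 1.5 days ago, with a hypothetical 1-day base interval.
print(scrub_due(0.0, 1.5 * DAY, DAY, backfilling=False))  # True
print(scrub_due(0.0, 1.5 * DAY, DAY, backfilling=True))   # False (needs 2 days)
print(scrub_due(0.0, 2.0 * DAY, DAY, backfilling=True))   # True
```

Because scrubs are only delayed, a long-running backfill cannot postpone scrubbing forever. Scrubbing resumes once the stretched interval has elapsed.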
> > > 
> > > The first option would probably be a change in the scrub scheduling in 
> > > the OSD.
> > > 
> > 
> > I would go for the first one. Imagine a large cluster where a single backfill is running: the second option would halt all scrubs even though only a few OSDs are involved.
> > 
> > I don't think option one is that hard to implement either.
> 
> I added a card to trello: https://trello.com/b/ugTc2QFH/ceph-backlog
> 

Wouldn't this be enough?

https://github.com/ceph/ceph/pull/11874

Wido

> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html