On Tue, 20 Feb 2018, Wido den Hollander wrote:
> On 02/20/2018 03:05 PM, Dan van der Ster wrote:
> > Hi Wido,
> >
> > When you finish updating all osds in a cluster to luminous, the last step:
> >
> > ceph osd require-osd-release luminous
> >
> > actually sets the recovery_deletes flag.
> >
> > All our luminous clusters have this enabled:
> >
> > # ceph osd dump | grep recovery
> > flags sortbitwise,recovery_deletes
> >
>
> Yes, I noticed.
>
> > And that super secret redhat link explains that recovery_deletes
> > allows deletes to take place during recovery instead of at peering
> > time, which was previously the case.
> >
>
> Ok! The source told me that as well, but can somebody tell me the exact
> benefit of this?
>
> Does it improve/smoothen the peering process?

Yes!

> I heard rumors that it makes peering block less, but I'm not sure. I'd like
> to hear facts or experiences :)

We saw this in the sepia lab cluster, which has a big CephFS file system
that archives all of our test results. Lots of data is ingested
continuously, and some cron jobs clean up test results that are passes or
very old.

When the delete jobs are running, the pg logs end up with lots of delete
entries. If an OSD went down and then came back up having missed some of
those deletes, all of the deleted objects in the log would be synchronously
deleted in order for peering to progress, blocking IO and making PGs appear
stuck in the 'peering' state.

This change fixes that: peering completes immediately, and the deletes are
done asynchronously, just like modified or new objects would be during
recovery.

sage
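
(For reference, a minimal sketch of how to check this on a cluster, based on
the commands quoted above and assuming the standard ceph CLI; the 'pgs_brief'
dump format is an assumption about your release:)

    # confirm the flag is set after 'ceph osd require-osd-release luminous'
    ceph osd dump | grep recovery_deletes

    # with the flag set, PGs should no longer pile up in 'peering' while a
    # backlog of deletes drains; count them during a delete-heavy recovery
    ceph pg dump pgs_brief | grep -c peering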