Erroneous stats output (ceph df) after increasing PG number

On Mon, 4 Aug 2014, Konstantinos Tompoulidis wrote:
> Sage Weil <sweil at ...> writes:
> 
> > 
> > On Mon, 4 Aug 2014, Konstantinos Tompoulidis wrote:
> > > Hi all,
> > > 
> > > We recently added many OSDs to our production cluster.
> > > This brought us to a point where the number of PGs we had assigned to our 
> > > main (heavily used) pool was well below the recommended value.
> > > 
> > > We increased the PG number (incrementally to avoid huge degradation ratios) 
> > > to the recommended optimal value.
> > > 
> > > Once the procedure ended, we noticed that the POOLS: section of the 
> > > "ceph df" output does not represent the actual state.
> > 
> > How did it mismatch reality?
> 
At the moment the size reported for the main pool is 2.3 times the actual
size.
> 
> > 
> > > Has anyone noticed this before and if so is there a fix?
> > 
> > There is some ambiguity in the stats after PG split that gets cleaned up 
> > on the next scrub.  I wouldn't expect it to be noticeable, though ...
> > 
> > sage
> > 
> 
> Unfortunately it is quite noticeable.
> ...
> Our setup* serves a high-IO, low-latency cloud infrastructure (the disks of
> the VMs are block devices on the hypervisor, exposed to the OS by an open
> source in-house implementation that relies on librados). Due to the added
> load that scrub and deep scrub impose on the cluster, we decided to disable
> these operations. Re-enabling them has a huge negative impact on the
> performance of the infrastructure.
> 
> Is scrubbing absolutely necessary?

It's the only way to detect bitrot when running on XFS or ext4.  :/
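
For reference, if scrubbing was disabled cluster-wide with the usual OSD 
flags, it can be toggled like this (a sketch using the stock noscrub and 
nodeep-scrub flags of the standard ceph CLI):

 # stop new scrubs and deep scrubs from being scheduled
 ceph osd set noscrub
 ceph osd set nodeep-scrub

 # allow them to run again
 ceph osd unset noscrub
 ceph osd unset nodeep-scrub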

> If yes, is there a way to mitigate its impact on performance?

You can set

 osd scrub sleep = .01

on current development releases; this simply injects a delay between scrub 
operations to limit the impact on other IO.  (It's a crude but effective 
stopgap until we get more sophisticated IO scheduling in giant or hammer.)
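
For example, on a version that has the option, it can be set persistently in 
ceph.conf and/or pushed to running OSDs with injectargs (a sketch, assuming 
the standard [osd] section and the usual ceph tell syntax):

 # ceph.conf, picked up on OSD restart
 [osd]
     osd scrub sleep = .01

 # or inject into all running OSDs without a restart
 ceph tell osd.* injectargs '--osd-scrub-sleep 0.01'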

I have a pending backport of this to firefly.  It is already present in 
the latest dumpling branch and will be included in 0.67.10.
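
Since the leftover stat ambiguity from the PG split is only cleaned up by 
scrubbing, once scrubs are allowed again you could also nudge the affected 
PGs manually rather than waiting for the scheduler. A rough sketch (pool id 
5 is a placeholder for your main pool's id):

 # ask every PG in pool 5 to scrub
 ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^5\./ {print $1}' | \
     while read pgid; do ceph pg scrub "$pgid"; done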

sage

