Re: global backfill reservation?

David Butterfield <dab21774@xxxxxxxxx> · Sat, 20 May 2017 21:34:28 -0600

On Sat, May 20, 2017 at 8:24 AM, Ning Yao <zay11022@xxxxxxxxx> wrote:
> so it seems pretty reasonable to do it
> with dynamic QoS strategy and serve the user IO first at anytime. Only
> in this way, it can achieve the final goal for this issue.

But part of the final goal is to minimize unhappiness, including unhappiness
from loss of data after a double failure, which means completing recovery in
a timely manner.  Giving strict priority to user I/O could starve recovery
indefinitely.  Some systems are *always* busy.

Strict priority also seems likely to result in highly variable and
unpredictable recovery times.  I think uncertainty about when their data "will
be fully protected again" is a source of anxiety for customers, if recovery
can take more than a few hours.

One nice thing about controlling recovery with queue depth is that it
self-adjusts to the load.  If the network and peer machine are idle, the
operations will flow at their maximum rate for a given queue depth:
IOPS = QD / RTT, where RTT is the round-trip time of the entire circuit (the
network and the peer service together).
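
To put numbers on the IOPS = QD / RTT relation above, here is a minimal
sketch.  The function name and the example values (QD of 4, RTT of 2 ms and
20 ms) are mine, chosen only for illustration; they are not Ceph settings or
measurements.

```python
def recovery_iops(queue_depth: int, rtt_ms: float) -> float:
    """Steady-state rate of a closed loop that keeps `queue_depth`
    operations outstanding, each taking `rtt_ms` to complete.

    This is just IOPS = QD / RTT, with RTT given in milliseconds.
    """
    if queue_depth < 0:
        raise ValueError("queue_depth must be non-negative")
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return queue_depth * 1000.0 / rtt_ms


# Illustrative values only (assumptions, not measurements):
idle = recovery_iops(queue_depth=4, rtt_ms=2.0)    # idle network and peer
busy = recovery_iops(queue_depth=4, rtt_ms=20.0)   # same QD, RTT 10x higher
print(f"idle: {idle:.0f} IOPS, busy: {busy:.0f} IOPS")
```

With the queue depth held fixed, the rate is inversely proportional to RTT,
so anything that lengthens the round trip lowers the recovery rate by the
same factor.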

But if other load is present on the network or on the peer CPU, that load's
operations will interleave with the recovery I/O; this drives up RTT (by
slowing the peer server and/or delaying the network), automatically reducing
recovery IOPS without any adjustment to Queue Depth.  Under high client load
there will be many client I/O operations for each recovery operation.
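
A toy model may make the interleaving concrete.  This is a sketch under
simplifying assumptions of my own, not a description of how any OSD
schedules work: one peer server that handles operations in FIFO order, a
fixed service time per operation, no network delay, and client load modeled
as a fixed number of outstanding client operations.  Each issuer resubmits
an operation as soon as one of its own completes.

```python
from collections import deque


def simulate(recovery_qd: int, client_qd: int,
             service_ms: float = 1.0, total_ops: int = 10_000) -> dict:
    """Hypothetical closed-loop model of one FIFO peer server.

    Recovery keeps `recovery_qd` ops outstanding and clients keep
    `client_qd` ops outstanding.  Returns completed ops per second
    for each kind of traffic.
    """
    if recovery_qd + client_qd < 1:
        raise ValueError("need at least one outstanding operation")
    # Initial queue contents; the order only affects the first pass.
    queue = deque(["recovery"] * recovery_qd + ["client"] * client_qd)
    done = {"recovery": 0, "client": 0}
    for _ in range(total_ops):
        kind = queue.popleft()   # server finishes the op at the head
        done[kind] += 1
        queue.append(kind)       # issuer immediately submits a replacement
    elapsed_s = total_ops * service_ms / 1000.0
    return {kind: count / elapsed_s for kind, count in done.items()}


# Same recovery queue depth, with and without competing client load.
print(simulate(recovery_qd=4, client_qd=0))
print(simulate(recovery_qd=4, client_qd=36))
```

In this model each recovery operation waits behind every other queued
operation, so its RTT is (recovery_qd + client_qd) * service_ms and recovery
gets a recovery_qd / (recovery_qd + client_qd) share of the server.  Recovery
slows down as client load rises but is never starved, and raising
recovery_qd raises its share.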

The Queue Depth can still be adjusted to set the overall aggressiveness
of the recovery process.