On Sat, May 20, 2017 at 8:24 AM, Ning Yao <zay11022@xxxxxxxxx> wrote:
> so it seems pretty reasonable to do it
> with dynamic QoS strategy and serve the user IO first at anytime. Only
> in this way, it can achieve the final goal for this issue.

But part of the final goal is to minimize unhappiness, including
unhappiness from loss of data after a double failure, and that means
completing recovery in a timely way. Giving strict priority to user I/O
could starve recovery indefinitely. Some systems are *always* busy. It
seems likely to result in highly variable and unpredictable recovery
times. I think unpredictability about when their data "will be fully
protected again" is a source of anxiety for customers if recovery can
take more than a few hours.

One nice thing about controlling with queue depth is that it
self-adjusts to the load. If the network and peer machine are idle, the
operations will flow at their maximum rate for a given queue depth
(IOPS = QD / RTT, where RTT is the round-trip time of the entire
circuit: the network and the peer service together).

But if other load is present on the network or on the peer CPU, its
requested operations will interleave with the recovery I/O. This drives
up RTT (by slowing the peer server and/or delaying the network), which
automatically reduces recovery IOPS without any adjustment to Queue
Depth. Under high client load there will be many client I/O operations
for each recovery operation.

The Queue Depth can still be adjusted to set the overall aggressiveness
of the recovery process.
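
To make the IOPS = QD / RTT relation concrete, here is a rough sketch in
Python. It is not Ceph code. The function names, the single-FIFO-server
model of the peer, and all the numbers are assumptions chosen for
illustration only. It estimates RTT with a simple asymptotic bound for a
closed loop (RTT is at least one unloaded round trip, and at least the
time to serve every outstanding operation once) and then applies
IOPS = QD / RTT to the recovery stream.

```python
def recovery_iops(queue_depth: int, rtt_seconds: float) -> float:
    """IOPS = QD / RTT for a stream that keeps queue_depth ops in flight."""
    if queue_depth < 0 or rtt_seconds <= 0:
        raise ValueError("queue_depth must be >= 0 and rtt_seconds > 0")
    return queue_depth / rtt_seconds


def estimated_rtt(service_time: float, network_delay: float,
                  recovery_qd: int, client_qd: int) -> float:
    """Toy RTT estimate (assumption: the peer is one FIFO server).

    Unloaded, a round trip costs network_delay + service_time. Once the
    server is saturated, each op waits for every outstanding op
    (recovery and client) to be served, so RTT grows with the total.
    """
    unloaded = network_delay + service_time
    saturated = (recovery_qd + client_qd) * service_time
    return max(unloaded, saturated)


if __name__ == "__main__":
    SERVICE_TIME = 0.001   # hypothetical: 1 ms of peer work per op
    NETWORK_DELAY = 0.001  # hypothetical: 1 ms on the wire per round trip
    RECOVERY_QD = 4        # the knob that sets recovery aggressiveness

    for client_qd in (0, 4, 12, 36):
        rtt = estimated_rtt(SERVICE_TIME, NETWORK_DELAY,
                            RECOVERY_QD, client_qd)
        iops = recovery_iops(RECOVERY_QD, rtt)
        print(f"client ops in flight={client_qd:3d}  "
              f"RTT={rtt * 1000:6.1f} ms  recovery IOPS={iops:7.1f}")
```

In this toy model the recovery queue depth never changes, yet recovery
throughput falls as client operations pile up, and it never drops to
zero: with 4 recovery ops and 36 client ops in flight, recovery gets
about a tenth of the peer's capacity. Raising or lowering RECOVERY_QD
shifts that share, which is the "aggressiveness" adjustment described
above.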