Re: [PATCH v5 3/4] block: add block timer and block throttling algorithm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Aug 10, 2011 at 7:57 AM, Zhi Yong Wu <zwu.kernel@xxxxxxxxx> wrote:
> On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote:
>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx> wrote:
>>> Note:
>>>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this senario.
>>
>> If an I/O request is larger than the limit itself then I think we
>> should let it through with a warning that the limit is too low.  This
> If it will print a waring, it seems to be not a nice way, the
> throtting function will take no effect on this scenario.

Setting the limit below the max request size is a misconfiguration.
It's a problem that the user needs to be aware of and fix, so a
warning is appropriate.  Unfortunately the max request size is not
easy to determine from the block layer and may vary between guest
OSes, that's why we need to do something at runtime.

We cannot leave this case unhandled because it results in the guest
timing out I/O without any information about the problem - that makes
it hard to troubleshoot for the user.  Any other ideas on how to
handle this case?

>>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>>  }
>>>  #endif
>>>
>>> +/* throttling disk I/O limits */
>>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>>> +{
>>> +    bs->io_limits_enabled = false;
>>> +    bs->req_from_queue    = false;
>>> +
>>> +    if (bs->block_queue) {
>>> +        qemu_block_queue_flush(bs->block_queue);
>>> +        qemu_del_block_queue(bs->block_queue);
>>
>> When you fix the acb lifecycle in block-queue.c this will no longer be
>> safe.  Requests that are dispatched still have acbs that belong to
> When the request is dispatched, if it is successfully serviced, our
> acb will been removed.

Yes, that is how the patches work right now.  But the acb lifecycle
that I explained in response to the other patch changes this.

>> this queue.  It is simplest to keep the queue for the lifetime of the
>> BlockDriverState - it's just a list so it doesn't take up much memory.
> I more prefer to removing this queue, even though it is simple and
> harmless to let it exist.

If the existing acbs do not touch their BlockQueue after this point in
the code, then it will be safe to delete the queue since it will no
longer be referenced.  If you can guarantee this then it's fine to
keep this code.

>
>>
>>> +    }
>>> +
>>> +    if (bs->block_timer) {
>>> +        qemu_del_timer(bs->block_timer);
>>> +        qemu_free_timer(bs->block_timer);
>>
>> To prevent double frees:
>>
>> bs->block_timer = NULL;
> Can qemu_free_timer() not ensure that it is NULL? This is not necessary.:)

No, qemu_free_timer() is a function.  qemu_free_timer() cannot change
the pointer variable bs->block_timer because in C functions take
arguments by value.

After qemu_free_timer() bs->block_timer will still hold the old
address of the (freed) timer.  Then when bdrv_close() is called it
does qemu_free_timer() again because bs->block_timer is not NULL.  It
is illegal to free() a pointer twice and it can cause a crash.

>
>>
>>> +    }
>>> +
>>> +    bs->slice_start[0]   = 0;
>>> +    bs->slice_start[1]   = 0;
>>> +
>>> +    bs->slice_end[0]     = 0;
>>> +    bs->slice_end[1]     = 0;
>>> +}
>>> +
>>> +static void bdrv_block_timer(void *opaque)
>>> +{
>>> +    BlockDriverState *bs = opaque;
>>> +    BlockQueue *queue = bs->block_queue;
>>> +
>>> +    qemu_block_queue_flush(queue);
>>> +}
>>> +
>>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>>> +{
>>> +    bs->req_from_queue = false;
>>> +
>>> +    bs->block_queue    = qemu_new_block_queue();
>>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>>> +
>>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>>> +
>>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>
>> The slice times could just be initialized to 0 here.  When
>> bdrv_exceed_io_limits() is called it will start a new slice.
> The result should be same as current method even though they are set to ZERO.

Yes, the result is the same except we only create new slices in one place.

>>> +            }
>>> +
>>> +            return true;
>>> +        }
>>> +    }
>>> +
>>> +    elapsed_time  = real_time - bs->slice_start[is_write];
>>> +    elapsed_time  /= (BLOCK_IO_SLICE_TIME * 10.0);
>>
>> Why * 10.0?
> elapsed_time currently is in ns, and it need to be translated into a
> floating value in minutes.

I think you mean seconds instead of minutes.  Then the conversion has
nothing to do with BLOCK_IO_SLICE_TIME, it's just nanoseconds to
seconds.  It would be clearer to drop BLOCK_IO_SLICE_TIME from the
expression and divide by a new NANOSECONDS_PER_SECOND constant
(1000000000.0).

>>
>>> +
>>> +    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
>>> +                                      is_write, elapsed_time, &bps_wait);
>>> +    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
>>> +                                      elapsed_time, &iops_wait);
>>> +    if (bps_ret || iops_ret) {
>>> +        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
>>> +        if (wait) {
>>> +            *wait = max_wait;
>>> +        }
>>> +
>>> +        real_time = qemu_get_clock_ns(vm_clock);
>>> +        if (bs->slice_end[is_write] < real_time + max_wait) {
>>> +            bs->slice_end[is_write] = real_time + max_wait;
>>> +        }
>>
>> Why is this necessary?  We have already extended the slice at the top
>> of the function.
> Here it will adjust this end of slice time based on
> bdrv_exceed_bps_limits(bdrv_exceed_iops_limits).
> This will be more accurate.

After rereading it I realized that this extends the slice up to the
estimated dispatch time (max_wait) and not just by
BLOCK_IO_SLICE_TIME.  This code now makes sense to me: we're trying to
keep slices around while there are queued requests so that
iops/throughput history is not forgotten when queued requests are
dispatched.

>>
>>> +
>>> +        return true;
>>> +    }
>>> +
>>> +    if (wait) {
>>> +        *wait = 0;
>>> +    }
>>> +
>>> +    return false;
>>> +}
>>>
>>>  /**************************************************************/
>>>  /* async I/Os */
>>> @@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>>  {
>>>     BlockDriver *drv = bs->drv;
>>>     BlockDriverAIOCB *ret;
>>> +    uint64_t wait_time = 0;
>>>
>>>     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>>>
>>> -    if (!drv)
>>> -        return NULL;
>>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>>> +    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
>>> +        if (bs->io_limits_enabled) {
>>> +            bs->req_from_queue = false;
>>> +        }
>>>         return NULL;
>>> +    }
>>> +
>>> +    /* throttling disk read I/O */
>>> +    if (bs->io_limits_enabled) {
>>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>>> +                              sector_num, qiov, nb_sectors, cb, opaque);
>>> +            qemu_mod_timer(bs->block_timer,
>>> +                       wait_time + qemu_get_clock_ns(vm_clock));
>>
>> If a guest keeps sending I/O requests, say every millisecond, then we
>> will delay dispatch forever.  In practice we hope the storage
>> interface (virtio-blk, LSI SCSI, etc) has a resource limit that
>> prevents this.  But the point remains that this delays requests that
>> should be dispatched for too long.  Once the timer has been set we
>> should not set it again.
> Can you elaborate this?

Let's wait until the next version of the patch, I've suggested many
changes and should comment against current code instead of the way I
think it is going to work :).

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux