Re: [PATCH v5 3/4] block: add block timer and block throttling algorithm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Aug 10, 2011 at 7:00 PM, Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote:
> On Wed, Aug 10, 2011 at 7:57 AM, Zhi Yong Wu <zwu.kernel@xxxxxxxxx> wrote:
>> On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote:
>>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx> wrote:
>>>> Note:
>>>>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this senario.
>>>
>>> If an I/O request is larger than the limit itself then I think we
>>> should let it through with a warning that the limit is too low.  This
>> If it will print a waring, it seems to be not a nice way, the
>> throtting function will take no effect on this scenario.
>
> Setting the limit below the max request size is a misconfiguration.
> It's a problem that the user needs to be aware of and fix, so a
> warning is appropriate.  Unfortunately the max request size is not
> easy to determine from the block layer and may vary between guest
> OSes, that's why we need to do something at runtime.
>
> We cannot leave this case unhandled because it results in the guest
> timing out I/O without any information about the problem - that makes
> it hard to troubleshoot for the user.  Any other ideas on how to
> handle this case?
Can we constrain the io limits specified by the user? When the limits
is smaller than some value such as 1MB/s, one error is provide to the
user and remind he/her that the limits is too small.

>
>>>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>>>  }
>>>>  #endif
>>>>
>>>> +/* throttling disk I/O limits */
>>>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>>>> +{
>>>> +    bs->io_limits_enabled = false;
>>>> +    bs->req_from_queue    = false;
>>>> +
>>>> +    if (bs->block_queue) {
>>>> +        qemu_block_queue_flush(bs->block_queue);
>>>> +        qemu_del_block_queue(bs->block_queue);
>>>
>>> When you fix the acb lifecycle in block-queue.c this will no longer be
>>> safe.  Requests that are dispatched still have acbs that belong to
>> When the request is dispatched, if it is successfully serviced, our
>> acb will been removed.
>
> Yes, that is how the patches work right now.  But the acb lifecycle
> that I explained in response to the other patch changes this.
OK.
>
>>> this queue.  It is simplest to keep the queue for the lifetime of the
>>> BlockDriverState - it's just a list so it doesn't take up much memory.
>> I more prefer to removing this queue, even though it is simple and
>> harmless to let it exist.
>
> If the existing acbs do not touch their BlockQueue after this point in
> the code, then it will be safe to delete the queue since it will no
> longer be referenced.  If you can guarantee this then it's fine to
> keep this code.
I prefer to removing the queue.
>
>>
>>>
>>>> +    }
>>>> +
>>>> +    if (bs->block_timer) {
>>>> +        qemu_del_timer(bs->block_timer);
>>>> +        qemu_free_timer(bs->block_timer);
>>>
>>> To prevent double frees:
>>>
>>> bs->block_timer = NULL;
>> Can qemu_free_timer() not ensure that it is NULL? This is not necessary.:)
>
> No, qemu_free_timer() is a function.  qemu_free_timer() cannot change
> the pointer variable bs->block_timer because in C functions take
> arguments by value.
>
> After qemu_free_timer() bs->block_timer will still hold the old
> address of the (freed) timer.  Then when bdrv_close() is called it
> does qemu_free_timer() again because bs->block_timer is not NULL.  It
> is illegal to free() a pointer twice and it can cause a crash.
Done.
>
>>
>>>
>>>> +    }
>>>> +
>>>> +    bs->slice_start[0]   = 0;
>>>> +    bs->slice_start[1]   = 0;
>>>> +
>>>> +    bs->slice_end[0]     = 0;
>>>> +    bs->slice_end[1]     = 0;
>>>> +}
>>>> +
>>>> +static void bdrv_block_timer(void *opaque)
>>>> +{
>>>> +    BlockDriverState *bs = opaque;
>>>> +    BlockQueue *queue = bs->block_queue;
>>>> +
>>>> +    qemu_block_queue_flush(queue);
>>>> +}
>>>> +
>>>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>>>> +{
>>>> +    bs->req_from_queue = false;
>>>> +
>>>> +    bs->block_queue    = qemu_new_block_queue();
>>>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>>>> +
>>>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>>>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>>>> +
>>>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>
>>> The slice times could just be initialized to 0 here.  When
>>> bdrv_exceed_io_limits() is called it will start a new slice.
>> The result should be same as current method even though they are set to ZERO.
>
> Yes, the result is the same except we only create new slices in one place.
Done.
>
>>>> +            }
>>>> +
>>>> +            return true;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    elapsed_time  = real_time - bs->slice_start[is_write];
>>>> +    elapsed_time  /= (BLOCK_IO_SLICE_TIME * 10.0);
>>>
>>> Why * 10.0?
>> elapsed_time currently is in ns, and it need to be translated into a
>> floating value in minutes.
>
> I think you mean seconds instead of minutes.  Then the conversion has
> nothing to do with BLOCK_IO_SLICE_TIME, it's just nanoseconds to
> seconds.  It would be clearer to drop BLOCK_IO_SLICE_TIME from the
> expression and divide by a new NANOSECONDS_PER_SECOND constant
> (1000000000.0).
Done
>
>>>
>>>> +
>>>> +    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
>>>> +                                      is_write, elapsed_time, &bps_wait);
>>>> +    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
>>>> +                                      elapsed_time, &iops_wait);
>>>> +    if (bps_ret || iops_ret) {
>>>> +        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
>>>> +        if (wait) {
>>>> +            *wait = max_wait;
>>>> +        }
>>>> +
>>>> +        real_time = qemu_get_clock_ns(vm_clock);
>>>> +        if (bs->slice_end[is_write] < real_time + max_wait) {
>>>> +            bs->slice_end[is_write] = real_time + max_wait;
>>>> +        }
>>>
>>> Why is this necessary?  We have already extended the slice at the top
>>> of the function.
>> Here it will adjust this end of slice time based on
>> bdrv_exceed_bps_limits(bdrv_exceed_iops_limits).
>> This will be more accurate.
>
> After rereading it I realized that this extends the slice up to the
> estimated dispatch time (max_wait) and not just by
> BLOCK_IO_SLICE_TIME.  This code now makes sense to me: we're trying to
> keep slices around while there are queued requests so that
> iops/throughput history is not forgotten when queued requests are
> dispatched.
Right.
>
>>>
>>>> +
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    if (wait) {
>>>> +        *wait = 0;
>>>> +    }
>>>> +
>>>> +    return false;
>>>> +}
>>>>
>>>>  /**************************************************************/
>>>>  /* async I/Os */
>>>> @@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>>>  {
>>>>     BlockDriver *drv = bs->drv;
>>>>     BlockDriverAIOCB *ret;
>>>> +    uint64_t wait_time = 0;
>>>>
>>>>     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>>>>
>>>> -    if (!drv)
>>>> -        return NULL;
>>>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>>>> +    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
>>>> +        if (bs->io_limits_enabled) {
>>>> +            bs->req_from_queue = false;
>>>> +        }
>>>>         return NULL;
>>>> +    }
>>>> +
>>>> +    /* throttling disk read I/O */
>>>> +    if (bs->io_limits_enabled) {
>>>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>>>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>>>> +                              sector_num, qiov, nb_sectors, cb, opaque);
>>>> +            qemu_mod_timer(bs->block_timer,
>>>> +                       wait_time + qemu_get_clock_ns(vm_clock));
>>>
>>> If a guest keeps sending I/O requests, say every millisecond, then we
>>> will delay dispatch forever.  In practice we hope the storage
>>> interface (virtio-blk, LSI SCSI, etc) has a resource limit that
>>> prevents this.  But the point remains that this delays requests that
>>> should be dispatched for too long.  Once the timer has been set we
>>> should not set it again.
>> Can you elaborate this?
>
> Let's wait until the next version of the patch, I've suggested many
> changes and should comment against current code instead of the way I
> think it is going to work :).
OK.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux