Re: [PATCH v1 1/1] Submit the codes for QEMU disk I/O limits.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jul 25, 2011 at 8:08 AM, Zhi Yong Wu <zwu.kernel@xxxxxxxxx> wrote:
> On Fri, Jul 22, 2011 at 6:54 PM, Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote:
>> On Fri, Jul 22, 2011 at 10:20 AM, Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx> wrote:
>>> +    elapsed_time  = (real_time - bs->slice_start[is_write]) / 1000000000.0;
>>> +    fprintf(stderr, "real_time = %ld, slice_start = %ld, elapsed_time = %g\n", real_time, bs->slice_start[is_write], elapsed_time);
>>> +
>>> +    bytes_limit        = bps_limit * slice_time;
>>> +    bytes_disp  = bs->io_disps->bytes[is_write];
>>> +    if (bs->io_limits->bps[2]) {
>>> +        bytes_disp += bs->io_disps->bytes[!is_write];
>>> +    }
>>> +
>>> +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>>> +    fprintf(stderr, "bytes_limit = %g bytes_disp = %g, bytes_res = %g, elapsed_time = %g\n", bytes_limit, bytes_disp, bytes_res, elapsed_time);
>>> +
>>> +    if (bytes_disp + bytes_res <= bytes_limit) {
>>> +        if (wait) {
>>> +            *wait = 0;
>>> +        }
>>> +
>>> +       fprintf(stderr, "bytes_disp + bytes_res <= bytes_limit\n");
>>> +        return false;
>>> +    }
>>> +
>>> +    /* Calc approx time to dispatch */
>>> +    wait_time = (bytes_disp + bytes_res - bytes_limit) / bps_limit;
>>> +    if (!wait_time) {
>>> +        wait_time = 1;
>>> +    }
>>> +
>>> +    wait_time = wait_time + (slice_time - elapsed_time);
>>> +    if (wait) {
>>> +       *wait = wait_time * 1000000000 + 1;
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>
>> After a slice expires all bytes/iops dispatched data is forgotten,
>> even if there are still requests queued.  This means that requests
>> issued by the guest immediately after a new 100 ms period will be
>> issued but existing queued requests will still be waiting.
>>
>> And since the queued requests don't get their next chance until later,
>> it's possible that they will be requeued because the requests that the
>> guest snuck in have brought us to the limit again.
>>
>> In order to solve this problem, we need to extend the current slice if
> Extend the current slice? like in-kernel block throttling algorithm?
> Our algorithm seems not to adopt it currently.

I'm not sure if extending the slice is necessary as long as new
requests are queued while previous requests are still queued.  But
extending slices is one way to deal with requests that span across
multiple slices.  See below.

>> there are still requests pending.  To prevent extensions from growing
>> the slice forever (and keeping too much old data around), it should be
>> alright to cull data from 2 slices ago.  The simplest way of doing
>> that is to subtract the bps/iops limits from the bytes/iops
>> dispatched.
> You mean that the largest value of current_slice_time is not more than
> 2 slice_time?

Yes.  If no single request is larger than the I/O limit, then the
timer value for a queued request should always be within the next
slice.  Therefore everything before last slice should be completed
already and we don't need to keep that history around.

>>
>>> @@ -2129,6 +2341,19 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>>     if (bdrv_check_request(bs, sector_num, nb_sectors))
>>>         return NULL;
>>>
>>> +    /* throttling disk read I/O */
>>> +    if (bs->io_limits != NULL) {
>>> +       if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>>> +                               sector_num, qiov, nb_sectors, cb, opaque);
>>
>> 5 space indent, should be 4.
>>
>>> +           fprintf(stderr, "bdrv_aio_readv: wait_time = %ld, timer value = %ld\n", wait_time, wait_time + qemu_get_clock_ns(vm_clock));
>>> +           qemu_mod_timer(bs->block_timer, wait_time + qemu_get_clock_ns(vm_clock));
>>
>> Imagine 3 requests that need to be queued: A, B, and C.  Since the
>> timer is unconditionally set each time a request is queued, the timer
>> will be set to C's wait_time.  However A or B's wait_time might be
>> earlier and we will miss that deadline!
> Yeah, exactly there is this issue.
>>
>> We really need a priority queue here.  QEMU's timers solve the same
>> problem with a sorted list, which might actually be faster for short
>> lists where a fancy data structure has too much overhead.
> You mean the block requests should be handled in FIFO way in order?
> If the block queue is not empty, should this coming request be
> enqueued at first? right?

Yes.  If the limit was previously exceeded, enqueue new requests immediately:

/* If a limit was exceeded, immediately queue this request */
if (!QTAILQ_EMPTY(&queue->requests)) {
    if (limits->bps[IO_LIMIT_TOTAL]) {
        /* queue any rd/wr request */
    } else if (limits->bps[is_write] && another_request_is_queued[is_write]) {
        /* queue if the rd/wr-specific limit was previously exceeded */
    }

    ...same for iops...
}

This way new requests cannot skip ahead of queued requests due to the
lost history when a new slice starts.

>>
>>> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>>> +                       BlockDriverState *bs,
>>> +                       BlockRequestHandler *handler,
>>> +                       int64_t sector_num,
>>> +                       QEMUIOVector *qiov,
>>> +                       int nb_sectors,
>>> +                       BlockDriverCompletionFunc *cb,
>>> +                       void *opaque)
>>> +{
>>> +    BlockIORequest *request;
>>> +    BlockDriverAIOCB *acb;
>>> +
>>> +    request = qemu_malloc(sizeof(BlockIORequest));
>>> +    request->bs = bs;
>>> +    request->handler = handler;
>>> +    request->sector_num = sector_num;
>>> +    request->qiov = qiov;
>>> +    request->nb_sectors = nb_sectors;
>>> +    request->cb = cb;
>>> +    request->opaque = opaque;
>>> +
>>> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>>> +
>>> +    acb = qemu_malloc(sizeof(*acb));
>>> +    acb->bs = bs;
>>> +    acb->cb = cb;
>>> +    acb->opaque = opaque;
>>
>> AIOPool and qemu_aio_get() should be used instead of manually doing
>> this.  Otherwise bdrv_aio_cancel() does not work.
>>
>> In order to free our acb when the I/O request completes we need a cb
>> function.  Right now acb is leaking.
> Yeah,
>>> +    acb->cb = cb; this expression can not complete what you expect?

A qemu_block_queue_cb() function should be called instead of directly
invoking the user's cb function.  This way the block queue code has a
chance to release its acb.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux