On Wed, Aug 10, 2011 at 7:00 PM, Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote:
> On Wed, Aug 10, 2011 at 7:57 AM, Zhi Yong Wu <zwu.kernel@xxxxxxxxx> wrote:
>> On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@xxxxxxxxx> wrote:
>>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@xxxxxxxxxxxxxxxxxx> wrote:
>>>> Note:
>>>> 1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
>>>
>>> If an I/O request is larger than the limit itself then I think we
>>> should let it through with a warning that the limit is too low. This
>> Printing a warning does not seem like a good approach; the throttling
>> function would then have no effect in this scenario.
>
> Setting the limit below the max request size is a misconfiguration.
> It's a problem that the user needs to be aware of and fix, so a
> warning is appropriate. Unfortunately the max request size is not
> easy to determine from the block layer and may vary between guest
> OSes, that's why we need to do something at runtime.
>
> We cannot leave this case unhandled because it results in the guest
> timing out I/O without any information about the problem - that makes
> it hard to troubleshoot for the user. Any other ideas on how to
> handle this case?
Can we constrain the I/O limits specified by the user? When a limit is
smaller than some value such as 1 MB/s, an error is reported to the
user to remind him/her that the limit is too small.
>
>>>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>>> }
>>>> #endif
>>>>
>>>> +/* throttling disk I/O limits */
>>>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>>>> +{
>>>> + bs->io_limits_enabled = false;
>>>> + bs->req_from_queue = false;
>>>> +
>>>> + if (bs->block_queue) {
>>>> + qemu_block_queue_flush(bs->block_queue);
>>>> + qemu_del_block_queue(bs->block_queue);
>>>
>>> When you fix the acb lifecycle in block-queue.c this will no longer be
>>> safe. Requests that are dispatched still have acbs that belong to
>> When the request is dispatched and successfully serviced, our acb
>> will be removed.
>
> Yes, that is how the patches work right now. But the acb lifecycle
> that I explained in response to the other patch changes this.
OK.
>
>>> this queue. It is simplest to keep the queue for the lifetime of the
>>> BlockDriverState - it's just a list so it doesn't take up much memory.
>> I would prefer to remove this queue, even though it is simple and
>> harmless to let it exist.
>
> If the existing acbs do not touch their BlockQueue after this point in
> the code, then it will be safe to delete the queue since it will no
> longer be referenced. If you can guarantee this then it's fine to
> keep this code.
I prefer to remove the queue.
>
>>
>>>
>>>> + }
>>>> +
>>>> + if (bs->block_timer) {
>>>> + qemu_del_timer(bs->block_timer);
>>>> + qemu_free_timer(bs->block_timer);
>>>
>>> To prevent double frees:
>>>
>>> bs->block_timer = NULL;
>> Does qemu_free_timer() not ensure that it is NULL? That would make this unnecessary. :)
>
> No, qemu_free_timer() is a function. qemu_free_timer() cannot change
> the pointer variable bs->block_timer because in C functions take
> arguments by value.
>
> After qemu_free_timer() bs->block_timer will still hold the old
> address of the (freed) timer. Then when bdrv_close() is called it
> does qemu_free_timer() again because bs->block_timer is not NULL. It
> is illegal to free() a pointer twice and it can cause a crash.
Done.
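With these points applied, bdrv_io_limits_disable() would look roughly like
the following (only a sketch of what I have in mind; clearing bs->block_queue
as well is my own addition, and deleting the queue is kept only on the
assumption that no dispatched acb still references it):

void bdrv_io_limits_disable(BlockDriverState *bs)
{
    bs->io_limits_enabled = false;
    bs->req_from_queue    = false;

    if (bs->block_queue) {
        qemu_block_queue_flush(bs->block_queue);
        /* only safe if no dispatched acb still references the queue;
         * otherwise the queue must live as long as the BlockDriverState */
        qemu_del_block_queue(bs->block_queue);
        bs->block_queue = NULL;
    }

    if (bs->block_timer) {
        qemu_del_timer(bs->block_timer);
        qemu_free_timer(bs->block_timer);
        bs->block_timer = NULL;    /* avoid a second free in bdrv_close() */
    }

    bs->slice_start[BLOCK_IO_LIMIT_READ]  = 0;
    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = 0;
    bs->slice_end[BLOCK_IO_LIMIT_READ]    = 0;
    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   = 0;
}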
>
>>
>>>
>>>> + }
>>>> +
>>>> + bs->slice_start[0] = 0;
>>>> + bs->slice_start[1] = 0;
>>>> +
>>>> + bs->slice_end[0] = 0;
>>>> + bs->slice_end[1] = 0;
>>>> +}
>>>> +
>>>> +static void bdrv_block_timer(void *opaque)
>>>> +{
>>>> + BlockDriverState *bs = opaque;
>>>> + BlockQueue *queue = bs->block_queue;
>>>> +
>>>> + qemu_block_queue_flush(queue);
>>>> +}
>>>> +
>>>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>>>> +{
>>>> + bs->req_from_queue = false;
>>>> +
>>>> + bs->block_queue = qemu_new_block_queue();
>>>> + bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>>>> +
>>>> + bs->slice_start[BLOCK_IO_LIMIT_READ] = qemu_get_clock_ns(vm_clock);
>>>> + bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>>>> +
>>>> + bs->slice_end[BLOCK_IO_LIMIT_READ] =
>>>> + qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>> + bs->slice_end[BLOCK_IO_LIMIT_WRITE] =
>>>> + qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>
>>> The slice times could just be initialized to 0 here. When
>>> bdrv_exceed_io_limits() is called it will start a new slice.
>> The result should be the same as with the current method even if they
>> are set to zero.
>
> Yes, the result is the same except we only create new slices in one place.
Done.
>
>>>> + }
>>>> +
>>>> + return true;
>>>> + }
>>>> + }
>>>> +
>>>> + elapsed_time = real_time - bs->slice_start[is_write];
>>>> + elapsed_time /= (BLOCK_IO_SLICE_TIME * 10.0);
>>>
>>> Why * 10.0?
>> elapsed_time is currently in ns, and it needs to be translated into a
>> floating-point value in minutes.
>
> I think you mean seconds instead of minutes. Then the conversion has
> nothing to do with BLOCK_IO_SLICE_TIME, it's just nanoseconds to
> seconds. It would be clearer to drop BLOCK_IO_SLICE_TIME from the
> expression and divide by a new NANOSECONDS_PER_SECOND constant
> (1000000000.0).
Done.
>
>>>
>>>> +
>>>> + bps_ret = bdrv_exceed_bps_limits(bs, nb_sectors,
>>>> + is_write, elapsed_time, &bps_wait);
>>>> + iops_ret = bdrv_exceed_iops_limits(bs, is_write,
>>>> + elapsed_time, &iops_wait);
>>>> + if (bps_ret || iops_ret) {
>>>> + max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
>>>> + if (wait) {
>>>> + *wait = max_wait;
>>>> + }
>>>> +
>>>> + real_time = qemu_get_clock_ns(vm_clock);
>>>> + if (bs->slice_end[is_write] < real_time + max_wait) {
>>>> + bs->slice_end[is_write] = real_time + max_wait;
>>>> + }
>>>
>>> Why is this necessary? We have already extended the slice at the top
>>> of the function.
>> Here it adjusts the end of the slice based on the results of
>> bdrv_exceed_bps_limits()/bdrv_exceed_iops_limits().
>> This is more accurate.
>
> After rereading it I realized that this extends the slice up to the
> estimated dispatch time (max_wait) and not just by
> BLOCK_IO_SLICE_TIME. This code now makes sense to me: we're trying to
> keep slices around while there are queued requests so that
> iops/throughput history is not forgotten when queued requests are
> dispatched.
Right.
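With those two changes, the tail of bdrv_exceed_io_limits() would become
roughly the following (sketch only; NANOSECONDS_PER_SECOND is the new
constant suggested above and would live at the top of block.c, the other
names are from the current patch):

#define NANOSECONDS_PER_SECOND 1000000000.0

    /* elapsed_time is in nanoseconds; convert it to seconds */
    elapsed_time  = real_time - bs->slice_start[is_write];
    elapsed_time /= NANOSECONDS_PER_SECOND;

    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors, is_write,
                                      elapsed_time, &bps_wait);
    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
                                       elapsed_time, &iops_wait);
    if (bps_ret || iops_ret) {
        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
        if (wait) {
            *wait = max_wait;
        }

        /* extend the slice up to the estimated dispatch time so that the
         * iops/throughput history is kept while requests are queued */
        real_time = qemu_get_clock_ns(vm_clock);
        if (bs->slice_end[is_write] < real_time + max_wait) {
            bs->slice_end[is_write] = real_time + max_wait;
        }

        return true;
    }

    if (wait) {
        *wait = 0;
    }

    return false;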
>
>>>
>>>> +
>>>> + return true;
>>>> + }
>>>> +
>>>> + if (wait) {
>>>> + *wait = 0;
>>>> + }
>>>> +
>>>> + return false;
>>>> +}
>>>>
>>>> /**************************************************************/
>>>> /* async I/Os */
>>>> @@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>>> {
>>>> BlockDriver *drv = bs->drv;
>>>> BlockDriverAIOCB *ret;
>>>> + uint64_t wait_time = 0;
>>>>
>>>> trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>>>>
>>>> - if (!drv)
>>>> - return NULL;
>>>> - if (bdrv_check_request(bs, sector_num, nb_sectors))
>>>> + if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
>>>> + if (bs->io_limits_enabled) {
>>>> + bs->req_from_queue = false;
>>>> + }
>>>> return NULL;
>>>> + }
>>>> +
>>>> + /* throttling disk read I/O */
>>>> + if (bs->io_limits_enabled) {
>>>> + if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>>>> + ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>>>> + sector_num, qiov, nb_sectors, cb, opaque);
>>>> + qemu_mod_timer(bs->block_timer,
>>>> + wait_time + qemu_get_clock_ns(vm_clock));
>>>
>>> If a guest keeps sending I/O requests, say every millisecond, then we
>>> will delay dispatch forever. In practice we hope the storage
>>> interface (virtio-blk, LSI SCSI, etc) has a resource limit that
>>> prevents this. But the point remains that this delays requests that
>>> should be dispatched for too long. Once the timer has been set we
>>> should not set it again.
>> Can you elaborate on this?
>
> Let's wait until the next version of the patch, I've suggested many
> changes and should comment against current code instead of the way I
> think it is going to work :).
OK.
>
> Stefan

--
Regards,

Zhi Yong Wu
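P.S. For the point about not setting the timer again once it has been armed,
what I have in mind is roughly the following (sketch only; it assumes
qemu_timer_pending() can be used on bs->block_timer, and returning the
queued acb here is my assumption about the surrounding code):

    /* throttling disk read I/O: queue the request, but arm the timer only
     * if it is not already pending, so that a stream of new requests
     * cannot keep pushing the dispatch time further into the future */
    if (bs->io_limits_enabled) {
        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
                                           sector_num, qiov, nb_sectors,
                                           cb, opaque);
            if (!qemu_timer_pending(bs->block_timer)) {
                qemu_mod_timer(bs->block_timer,
                               wait_time + qemu_get_clock_ns(vm_clock));
            }
            return ret;
        }
    }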