Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

Kevin Wolf <kwolf@xxxxxxxxxx> · Wed, 19 Jan 2011 14:50:57 +0100

Am 19.01.2011 14:04, schrieb Yoshiaki Tamura:
>>> +static void event_tap_blk_flush(EventTapBlkReq *blk_req)
>>> +{
>>> +    BlockDriverState *bs;
>>> +
>>> +    bs = bdrv_find(blk_req->device_name);
>>
>> Please store the BlockDriverState in blk_req. This code loops over all
>> block devices and does a string comparison - and that for each request.
>> You can also save the qemu_strdup() when creating the request.
>>
>> In the few places where you really need the device name (might be the
>> case for load/save, I'm not sure), you can still get it from the
>> BlockDriverState.
> 
> I would do so for the primary side.  Although we haven't
> implemented yet, we want to replay block requests from block
> layer on the secondary side, and need device name to restore
> BlockDriverState.

Hm, I see. I'm not happy about it, but I don't have a suggestion right
away how to avoid it.

>>
>>> +
>>> +    if (blk_req->is_flush) {
>>> +        bdrv_aio_flush(bs, blk_req->reqs[0].cb, blk_req->reqs[0].opaque);
>>
>> You need to handle errors. If bdrv_aio_flush returns NULL, call the
>> callback with -EIO.
> 
> I'll do so.
> 
>>
>>> +        return;
>>> +    }
>>> +
>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>> +                    blk_req->reqs[0].opaque);
>>
>> Same here.
>>
>>> +    bdrv_flush(bs);
>>
>> This looks really strange. What is this supposed to do?
>>
>> One point is that you write it immediately after bdrv_aio_write, so you
>> get an fsync for which you don't know if it includes the current write
>> request or if it doesn't. Which data do you want to get flushed to the disk?
> 
> I was expecting to flush the aio request that was just initiated.
> Am I misunderstanding the function?

Seems so. The function names don't use really clear terminology either,
so you're not the first one to fall in this trap. Basically we have:

* qemu_aio_flush() waits for all AIO requests to complete. I think you
wanted to have exactly this, but only for a single block device. Such a
function doesn't exist yet.

* bdrv_flush() makes sure that all successfully completed requests are
written to disk (by calling fsync)

* bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
the fsync in the thread pool

>> The other thing is that you introduce a bdrv_flush for each request,
>> basically forcing everyone to something very similar to writethrough
>> mode. I'm sure this will have a big impact on performance.
> 
> The reason is to avoid inversion of queued requests.  Although
> processing one-by-one is heavy, wouldn't having requests flushed
> to disk out of order break the disk image?

No, that's fine. If a guest issues two requests at the same time, they
may complete in any order. You just need to make sure that you don't
call the completion callback before the request really has completed.

I'm just starting to wonder if the guest won't timeout the requests if
they are queued for too long. Even more, with IDE, it can only handle
one request at a time, so not completing requests doesn't sound like a
good idea at all. In what intervals is the event-tap queue flushed?

On the other hand, if you complete before actually writing out, you
don't get timeouts, but you signal success to the guest when the request
could still fail. What would you do in this case? With a writeback cache
mode we're fine, we can just fail the next flush (until then nothing is
guaranteed to be on disk and order doesn't matter either), but with
cache=writethrough we're in serious trouble.

Have you thought about this problem? Maybe we end up having to flush the
event-tap queue for each single write in writethrough mode.

>> Additionally, error handling is missing.
> 
> I looked at the codes using bdrv_flush and realized some of them
> doesn't handle errors, but scsi-disk.c does.  Should everyone
> handle errors or depends on the usage?

I added the return code only recently, it was a void function
previously. Probably some error handling should be added to all of them.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html