2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
> Am 20.01.2011 14:50, schrieb Yoshiaki Tamura:
>> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>>> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
>>>> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>>>>> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
>>>>>>>>>> +        return;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>>>>
>>>>>>>>> Same here.
>>>>>>>>>
>>>>>>>>>> +    bdrv_flush(bs);
>>>>>>>>>
>>>>>>>>> This looks really strange. What is this supposed to do?
>>>>>>>>>
>>>>>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>>>>> request or if it doesn't. Which data do you want to get flushed to the disk?
>>>>>>>>
>>>>>>>> I was expecting to flush the aio request that was just initiated.
>>>>>>>> Am I misunderstanding the function?
>>>>>>>
>>>>>>> Seems so. The function names don't use really clear terminology either,
>>>>>>> so you're not the first one to fall into this trap. Basically we have:
>>>>>>>
>>>>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>>>>> wanted to have exactly this, but only for a single block device. Such a
>>>>>>> function doesn't exist yet.
>>>>>>>
>>>>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>>>>> written to disk (by calling fsync)
>>>>>>>
>>>>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it runs
>>>>>>> the fsync in the thread pool
>>>>>>
>>>>>> Then what I wanted to do is call qemu_aio_flush() first, then
>>>>>> bdrv_flush(). It should be like live migration.
>>>>>
>>>>> Okay, that makes sense. :-)
>>>>>
>>>>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>>>>> basically forcing everyone into something very similar to writethrough
>>>>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>>>>
>>>>>>>> The reason is to avoid inversion of queued requests. Although
>>>>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>>>>> to disk out of order break the disk image?
>>>>>>>
>>>>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>>>>> may complete in any order. You just need to make sure that you don't
>>>>>>> call the completion callback before the request really has completed.
>>>>>>
>>>>>> We need to flush requests, meaning aio and fsync, before sending
>>>>>> the final state of the guests, to make sure we can switch to the
>>>>>> secondary safely.
>>>>>
>>>>> In theory I think you could just re-submit the requests on the secondary
>>>>> if they had not completed yet.
>>>>>
>>>>> But you're right, let's keep things simple for the start.
>>>>>
>>>>>>> I'm just starting to wonder if the guest won't time out the requests if
>>>>>>> they are queued for too long. Even worse, with IDE, it can only handle
>>>>>>> one request at a time, so not completing requests doesn't sound like a
>>>>>>> good idea at all. At what intervals is the event-tap queue flushed?
>>>>>>
>>>>>> The requests are flushed once each transaction completes. So
>>>>>> it's not at specific intervals.
>>>>>
>>>>> Right. So when is a transaction completed? This is the time that a
>>>>> single request will take.
>>>>
>>>> The transaction is completed when the vm state is sent to the
>>>> secondary, and the primary receives the ack to it. Please let me
>>>> know if the answer is too vague. What I can tell is that it
>>>> can't be super fast.
>>>>
>>>>>>> On the other hand, if you complete before actually writing out, you
>>>>>>> don't get timeouts, but you signal success to the guest when the request
>>>>>>> could still fail. What would you do in this case? With a writeback cache
>>>>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>>>>> cache=writethrough we're in serious trouble.
>>>>>>>
>>>>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>>>>> event-tap queue for each single write in writethrough mode.
>>>>>>
>>>>>> Yes, and that's what I'm trying to do at this point.
>>>>>
>>>>> Oh, I must have missed that code. Which patch/function should I look at?
>>>>
>>>> Maybe I mis-answered your question. The device may receive
>>>> timeouts.
>>>
>>> We should take care that the guest does not see timeouts. I'm not
>>> expecting that I/O will be super fast, and as long as it is only a
>>> performance problem we can live with it.
>>>
>>> However, as soon as the guest gets timeouts it reports I/O errors and
>>> eventually offlines the block device. At this point it's not just a
>>> performance problem any more, but also a correctness problem.
>>>
>>> This is why I suggested that we flush the event-tap queue (i.e. complete
>>> the transaction) immediately after an I/O request has been issued,
>>> instead of waiting for other events that would complete the transaction.
>>
>> Right. event-tap doesn't queue at a specific interval. It'll
>> schedule the transaction as a bh once events are tapped. The
>> purpose of the queue is to store requests initiated while the
>> transaction is in progress.
>
> Ok, now I got it. :-)
>
> So the patches are already doing the best we can do.
>
>> So I believe the current implementation should be doing what
>> you're expecting. However, if the guest dirtied a huge amount
>> of ram and initiated block requests, we may get timeouts even if
>> we started the transaction right away.
>
> Right. We'll have to live with that for now. If it happens, bad luck.

Yeah. But I'll keep thinking about how to remove that limitation :)

Yoshi

>
> Kevin
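
For reference, a minimal sketch of the drain-then-fsync sequence discussed
above (qemu_aio_flush() followed by bdrv_flush()), as it could run on the
primary before the final vmstate is sent. The helper name is made up for
illustration and is not taken from the actual Kemari patches; only the two
block-layer functions named in the thread are assumed:

/* Illustrative sketch only, not the real patch code.  Drain all
 * in-flight AIO requests, then fsync what has completed, so the
 * secondary can take over safely.  qemu_aio_flush() is global; a
 * per-device variant does not exist yet, as noted in the thread.
 * Error handling is omitted. */
static void event_tap_drain_and_flush(BlockDriverState *bs)
{
    qemu_aio_flush();   /* wait for all outstanding AIO requests */
    bdrv_flush(bs);     /* flush the completed writes to the disk */
}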
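
A hypothetical sketch of the queueing behaviour described above, i.e. no
interval-based batching: a tapped request schedules the transaction as a
bottom half right away, and requests arriving while a transaction is in
flight are queued until it completes. All names below are illustrative and
assume only the generic QEMU bottom-half API, not the actual event-tap code:

/* Illustrative sketch only.  transaction_bh is assumed to run the
 * Kemari transaction and, on completion, to clear
 * transaction_in_flight and replay the queued requests (not shown). */
typedef struct EventTapReq {
    struct EventTapReq *next;
    /* the tapped block/net request would be stored here */
} EventTapReq;

static EventTapReq *pending_head, *pending_tail;
static bool transaction_in_flight;
static QEMUBH *transaction_bh;

static void event_tap_on_request(EventTapReq *req)
{
    if (transaction_in_flight) {
        /* A transaction is already running: park the request until
         * the transaction completes. */
        req->next = NULL;
        if (pending_tail) {
            pending_tail->next = req;
        } else {
            pending_head = req;
        }
        pending_tail = req;
        return;
    }

    /* Start the transaction immediately from a bottom half; there is
     * no timer or interval involved. */
    transaction_in_flight = true;
    qemu_bh_schedule(transaction_bh);
}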