Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

On 20.01.2011 14:50, Yoshiaki Tamura wrote:
> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>> On 20.01.2011 11:39, Yoshiaki Tamura wrote:
>>> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>>>> On 20.01.2011 06:19, Yoshiaki Tamura wrote:
>>>>>>>>> +        return;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>>>
>>>>>>>> Same here.
>>>>>>>>
>>>>>>>>> +    bdrv_flush(bs);
>>>>>>>>
>>>>>>>> This looks really strange. What is this supposed to do?
>>>>>>>>
>>>>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>>>> request or if it doesn't. Which data do you want to get flushed to the disk?
>>>>>>>
>>>>>>> I was expecting to flush the aio request that was just initiated.
>>>>>>> Am I misunderstanding the function?
>>>>>>
>>>>>> Seems so. The function names don't use very clear terminology either,
>>>>>> so you're not the first one to fall into this trap. Basically we have:
>>>>>>
>>>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>>>> wanted to have exactly this, but only for a single block device. Such a
>>>>>> function doesn't exist yet.
>>>>>>
>>>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>>>> written to disk (by calling fsync)
>>>>>>
>>>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>>>>>> the fsync in the thread pool
>>>>>
>>>>> Then what I wanted to do is call qemu_aio_flush first, then
>>>>> bdrv_flush.  It should be like live migration.
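
(For the archive, here is roughly that sequence as I understand it.  This is
only a sketch: the helper name is invented, it is not code from the series,
and the header names are from memory.)

#include "qemu-common.h"
#include "qemu-aio.h"
#include "block.h"

/* Sketch only: drain in-flight AIO first, then fsync what has completed,
 * the way live migration does before sending the final state. */
static void flush_before_final_state(BlockDriverState *bs)
{
    qemu_aio_flush();   /* wait for every pending AIO request to complete */
    bdrv_flush(bs);     /* force the completed writes to stable storage */
}

Whether that drain should be global or per device is the open question from
above; such a per-device variant doesn't exist yet.
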
>>>>
>>>> Okay, that makes sense. :-)
>>>>
>>>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>>>> basically forcing everyone to something very similar to writethrough
>>>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>>>
>>>>>>> The reason is to avoid inversion of queued requests.  Although
>>>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>>>> to disk out of order break the disk image?
>>>>>>
>>>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>>>> may complete in any order. You just need to make sure that you don't
>>>>>> call the completion callback before the request really has completed.
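
(To make the "don't signal completion early" point concrete, something along
these lines; purely illustrative, the type and function names are invented
and this is not code from the patches.)

#include "qemu-common.h"
#include "block.h"

/* The block layer calls this only after the request has really finished,
 * so the guest is never told "done" for a request still in flight.
 * Ordering between independent requests doesn't matter, only the
 * truthfulness of the completion does. */
typedef struct TapAIOCB {
    BlockDriverCompletionFunc *cb;  /* completion callback the device model gave us */
    void *opaque;                   /* and its opaque pointer */
} TapAIOCB;

static void tap_aio_complete(void *opaque, int ret)
{
    TapAIOCB *acb = opaque;

    acb->cb(acb->opaque, ret);      /* safe: the request really completed */
    qemu_free(acb);
}

Such a wrapper would simply be passed to bdrv_aio_writev() as the cb/opaque
pair in place of the original one.
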
>>>>>
>>>>> We need to flush requests, meaning aio and fsync, before sending
>>>>> the final state of the guests, to make sure we can switch to the
>>>>> secondary safely.
>>>>
>>>> In theory I think you could just re-submit the requests on the secondary
>>>> if they had not completed yet.
>>>>
>>>> But you're right, let's keep things simple for the start.
>>>>
>>>>>> I'm just starting to wonder if the guest won't time out the requests if
>>>>>> they are queued for too long. What's more, IDE can only handle
>>>>>> one request at a time, so not completing requests doesn't sound like a
>>>>>> good idea at all. At what intervals is the event-tap queue flushed?
>>>>>
>>>>> The requests are flushed once each transaction completes, so
>>>>> they aren't flushed at fixed intervals.
>>>>
>>>> Right. So when is a transaction completed? This is the time that a
>>>> single request will take.
>>>
>>> The transaction is completed when the vm state is sent to the
>>> secondary and the primary receives the ack for it.  Please let me
>>> know if the answer is too vague.  What I can tell is that it
>>> can't be super fast.
>>>
>>>>>> On the other hand, if you complete before actually writing out, you
>>>>>> don't get timeouts, but you signal success to the guest when the request
>>>>>> could still fail. What would you do in this case? With a writeback cache
>>>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>>>> cache=writethrough we're in serious trouble.
>>>>>>
>>>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>>>> event-tap queue for each single write in writethrough mode.
>>>>>
>>>>> Yes, and that's what I'm trying to do at this point.
>>>>
>>>> Oh, I must have missed that code. Which patch/function should I look at?
>>>
>>> Maybe I mis-answered your question.  The device may see
>>> timeouts.
>>
>> We should pay attention that the guest does not see timeouts. I'm not
>> expecting that I/O will be super fast, and as long as it is only a
>> performance problem we can live with it.
>>
>> However, as soon as the guest gets timeouts it reports I/O errors and
>> eventually offlines the block device. At this point it's not a
>> performance problem any more, but also a correctness problem.
>>
>> This is why I suggested that we flush the event-tap queue (i.e. complete
>> the transaction) immediately after an I/O request has been issued
>> instead of waiting for other events that would complete the transaction.
> 
> Right.  event-tap doesn't queue at a specific interval.  It'll
> schedule the transaction as a bh once events are tapped.  The
> purpose of the queue is to store requests initiated while the
> transaction is in flight.

Ok, now I got it. :-)

So the patches are already doing the best we can do.
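
(Just to spell out what I understood, a stripped-down sketch of that pattern;
all names are invented, the header is from memory, and the real event-tap
code surely looks different.)

#include "qemu-common.h"

/* The first tapped event kicks a bottom half that starts the transaction;
 * anything arriving while it is in flight is merely queued and replayed
 * once the secondary has acked. */
static QEMUBH *event_tap_bh;
static int transaction_in_flight;

static void start_transaction_bh(void *opaque)
{
    /* send the synchronization point (vm state) to the secondary ... */
}

static void event_tap_setup(void)
{
    event_tap_bh = qemu_bh_new(start_transaction_bh, NULL);
}

static void on_tapped_event(void)
{
    if (transaction_in_flight) {
        /* just queue the request; a transaction is already on its way */
        return;
    }
    transaction_in_flight = 1;
    qemu_bh_schedule(event_tap_bh);   /* runs start_transaction_bh() soon */
}
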

> So I believe the current implementation should be doing
> what you're expecting.  However, if the guest dirtied a huge amount
> of RAM and initiated block requests, we may get timeouts even if we
> started the transaction right away.

Right. We'll have to live with that for now. If it happens, bad luck.

Kevin

