On 20.01.2011 14:50, Yoshiaki Tamura wrote:
> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>> On 20.01.2011 11:39, Yoshiaki Tamura wrote:
>>> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>>>> On 20.01.2011 06:19, Yoshiaki Tamura wrote:
>>>>>>>>> +        return;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>>>
>>>>>>>> Same here.
>>>>>>>>
>>>>>>>>> +    bdrv_flush(bs);
>>>>>>>>
>>>>>>>> This looks really strange. What is this supposed to do?
>>>>>>>>
>>>>>>>> One point is that you write it immediately after bdrv_aio_writev, so you
>>>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>>>> request or if it doesn't. Which data do you want to get flushed to the
>>>>>>>> disk?
>>>>>>>
>>>>>>> I was expecting to flush the aio request that was just initiated.
>>>>>>> Am I misunderstanding the function?
>>>>>>
>>>>>> Seems so. The function names don't use really clear terminology either,
>>>>>> so you're not the first one to fall into this trap. Basically we have:
>>>>>>
>>>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>>>> wanted to have exactly this, but only for a single block device. Such a
>>>>>> function doesn't exist yet.
>>>>>>
>>>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>>>> written to disk (by calling fsync)
>>>>>>
>>>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it
>>>>>> runs the fsync in the thread pool
>>>>>
>>>>> Then what I wanted to do is call qemu_aio_flush first, then
>>>>> bdrv_flush. It should be like live migration.
>>>>
>>>> Okay, that makes sense. :-)
>>>>
>>>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>>>> basically forcing everyone into something very similar to writethrough
>>>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>>>
>>>>>>> The reason is to avoid inversion of queued requests. Although
>>>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>>>> to disk out of order break the disk image?
>>>>>>
>>>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>>>> may complete in any order. You just need to make sure that you don't
>>>>>> call the completion callback before the request really has completed.
>>>>>
>>>>> We need to flush requests, meaning aio and fsync, before sending
>>>>> the final state of the guests, to make sure we can switch to the
>>>>> secondary safely.
>>>>
>>>> In theory I think you could just re-submit the requests on the secondary
>>>> if they had not completed yet.
>>>>
>>>> But you're right, let's keep things simple for the start.
>>>>
>>>>>> I'm just starting to wonder if the guest won't time out the requests if
>>>>>> they are queued for too long. Even more, with IDE, it can only handle
>>>>>> one request at a time, so not completing requests doesn't sound like a
>>>>>> good idea at all. At what intervals is the event-tap queue flushed?
>>>>>
>>>>> The requests are flushed once each transaction completes. So
>>>>> it's not at specific intervals.
>>>>
>>>> Right. So when is a transaction completed? This is the time that a
>>>> single request will take.
>>>
>>> The transaction is completed when the vm state is sent to the
>>> secondary, and the primary receives the ack for it. Please let me
>>> know if the answer is too vague. What I can tell is that it
>>> can't be super fast.
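
For reference, a minimal sketch of the flush order discussed above (drain
in-flight AIO first, then fsync) before the primary sends the final VM
state. The helper name is hypothetical; note that qemu_aio_flush() drains
all devices, since a per-device variant doesn't exist yet.

#include "qemu-common.h"
#include "block.h"

/* Hypothetical helper: wait for every in-flight AIO request to complete,
 * then fsync so the successfully completed writes are stable on disk. */
static void flush_before_checkpoint(BlockDriverState *bs)
{
    qemu_aio_flush();   /* waits for all pending AIO, not just bs */
    bdrv_flush(bs);     /* fsync the completed requests for this device */
}
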
>>>
>>>>>> On the other hand, if you complete before actually writing out, you
>>>>>> don't get timeouts, but you signal success to the guest when the request
>>>>>> could still fail. What would you do in this case? With a writeback cache
>>>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>>>> cache=writethrough we're in serious trouble.
>>>>>>
>>>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>>>> event-tap queue for each single write in writethrough mode.
>>>>>
>>>>> Yes, and that's what I'm trying to do at this point.
>>>>
>>>> Oh, I must have missed that code. Which patch/function should I look at?
>>>
>>> Maybe I mis-answered your question. The device may receive
>>> timeouts.
>>
>> We should make sure that the guest does not see timeouts. I'm not
>> expecting that I/O will be super fast, and as long as it is only a
>> performance problem we can live with it.
>>
>> However, as soon as the guest gets timeouts it reports I/O errors and
>> eventually offlines the block device. At this point it's no longer just a
>> performance problem, but also a correctness problem.
>>
>> This is why I suggested that we flush the event-tap queue (i.e. complete
>> the transaction) immediately after an I/O request has been issued
>> instead of waiting for other events that would complete the transaction.
>
> Right. event-tap doesn't queue at a specific interval. It'll
> schedule the transaction as a bh once events are tapped. The
> purpose of the queue is to store requests initiated during the
> transaction.

Ok, now I got it. :-) So the patches are already doing the best we can do.

> So I believe the current implementation should be doing
> what you're expecting. However, if the guest dirtied a huge amount
> of RAM and initiated block requests, we may get timeouts even if we
> started the transaction right away.

Right. We'll have to live with that for now. If it happens, bad luck.

Kevin
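
For reference, a rough sketch of the scheduling model described above, with
illustrative names (not the actual Kemari event-tap code): a tapped request
is queued and the transaction bottom half is kicked right away, with no
fixed interval; requests that arrive while a transaction is in flight stay
queued until the ack from the secondary comes back.

#include "qemu-common.h"
#include "qemu-queue.h"

typedef struct EventTapReq {
    QTAILQ_ENTRY(EventTapReq) node;
    /* the tapped block/net request would be stored here */
} EventTapReq;

typedef struct EventTapState {
    QEMUBH *bh;                        /* runs the checkpoint transaction */
    bool in_transaction;               /* VM state sent, waiting for ack  */
    QTAILQ_HEAD(, EventTapReq) queue;  /* requests tapped in the meantime */
} EventTapState;

static void event_tap_request(EventTapState *s, EventTapReq *req)
{
    QTAILQ_INSERT_TAIL(&s->queue, req, node);
    if (!s->in_transaction) {
        /* No fixed interval: schedule the transaction bh as soon as an
         * event is tapped. */
        qemu_bh_schedule(s->bh);
    }
    /* Requests arriving during a transaction stay queued and are flushed
     * once the primary receives the ack from the secondary. */
}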