2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
> Am 20.01.2011 14:50, schrieb Yoshiaki Tamura:
>> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>>> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
>>>> 2011/1/20 Kevin Wolf <kwolf@xxxxxxxxxx>:
>>>>> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
>>>>>>>>>> +        return;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>>>>>
>>>>>>>>> Same here.
>>>>>>>>>
>>>>>>>>>> +    bdrv_flush(bs);
>>>>>>>>>
>>>>>>>>> This looks really strange. What is this supposed to do?
>>>>>>>>>
>>>>>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>>>>>> get an fsync for which you don't know if it includes the current write
>>>>>>>>> request or if it doesn't. Which data do you want to get flushed to the disk?
>>>>>>>>
>>>>>>>> I was expecting to flush the aio request that was just initiated.
>>>>>>>> Am I misunderstanding the function?
>>>>>>>
>>>>>>> Seems so. The function names don't use really clear terminology either,
>>>>>>> so you're not the first one to fall into this trap. Basically we have:
>>>>>>>
>>>>>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>>>>>> wanted to have exactly this, but only for a single block device. Such a
>>>>>>> function doesn't exist yet.
>>>>>>>
>>>>>>> * bdrv_flush() makes sure that all successfully completed requests are
>>>>>>> written to disk (by calling fsync)
>>>>>>>
>>>>>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it runs
>>>>>>> the fsync in the thread pool
>>>>>>
>>>>>> Then what I wanted to do is call qemu_aio_flush() first, then
>>>>>> bdrv_flush(). It should be like live migration.
>>>>>
>>>>> Okay, that makes sense. :-)
>>>>>
>>>>>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>>>>>> basically forcing everyone into something very similar to writethrough
>>>>>>>>> mode. I'm sure this will have a big impact on performance.
>>>>>>>>
>>>>>>>> The reason is to avoid inversion of queued requests. Although
>>>>>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>>>>>> to disk out of order break the disk image?
>>>>>>>
>>>>>>> No, that's fine. If a guest issues two requests at the same time, they
>>>>>>> may complete in any order. You just need to make sure that you don't
>>>>>>> call the completion callback before the request really has completed.
>>>>>>
>>>>>> We need to flush requests, meaning aio and fsync, before sending
>>>>>> the final state of the guests, to make sure we can switch to the
>>>>>> secondary safely.
>>>>>
>>>>> In theory I think you could just re-submit the requests on the secondary
>>>>> if they had not completed yet.
>>>>>
>>>>> But you're right, let's keep things simple for the start.
>>>>>
>>>>>>> I'm just starting to wonder if the guest won't time out the requests if
>>>>>>> they are queued for too long. Even worse, with IDE, it can only handle
>>>>>>> one request at a time, so not completing requests doesn't sound like a
>>>>>>> good idea at all. At what intervals is the event-tap queue flushed?
>>>>>>
>>>>>> The requests are flushed once each transaction completes. So
>>>>>> it's not at specific intervals.
>>>>>
>>>>> Right. So when is a transaction completed? This is the time that a
>>>>> single request will take.
>>>>
>>>> The transaction is completed when the vm state is sent to the
>>>> secondary, and the primary receives the ack to it. Please let me
>>>> know if the answer is too vague. What I can tell is that it
>>>> can't be super fast.
>>>>
>>>>>>> On the other hand, if you complete before actually writing out, you
>>>>>>> don't get timeouts, but you signal success to the guest when the request
>>>>>>> could still fail. What would you do in this case? With a writeback cache
>>>>>>> mode we're fine, we can just fail the next flush (until then nothing is
>>>>>>> guaranteed to be on disk and order doesn't matter either), but with
>>>>>>> cache=writethrough we're in serious trouble.
>>>>>>>
>>>>>>> Have you thought about this problem? Maybe we end up having to flush the
>>>>>>> event-tap queue for each single write in writethrough mode.
>>>>>>
>>>>>> Yes, and that's what I'm trying to do at this point.
>>>>>
>>>>> Oh, I must have missed that code. Which patch/function should I look at?
>>>>
>>>> Maybe I mis-answered your question. The device may receive
>>>> timeouts.
>>>
>>> We should take care that the guest does not see timeouts. I'm not
>>> expecting that I/O will be super fast, and as long as it is only a
>>> performance problem we can live with it.
>>>
>>> However, as soon as the guest gets timeouts it reports I/O errors and
>>> eventually offlines the block device. At this point it's not just a
>>> performance problem any more, but also a correctness problem.
>>>
>>> This is why I suggested that we flush the event-tap queue (i.e. complete
>>> the transaction) immediately after an I/O request has been issued,
>>> instead of waiting for other events that would complete the transaction.
>>
>> Right. event-tap doesn't queue at a specific interval. It'll
>> schedule the transaction as a bh once events are tapped. The
>> purpose of the queue is to store requests initiated while the
>> transaction is in progress.
>
> Ok, now I got it. :-)
>
> So the patches are already doing the best we can do.
>
>> So I believe the current implementation should be doing what
>> you're expecting. However, if the guest dirtied a huge amount
>> of ram and initiated block requests, we may get timeouts even if
>> we started the transaction right away.
>
> Right. We'll have to live with that for now. If it happens, bad luck.

Yeah. But I'll keep thinking about how to remove that limitation :)

Yoshi

>
> Kevin
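
For reference, a minimal sketch of the drain-then-fsync sequence discussed
above (qemu_aio_flush() followed by bdrv_flush()), as it could run on the
primary before the final vmstate is sent. The helper name is made up for
illustration and is not taken from the actual Kemari patches; only the two
block-layer functions named in the thread are assumed:

/* Illustrative sketch only, not the real patch code.  Drain all
 * in-flight AIO requests, then fsync what has completed, so the
 * secondary can take over safely.  qemu_aio_flush() is global; a
 * per-device variant does not exist yet, as noted in the thread.
 * Error handling is omitted. */
static void event_tap_drain_and_flush(BlockDriverState *bs)
{
    qemu_aio_flush();   /* wait for all outstanding AIO requests */
    bdrv_flush(bs);     /* flush the completed writes to the disk */
}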
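
A hypothetical sketch of the queueing behaviour described above, i.e. no
interval-based batching: a tapped request schedules the transaction as a
bottom half right away, and requests arriving while a transaction is in
flight are queued until it completes. All names below are illustrative and
assume only the generic QEMU bottom-half API, not the actual event-tap code:

/* Illustrative sketch only.  transaction_bh is assumed to run the
 * Kemari transaction and, on completion, to clear
 * transaction_in_flight and replay the queued requests (not shown). */
typedef struct EventTapReq {
    struct EventTapReq *next;
    /* the tapped block/net request would be stored here */
} EventTapReq;

static EventTapReq *pending_head, *pending_tail;
static bool transaction_in_flight;
static QEMUBH *transaction_bh;

static void event_tap_on_request(EventTapReq *req)
{
    if (transaction_in_flight) {
        /* A transaction is already running: park the request until
         * the transaction completes. */
        req->next = NULL;
        if (pending_tail) {
            pending_tail->next = req;
        } else {
            pending_head = req;
        }
        pending_tail = req;
        return;
    }

    /* Start the transaction immediately from a bottom half; there is
     * no timer or interval involved. */
    transaction_in_flight = true;
    qemu_bh_schedule(transaction_bh);
}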