2011/1/19 Kevin Wolf <kwolf@xxxxxxxxxx>:
> Am 19.01.2011 14:04, schrieb Yoshiaki Tamura:
>>>> +static void event_tap_blk_flush(EventTapBlkReq *blk_req)
>>>> +{
>>>> +    BlockDriverState *bs;
>>>> +
>>>> +    bs = bdrv_find(blk_req->device_name);
>>>
>>> Please store the BlockDriverState in blk_req. This code loops over all
>>> block devices and does a string comparison - and that for each request.
>>> You can also save the qemu_strdup() when creating the request.
>>>
>>> In the few places where you really need the device name (might be the
>>> case for load/save, I'm not sure), you can still get it from the
>>> BlockDriverState.
>>
>> I would do so for the primary side. Although we haven't implemented it
>> yet, we want to replay block requests from the block layer on the
>> secondary side, and we need the device name there to restore the
>> BlockDriverState.
>
> Hm, I see. I'm not happy about it, but I don't have a suggestion right
> away how to avoid it.
>
>>>
>>>> +
>>>> +    if (blk_req->is_flush) {
>>>> +        bdrv_aio_flush(bs, blk_req->reqs[0].cb, blk_req->reqs[0].opaque);
>>>
>>> You need to handle errors. If bdrv_aio_flush returns NULL, call the
>>> callback with -EIO.
>>
>> I'll do so.
>>
>>>
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>> +                    blk_req->reqs[0].opaque);
>>>
>>> Same here.
>>>
>>>> +    bdrv_flush(bs);
>>>
>>> This looks really strange. What is this supposed to do?
>>>
>>> One point is that you write it immediately after bdrv_aio_writev, so you
>>> get an fsync for which you don't know if it includes the current write
>>> request or if it doesn't. Which data do you want to get flushed to the
>>> disk?
>>
>> I was expecting it to flush the aio request that was just initiated.
>> Am I misunderstanding the function?
>
> Seems so. The function names don't use really clear terminology either,
> so you're not the first one to fall into this trap. Basically we have:
>
> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>   wanted to have exactly this, but only for a single block device. Such
>   a function doesn't exist yet.
>
> * bdrv_flush() makes sure that all successfully completed requests are
>   written to disk (by calling fsync)
>
> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it
>   runs the fsync in the thread pool

Then what I wanted to do is call qemu_aio_flush() first and then
bdrv_flush(). It should be like live migration.
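To make sure I finally have the semantics right, something roughly like
the untested sketch below is what I have in mind. event_tap_flush_bs()
is just a made-up name for illustration, and since there is no
per-device wait yet, it falls back to the global qemu_aio_flush():

/* Untested sketch, not part of the patch: wait for in-flight AIO to
 * complete, then make sure the completed requests are on disk, similar
 * to what live migration does before sending the final state. */
static int event_tap_flush_bs(BlockDriverState *bs)
{
    /* No per-device variant exists yet, so this waits for all devices. */
    qemu_aio_flush();

    /* fsync the completed requests; returns < 0 on error. */
    return bdrv_flush(bs);
}

If a per-device wait gets added later, only the first call would need
to change.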
>
>>> The other thing is that you introduce a bdrv_flush for each request,
>>> basically forcing everyone to something very similar to writethrough
>>> mode. I'm sure this will have a big impact on performance.
>>
>> The reason is to avoid inversion of queued requests. Although
>> processing them one-by-one is heavy, wouldn't having requests flushed
>> to disk out of order break the disk image?
>
> No, that's fine. If a guest issues two requests at the same time, they
> may complete in any order. You just need to make sure that you don't
> call the completion callback before the request really has completed.

We need to flush the requests, meaning both aio completion and fsync,
before sending the final state of the guest, to make sure we can switch
to the secondary safely.

> I'm just starting to wonder if the guest won't time out the requests if
> they are queued for too long. Even more, with IDE, it can only handle
> one request at a time, so not completing requests doesn't sound like a
> good idea at all. In what intervals is the event-tap queue flushed?

The requests are flushed once each transaction completes, so it's not
at fixed intervals.

> On the other hand, if you complete before actually writing out, you
> don't get timeouts, but you signal success to the guest when the request
> could still fail. What would you do in this case? With a writeback cache
> mode we're fine, we can just fail the next flush (until then nothing is
> guaranteed to be on disk and order doesn't matter either), but with
> cache=writethrough we're in serious trouble.
>
> Have you thought about this problem? Maybe we end up having to flush the
> event-tap queue for each single write in writethrough mode.

Yes, and that's what I'm trying to do at this point. I know that
performance matters a lot, but sacrificing reliability for performance
isn't a good idea right now. I first want to lay the groundwork and then
focus on optimization. Note that without the dirty bitmap optimization,
Kemari suffers a lot when sending RAM. Anthony and I discussed taking
this approach at KVM Forum.

>>> Additionally, error handling is missing.
>>
>> I looked at the code that uses bdrv_flush and realized that some
>> callers don't handle errors, but scsi-disk.c does. Should every caller
>> handle errors, or does it depend on the usage?
>
> I added the return code only recently, it was a void function
> previously. Probably some error handling should be added to all of them.

Ah:) Glad to hear that. (A rough sketch of the check I have in mind for
the event-tap side is at the bottom of this mail.)

Yoshi
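P.S. Regarding the bdrv_flush() return code, the check I have in mind
for the event-tap side is just the obvious one below. It's an untested
illustration only, event_tap_check_flush() is a made-up name, and how to
propagate the error (fail the event-tap transaction, or complete the
request with the error code) is still an open question:

/* Untested sketch, not part of the patch. */
static void event_tap_check_flush(BlockDriverState *bs)
{
    int ret = bdrv_flush(bs);

    if (ret < 0) {
        /* Don't let the failure go unnoticed; for now just report it.
         * Propagating it to the guest/transaction is still TODO. */
        error_report("event-tap: bdrv_flush failed: %s", strerror(-ret));
    }
}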