On Wed, Nov 21, 2012 at 12:24:55PM +0800, Asias He wrote: > On 11/20/2012 09:37 PM, Michael S. Tsirkin wrote: > > On Tue, Nov 20, 2012 at 02:39:40PM +0800, Asias He wrote: > >> On 11/20/2012 04:26 AM, Michael S. Tsirkin wrote: > >>> On Mon, Nov 19, 2012 at 04:53:42PM +0800, Asias He wrote: > >>>> vhost-blk is an in-kernel virito-blk device accelerator. > >>>> > >>>> Due to lack of proper in-kernel AIO interface, this version converts > >>>> guest's I/O request to bio and use submit_bio() to submit I/O directly. > >>>> So this version any supports raw block device as guest's disk image, > >>>> e.g. /dev/sda, /dev/ram0. We can add file based image support to > >>>> vhost-blk once we have in-kernel AIO interface. There are some work in > >>>> progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown: > >>>> > >>>> http://marc.info/?l=linux-fsdevel&m=133312234313122 > >>>> > >>>> Performance evaluation: > >>>> ----------------------------- > >>>> 1) LKVM > >>>> Fio with libaio ioengine on Fusion IO device using kvm tool > >>>> IOPS(k) Before After Improvement > >>>> seq-read 107 121 +13.0% > >>>> seq-write 130 179 +37.6% > >>>> rnd-read 102 122 +19.6% > >>>> rnd-write 125 159 +27.0% > >>>> > >>>> 2) QEMU > >>>> Fio with libaio ioengine on Fusion IO device using QEMU > >>>> IOPS(k) Before After Improvement > >>>> seq-read 76 123 +61.8% > >>>> seq-write 139 173 +24.4% > >>>> rnd-read 73 120 +64.3% > >>>> rnd-write 75 156 +108.0% > >>> > >>> Could you compare with dataplane qemu as well please? > >> > >> > >> Well, I will try to collect it. > >> > >>> > >>>> > >>>> Userspace bits: > >>>> ----------------------------- > >>>> 1) LKVM > >>>> The latest vhost-blk userspace bits for kvm tool can be found here: > >>>> git@xxxxxxxxxx:asias/linux-kvm.git blk.vhost-blk > >>>> > >>>> 2) QEMU > >>>> The latest vhost-blk userspace prototype for QEMU can be found here: > >>>> git@xxxxxxxxxx:asias/qemu.git blk.vhost-blk > >>>> > >>>> Changes in v5: > >>>> - Do not assume the buffer layout > >>>> - Fix wakeup race > >>>> > >>>> Changes in v4: > >>>> - Mark req->status as userspace pointer > >>>> - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status() > >>>> - Add if (need_resched()) schedule() in blk thread > >>>> - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop() > >>>> - Use vq_err() instead of pr_warn() > >>>> - Fail un Unsupported request > >>>> - Add flush in vhost_blk_set_features() > >>>> > >>>> Changes in v3: > >>>> - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph! > >>>> - Check file passed by user is a raw block device file > >>>> > >>>> Signed-off-by: Asias He <asias@xxxxxxxxxx> > >>> > >>> Since there are files shared by this and vhost net > >>> it's easiest for me to merge this all through the > >>> vhost tree. > >>> > >>> Jens, could you ack this and the bio usage in this driver > >>> please? > >>> > >>>> --- > >>>> drivers/vhost/Kconfig | 1 + > >>>> drivers/vhost/Kconfig.blk | 10 + > >>>> drivers/vhost/Makefile | 2 + > >>>> drivers/vhost/blk.c | 697 ++++++++++++++++++++++++++++++++++++++++++++++ > >>>> drivers/vhost/blk.h | 8 + > >>>> 5 files changed, 718 insertions(+) > >>>> create mode 100644 drivers/vhost/Kconfig.blk > >>>> create mode 100644 drivers/vhost/blk.c > >>>> create mode 100644 drivers/vhost/blk.h > >>>> > >>>> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig > >>>> index 202bba6..acd8038 100644 > >>>> --- a/drivers/vhost/Kconfig > >>>> +++ b/drivers/vhost/Kconfig > >>>> @@ -11,4 +11,5 @@ config VHOST_NET > >>>> > >>>> if STAGING > >>>> source "drivers/vhost/Kconfig.tcm" > >>>> +source "drivers/vhost/Kconfig.blk" > >>>> endif > >>>> diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk > >>>> new file mode 100644 > >>>> index 0000000..ff8ab76 > >>>> --- /dev/null > >>>> +++ b/drivers/vhost/Kconfig.blk > >>>> @@ -0,0 +1,10 @@ > >>>> +config VHOST_BLK > >>>> + tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)" > >>>> + depends on BLOCK && EXPERIMENTAL && m > >>>> + ---help--- > >>>> + This kernel module can be loaded in host kernel to accelerate > >>>> + guest block with virtio_blk. Not to be confused with virtio_blk > >>>> + module itself which needs to be loaded in guest kernel. > >>>> + > >>>> + To compile this driver as a module, choose M here: the module will > >>>> + be called vhost_blk. > >>>> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile > >>>> index a27b053..1a8a4a5 100644 > >>>> --- a/drivers/vhost/Makefile > >>>> +++ b/drivers/vhost/Makefile > >>>> @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o > >>>> vhost_net-y := vhost.o net.o > >>>> > >>>> obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o > >>>> +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o > >>>> +vhost_blk-y := blk.o > >>>> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c > >>>> new file mode 100644 > >>>> index 0000000..f0f118a > >>>> --- /dev/null > >>>> +++ b/drivers/vhost/blk.c > >>>> @@ -0,0 +1,697 @@ > >>>> +/* > >>>> + * Copyright (C) 2011 Taobao, Inc. > >>>> + * Author: Liu Yuan <tailai.ly@xxxxxxxxxx> > >>>> + * > >>>> + * Copyright (C) 2012 Red Hat, Inc. > >>>> + * Author: Asias He <asias@xxxxxxxxxx> > >>>> + * > >>>> + * This work is licensed under the terms of the GNU GPL, version 2. > >>>> + * > >>>> + * virtio-blk server in host kernel. > >>>> + */ > >>>> + > >>>> +#include <linux/miscdevice.h> > >>>> +#include <linux/module.h> > >>>> +#include <linux/vhost.h> > >>>> +#include <linux/virtio_blk.h> > >>>> +#include <linux/mutex.h> > >>>> +#include <linux/file.h> > >>>> +#include <linux/kthread.h> > >>>> +#include <linux/blkdev.h> > >>>> +#include <linux/llist.h> > >>>> + > >>>> +#include "vhost.c" > >>>> +#include "vhost.h" > >>>> +#include "blk.h" > >>>> + > >>>> +static DEFINE_IDA(vhost_blk_index_ida); > >>>> + > >>>> +enum { > >>>> + VHOST_BLK_VQ_REQ = 0, > >>>> + VHOST_BLK_VQ_MAX = 1, > >>>> +}; > >>>> + > >>>> +struct req_page_list { > >>>> + struct page **pages; > >>>> + int pages_nr; > >>>> +}; > >>>> + > >>>> +struct vhost_blk_req { > >>>> + struct llist_node llnode; > >>>> + struct req_page_list *pl; > >>>> + struct vhost_blk *blk; > >>>> + > >>>> + struct iovec *iov; > >>>> + int iov_nr; > >>>> + > >>>> + struct bio **bio; > >>>> + atomic_t bio_nr; > >>>> + > >>>> + struct iovec status[1]; > >>>> + > >>>> + sector_t sector; > >>>> + int write; > >>>> + u16 head; > >>>> + long len; > >>>> +}; > >>>> + > >>>> +struct vhost_blk { > >>>> + struct task_struct *host_kick; > >>> > >>> Could you please add a comment here explaining why > >>> is this better than using the builtin vhost thread? > >> > >> With separate thread model, the request submit and request completion > >> can be handled in parallel. I have measured significant performance > >> improvement with this model. > >> In the long term, It would be nice if the > >> vhost core can provide multiple thread support. > > > > Strange, all completion does it write out the used ring. > > I am guessing you didn't complete requests > > in a timely manner which was what caused the issue. > > Previously, there was two 'struct vhost_work' work which poll on the > guest kick eventfd and host aio eventfd. The vhost thread is handling > the works when there are data in eventfd, either guest kick or aio > completion. See VHOST_NET_WEIGHT in how in -net we prevent starvation between tx and rx. > > OTOH I'm not sure how increasing the number of threads > > scales with # of VMs. Increased scheduler pressure is > > also a concern. > > Agree, this is also my concern. However, this is a trade-off. we have a > price to pay for the parallelism. If the work offloaded is trivial in size then the gain is likely to be marginal. > Also, one thread per vhost device > might not scale with # of VMs too, compared to e.g. per cpu thread. I'e no idea how to do QoS with a per cpu thread otherwise > > Did you try checking completion after each > > submitted request to address this? > > Isn't this a sync operation? We can not only account on the checking the > completion in the submit path. Another polling on the completion is > needed anyway. Mixing the completion in submit path, 1) makes the submit > delay longer 2) does not help with the case where a single request takes > relatively long time to finish. IMHO, I do not think mixing the submit > and completion is a good idea. To clarify, you would 1. do llist_del_all in handle_kick - the delay this adds should be trivial 2. for completion queue work on vhost thread - this can be as simple as waking up the regular kick handler You can further reduce overhead by batching the wakeups in 2. See my patch vhost-net: reduce vq polling on tx zerocopy that does this for -net. > -- > Asias -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html