On 11/21/2012 07:57 PM, Michael S. Tsirkin wrote: > On Wed, Nov 21, 2012 at 12:24:55PM +0800, Asias He wrote: >> On 11/20/2012 09:37 PM, Michael S. Tsirkin wrote: >>> On Tue, Nov 20, 2012 at 02:39:40PM +0800, Asias He wrote: >>>> On 11/20/2012 04:26 AM, Michael S. Tsirkin wrote: >>>>> On Mon, Nov 19, 2012 at 04:53:42PM +0800, Asias He wrote: >>>>>> vhost-blk is an in-kernel virito-blk device accelerator. >>>>>> >>>>>> Due to lack of proper in-kernel AIO interface, this version converts >>>>>> guest's I/O request to bio and use submit_bio() to submit I/O directly. >>>>>> So this version any supports raw block device as guest's disk image, >>>>>> e.g. /dev/sda, /dev/ram0. We can add file based image support to >>>>>> vhost-blk once we have in-kernel AIO interface. There are some work in >>>>>> progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown: >>>>>> >>>>>> http://marc.info/?l=linux-fsdevel&m=133312234313122 >>>>>> >>>>>> Performance evaluation: >>>>>> ----------------------------- >>>>>> 1) LKVM >>>>>> Fio with libaio ioengine on Fusion IO device using kvm tool >>>>>> IOPS(k) Before After Improvement >>>>>> seq-read 107 121 +13.0% >>>>>> seq-write 130 179 +37.6% >>>>>> rnd-read 102 122 +19.6% >>>>>> rnd-write 125 159 +27.0% >>>>>> >>>>>> 2) QEMU >>>>>> Fio with libaio ioengine on Fusion IO device using QEMU >>>>>> IOPS(k) Before After Improvement >>>>>> seq-read 76 123 +61.8% >>>>>> seq-write 139 173 +24.4% >>>>>> rnd-read 73 120 +64.3% >>>>>> rnd-write 75 156 +108.0% >>>>> >>>>> Could you compare with dataplane qemu as well please? >>>> >>>> >>>> Well, I will try to collect it. >>>> >>>>> >>>>>> >>>>>> Userspace bits: >>>>>> ----------------------------- >>>>>> 1) LKVM >>>>>> The latest vhost-blk userspace bits for kvm tool can be found here: >>>>>> git@xxxxxxxxxx:asias/linux-kvm.git blk.vhost-blk >>>>>> >>>>>> 2) QEMU >>>>>> The latest vhost-blk userspace prototype for QEMU can be found here: >>>>>> git@xxxxxxxxxx:asias/qemu.git blk.vhost-blk >>>>>> >>>>>> Changes in v5: >>>>>> - Do not assume the buffer layout >>>>>> - Fix wakeup race >>>>>> >>>>>> Changes in v4: >>>>>> - Mark req->status as userspace pointer >>>>>> - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status() >>>>>> - Add if (need_resched()) schedule() in blk thread >>>>>> - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop() >>>>>> - Use vq_err() instead of pr_warn() >>>>>> - Fail un Unsupported request >>>>>> - Add flush in vhost_blk_set_features() >>>>>> >>>>>> Changes in v3: >>>>>> - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph! >>>>>> - Check file passed by user is a raw block device file >>>>>> >>>>>> Signed-off-by: Asias He <asias@xxxxxxxxxx> >>>>> >>>>> Since there are files shared by this and vhost net >>>>> it's easiest for me to merge this all through the >>>>> vhost tree. >>>>> >>>>> Jens, could you ack this and the bio usage in this driver >>>>> please? >>>>> >>>>>> --- >>>>>> drivers/vhost/Kconfig | 1 + >>>>>> drivers/vhost/Kconfig.blk | 10 + >>>>>> drivers/vhost/Makefile | 2 + >>>>>> drivers/vhost/blk.c | 697 ++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> drivers/vhost/blk.h | 8 + >>>>>> 5 files changed, 718 insertions(+) >>>>>> create mode 100644 drivers/vhost/Kconfig.blk >>>>>> create mode 100644 drivers/vhost/blk.c >>>>>> create mode 100644 drivers/vhost/blk.h >>>>>> >>>>>> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig >>>>>> index 202bba6..acd8038 100644 >>>>>> --- a/drivers/vhost/Kconfig >>>>>> +++ b/drivers/vhost/Kconfig >>>>>> @@ -11,4 +11,5 @@ config VHOST_NET >>>>>> >>>>>> if STAGING >>>>>> source "drivers/vhost/Kconfig.tcm" >>>>>> +source "drivers/vhost/Kconfig.blk" >>>>>> endif >>>>>> diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk >>>>>> new file mode 100644 >>>>>> index 0000000..ff8ab76 >>>>>> --- /dev/null >>>>>> +++ b/drivers/vhost/Kconfig.blk >>>>>> @@ -0,0 +1,10 @@ >>>>>> +config VHOST_BLK >>>>>> + tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)" >>>>>> + depends on BLOCK && EXPERIMENTAL && m >>>>>> + ---help--- >>>>>> + This kernel module can be loaded in host kernel to accelerate >>>>>> + guest block with virtio_blk. Not to be confused with virtio_blk >>>>>> + module itself which needs to be loaded in guest kernel. >>>>>> + >>>>>> + To compile this driver as a module, choose M here: the module will >>>>>> + be called vhost_blk. >>>>>> diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile >>>>>> index a27b053..1a8a4a5 100644 >>>>>> --- a/drivers/vhost/Makefile >>>>>> +++ b/drivers/vhost/Makefile >>>>>> @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o >>>>>> vhost_net-y := vhost.o net.o >>>>>> >>>>>> obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o >>>>>> +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o >>>>>> +vhost_blk-y := blk.o >>>>>> diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c >>>>>> new file mode 100644 >>>>>> index 0000000..f0f118a >>>>>> --- /dev/null >>>>>> +++ b/drivers/vhost/blk.c >>>>>> @@ -0,0 +1,697 @@ >>>>>> +/* >>>>>> + * Copyright (C) 2011 Taobao, Inc. >>>>>> + * Author: Liu Yuan <tailai.ly@xxxxxxxxxx> >>>>>> + * >>>>>> + * Copyright (C) 2012 Red Hat, Inc. >>>>>> + * Author: Asias He <asias@xxxxxxxxxx> >>>>>> + * >>>>>> + * This work is licensed under the terms of the GNU GPL, version 2. >>>>>> + * >>>>>> + * virtio-blk server in host kernel. >>>>>> + */ >>>>>> + >>>>>> +#include <linux/miscdevice.h> >>>>>> +#include <linux/module.h> >>>>>> +#include <linux/vhost.h> >>>>>> +#include <linux/virtio_blk.h> >>>>>> +#include <linux/mutex.h> >>>>>> +#include <linux/file.h> >>>>>> +#include <linux/kthread.h> >>>>>> +#include <linux/blkdev.h> >>>>>> +#include <linux/llist.h> >>>>>> + >>>>>> +#include "vhost.c" >>>>>> +#include "vhost.h" >>>>>> +#include "blk.h" >>>>>> + >>>>>> +static DEFINE_IDA(vhost_blk_index_ida); >>>>>> + >>>>>> +enum { >>>>>> + VHOST_BLK_VQ_REQ = 0, >>>>>> + VHOST_BLK_VQ_MAX = 1, >>>>>> +}; >>>>>> + >>>>>> +struct req_page_list { >>>>>> + struct page **pages; >>>>>> + int pages_nr; >>>>>> +}; >>>>>> + >>>>>> +struct vhost_blk_req { >>>>>> + struct llist_node llnode; >>>>>> + struct req_page_list *pl; >>>>>> + struct vhost_blk *blk; >>>>>> + >>>>>> + struct iovec *iov; >>>>>> + int iov_nr; >>>>>> + >>>>>> + struct bio **bio; >>>>>> + atomic_t bio_nr; >>>>>> + >>>>>> + struct iovec status[1]; >>>>>> + >>>>>> + sector_t sector; >>>>>> + int write; >>>>>> + u16 head; >>>>>> + long len; >>>>>> +}; >>>>>> + >>>>>> +struct vhost_blk { >>>>>> + struct task_struct *host_kick; >>>>> >>>>> Could you please add a comment here explaining why >>>>> is this better than using the builtin vhost thread? >>>> >>>> With separate thread model, the request submit and request completion >>>> can be handled in parallel. I have measured significant performance >>>> improvement with this model. >>>> In the long term, It would be nice if the >>>> vhost core can provide multiple thread support. >>> >>> Strange, all completion does it write out the used ring. >>> I am guessing you didn't complete requests >>> in a timely manner which was what caused the issue. >> >> Previously, there was two 'struct vhost_work' work which poll on the >> guest kick eventfd and host aio eventfd. The vhost thread is handling >> the works when there are data in eventfd, either guest kick or aio >> completion. > > See VHOST_NET_WEIGHT in how in -net we prevent starvation > between tx and rx. Looks interesting. >>> OTOH I'm not sure how increasing the number of threads >>> scales with # of VMs. Increased scheduler pressure is >>> also a concern. >> >> Agree, this is also my concern. However, this is a trade-off. we have a >> price to pay for the parallelism. > > If the work offloaded is trivial in size then the gain is likely > to be marginal. > >> Also, one thread per vhost device >> might not scale with # of VMs too, compared to e.g. per cpu thread. > > I'e no idea how to do QoS with a per cpu thread otherwise > >>> Did you try checking completion after each >>> submitted request to address this? >> >> Isn't this a sync operation? We can not only account on the checking the >> completion in the submit path. Another polling on the completion is >> needed anyway. Mixing the completion in submit path, 1) makes the submit >> delay longer 2) does not help with the case where a single request takes >> relatively long time to finish. IMHO, I do not think mixing the submit >> and completion is a good idea. > > To clarify, you would > 1. do llist_del_all in handle_kick - > the delay this adds should be trivial > 2. for completion queue work on vhost thread - > this can be as simple as waking up the regular > kick handler I am trying the one thread model. If it works fine, I will switch to it in v6. > You can further reduce overhead by batching > the wakeups in 2. See my patch > vhost-net: reduce vq polling on tx zerocopy > that does this for -net. okay. thanks! -- Asias -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html