On Fri, Jul 29, 2011 at 1:01 PM, Liu Yuan <namei.unix@xxxxxxxxx> wrote:
> On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote:
>>
>> I mean did you investigate *why* userspace virtio-blk has higher latency? Did you profile it and drill down on its performance?
>>
>> It's important to understand what is going on before replacing it with another mechanism. What I'm saying is, if I have a buggy program I can sometimes rewrite it from scratch correctly, but that doesn't tell me what the bug was.
>>
>> Perhaps the inefficiencies in userspace virtio-blk can be solved by adjusting the code (removing inefficient notification mechanisms, introducing a dedicated thread outside of the QEMU iothread model, etc). Then we'd get the performance benefit for non-raw images and perhaps non-virtio and non-Linux host platforms too.
>>
>
> As Christoph mentioned, the unnecessary memory allocations and the cache-line-unfriendly function pointers might be the culprit. For example, the read request code path for Linux AIO is:
>
> qemu_iohandler_poll->virtio_pci_host_notifier_read->virtio_queue_notify_vq->virtio_blk_handle_output->virtio_blk_handle_read->bdrv_aio_read->raw_aio_readv->bdrv_aio_readv (yes, nested calls again!)->raw_aio_readv->laio_submit->io_submit...
>
> Looking at this long list, most of these are function pointers that cannot be inlined, and the internal data structures used by these functions number in the dozens. Code complexity aside, this long code path really needs a retrofit. As Christoph put it, this kind of mess is inherent all over the QEMU code. So I am afraid the 'retrofit' would end up being a rewrite of the entire (sub)system. I have to admit that I am inclined towards MST's vhost approach: write a new subsystem rather than do the tedious profiling and fixing, which could well go as far as an actual rewrite anyway.

I'm totally for vhost-blk if there are unique benefits that make it worth maintaining. But better benchmark results are not a cause, they are an effect. So the thing to do is to drill down on both vhost-blk and userspace virtio-blk to understand what causes the overheads. Evidence showing that userspace can never compete is needed to justify vhost-blk IMO.
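For reference, the very bottom of that userspace call chain is plain Linux AIO. Here is a minimal sketch of the kind of read submission that laio_submit ultimately boils down to (illustrative only, error handling trimmed, not QEMU's actual code):

/* aio-read.c: one O_DIRECT read via Linux AIO.  Build: gcc aio-read.c -laio */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    io_context_t ctx = 0;
    struct iocb iocb, *iocbs[1] = { &iocb };
    struct io_event event;
    void *buf;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    /* cache=none in QEMU means O_DIRECT, which needs aligned buffers */
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0 || io_setup(1, &ctx) < 0 || posix_memalign(&buf, 4096, 4096) != 0) {
        perror("setup");
        return 1;
    }

    /* This is essentially what a single virtio-blk read request turns into */
    io_prep_pread(&iocb, fd, buf, 4096, 0);
    if (io_submit(ctx, 1, iocbs) != 1) {
        perror("io_submit");
        return 1;
    }

    /* QEMU picks completions up asynchronously; here we just wait */
    if (io_getevents(ctx, 1, 1, &event, NULL) == 1) {
        printf("read returned %ld\n", (long)event.res);
    }

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}

The interesting question is how much latency the layers above io_submit() add, and that's exactly what profiling should tell us.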
>>> Actually, the motivation to start vhost-blk is that, in our observation, KVM (virtio enabled) in RHEL 6 is worse than Xen (PV) in RHEL from a disk I/O perspective, especially for sequential read/write (around a 20% gap).
>>>
>>> We'll deploy a large number of KVM-based systems as the infrastructure of some service and this gap is really unpleasant.
>>>
>>> By design, IMHO, virtio performance is supposed to be comparable to the paravirtualization solution if not better, because for KVM, guest and backend driver can sit in the same address space via mmap. This would reduce the overhead involved in page table modification, thus speeding up buffer management and transfer a lot compared with Xen PV.
>>
>> Yes, guest memory is just a region of QEMU userspace memory. So it's easy to reach inside and there are no page table tricks or copying involved.
>>
>>> I am not in a qualified position to talk about QEMU, but I think the surprising performance improvement from this very primitive vhost-blk simply shows that the internal structure of QEMU I/O is bloated. I say *surprising* because basically vhost just reduces the number of system calls, something that has been heavily tuned by chip manufacturers for years. So, I guess the performance gain of vhost-blk could mainly be attributed to its *shorter and simpler* code path.
>>
>> First we need to understand exactly what the latency overhead is. If we discover that it's simply not possible to do this equally well in userspace, then it makes perfect sense to use vhost-blk.
>>
>> So let's gather evidence and learn what the overheads really are. Last year I spent time looking at virtio-blk latency:
>> http://www.linux-kvm.org/page/Virtio/Block/Latency
>>
>
> Nice stuff.
>
>> See especially this diagram:
>> http://www.linux-kvm.org/page/Image:Threads.png
>>
>> The goal wasn't specifically to reduce synchronous sequential I/O latency; instead the aim was to reduce overheads for a variety of scenarios, especially multithreaded workloads.
>>
>> In most cases it was helpful to move I/O submission out of the vcpu thread by using the ioeventfd model, just like vhost. Ioeventfd for userspace virtio-blk is now on by default in qemu-kvm.
>>
>> Try running the userspace virtio-blk benchmark with -drive if=none,id=drive0,file=... -device virtio-blk-pci,drive=drive0,ioeventfd=off. This causes QEMU to do I/O submission in the vcpu thread, which might reduce latency at the cost of stealing guest time.
>>
>>> Anyway, IMHO, compared with the userspace approach, the in-kernel one would allow more flexibility and better integration with the kernel I/O stack, since we don't need two I/O stacks for the guest OS.
>>
>> I agree that there may be advantages to integrating with in-kernel I/O mechanisms. An interesting step would be to implement the submit_bio() approach that Christoph suggested and see if that improves things further.
>>
>> Push virtio-blk as far as you can and let's see what the performance is!
>>
>>>> I have a hacked up world here that basically implements vhost-blk in userspace:
>>>>
>>>> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>>>>
>>>> * A dedicated virtqueue thread sleeps on ioeventfd
>>>> * Guest memory is pre-mapped and accessed directly (not using QEMU's usual memory access functions)
>>>> * Linux AIO is used, the QEMU block layer is bypassed
>>>> * Completion interrupts are injected from the virtqueue thread using ioctl
>>>>
>>>> I will try to rebase onto qemu-kvm.git/master (this work is several months old). Then we can compare to see how much of the benefit can be gotten in userspace.
>>>>
>>> I don't really get you about vhost-blk in userspace, since the vhost infrastructure itself means an in-kernel accelerator implemented in the kernel. I guess what you mean is somewhat a rewrite of virtio-blk in userspace with a dedicated thread handling requests, and a shorter code path similar to vhost-blk.
>>
>> Right - it's the same model as vhost: a dedicated thread listening for ioeventfd virtqueue kicks and processing them out-of-line with the guest and userspace QEMU's traditional vcpu and iothread.
>>
>> When you say "IOPS drops drastically" do you mean that it gets worse than with queue-depth=1?
>>
>
> Yes, on my laptop, when iodepth = 3, IOPS on my host drops to about 3,500 from 13K! The same happens at iodepth = 4 in my guest during a fio sequential read test. This should never happen.

Yes, that doesn't make sense to me unless the I/O scheduler is doing something weird. Have you tried switching between cfq, deadline, and noop?
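One more thought on the dedicated-thread model discussed above: here is a rough sketch of what I mean by a virtqueue thread that sleeps on the ioeventfd and processes kicks out-of-line (the virtqueue handling itself is stubbed out; this only illustrates the notification plumbing, not the actual data-plane code):

/* vq-thread.c: dedicated thread sleeping on an eventfd.  Build: gcc vq-thread.c -lpthread */
#include <sys/eventfd.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int kick_fd;   /* would be registered with KVM as the vq's ioeventfd */

static void process_virtqueue(void)
{
    /* Pop requests from the vring, submit them with Linux AIO, ...
     * (stubbed out for illustration) */
}

static void *vq_thread(void *opaque)
{
    uint64_t count;

    for (;;) {
        /* Blocks until the guest kicks the virtqueue; the counter tells us
         * how many kicks were coalesced while we were busy. */
        if (read(kick_fd, &count, sizeof(count)) != sizeof(count)) {
            break;
        }
        process_virtqueue();
        /* Completion interrupts are injected from here too, so the vcpu
         * thread and the traditional iothread stay out of the fast path. */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    uint64_t one = 1;

    kick_fd = eventfd(0, 0);
    if (kick_fd < 0) {
        perror("eventfd");
        return 1;
    }
    pthread_create(&tid, NULL, vq_thread, NULL);

    /* Stand-in for a guest kick: KVM signals the eventfd when the guest
     * writes to the virtqueue notify register. */
    write(kick_fd, &one, sizeof(one));
    sleep(1);
    return 0;
}

That's essentially the vhost model too, just in the kernel: the guest's kick becomes an eventfd signal and all the actual work happens off to the side of the vcpu thread and the iothread.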
Stefan