On Tue, Feb 21, 2012 at 3:57 PM, Dongsu Park <dongsu.park@xxxxxxxxxxxxxxxx> wrote:
> On 13.02.2012 11:57, Stefan Hajnoczi wrote:
>> On Fri, Feb 10, 2012 at 2:36 PM, Dongsu Park
>> <dongsu.park@xxxxxxxxxxxxxxxx> wrote:
>> > Now I'm running benchmarks with both qemu-kvm 0.14.1 and 1.0.
>> >
>> > - Sequential read (running inside the guest)
>> > # fio -name iops -rw=read -size=1G -iodepth 1 \
>> >     -filename /dev/vdb -ioengine libaio -direct=1 -bs=4096
>> >
>> > - Sequential write (running inside the guest)
>> > # fio -name iops -rw=write -size=1G -iodepth 1 \
>> >     -filename /dev/vdb -ioengine libaio -direct=1 -bs=4096
>> >
>> > I ran each test 3 times and took the average.
>> >
>> > Result:
>> >
>> > seqread  with qemu-kvm 0.14.1   67.0 MByte/s
>> > seqread  with qemu-kvm 1.0      30.9 MByte/s
>> >
>> > seqwrite with qemu-kvm 0.14.1   65.8 MByte/s
>> > seqwrite with qemu-kvm 1.0      30.5 MByte/s
>>
>> Please retry with the following commit or simply qemu-kvm.git/master.
>> Avi discovered a performance regression which was introduced when the
>> block layer was converted to use coroutines:
>>
>> $ git describe 39a7a362e16bb27e98738d63f24d1ab5811e26a8
>> v1.0-327-g39a7a36
>>
>> (This commit is not in 1.0!)
>>
>> Please post your qemu-kvm command line.
>>
>> 67 MB/s of sequential 4 KB reads means 67 * 1024 / 4 = 17152 requests
>> per second, i.e. about 58 microseconds per request.
>>
>> Please post the fio output so we can double-check what is reported.
>
> As you suggested, I tested again at revision v1.0-327-g39a7a36, which
> includes commit 39a7a36.
>
> The result is still not good enough, though:
> seqread  : 20.3 MByte/s
> seqwrite : 20.1 MByte/s
> randread : 20.5 MByte/s
> randwrite: 20.0 MByte/s
>
> My qemu-kvm command line is as follows:
>
> =======================================================================
> /usr/bin/kvm -S -M pc-0.14 -enable-kvm -m 1024 \
> -smp 1,sockets=1,cores=1,threads=1 -name mydebian3_8gb \
> -uuid d99ad012-2fcc-6f7e-fbb9-bc48b424a258 -nodefconfig -nodefaults \
> -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/mydebian3_8gb.monitor,server,nowait \
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown \
> -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw \
> -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
> -drive file=/var/lib/libvirt/images/mydebian3_8gb.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
> -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> -drive file=/dev/ram0,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native \

I'm not sure that O_DIRECT plus Linux AIO on /dev/ram0 is a good idea.
With tmpfs, at least, O_DIRECT does not even work - which makes sense
there, because tmpfs lives in the page cache.

My point is that a ramdisk does not follow the same rules or have the
same performance characteristics as a real disk. It's something to be
careful about.

Did you run this test because you noticed a real-world regression?
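A quick way to see the tmpfs behaviour described above is to attempt an
O_DIRECT write to a tmpfs-backed file. This is only a sketch: it assumes
a Linux host where /dev/shm is mounted as tmpfs, and the probe file name
is arbitrary.

```shell
# Probe whether O_DIRECT works on tmpfs.  /dev/shm is tmpfs on most
# Linux systems; tmpfs lives in the page cache, so opening a file
# there with O_DIRECT is expected to fail (EINVAL from open(2)).
probe=/dev/shm/odirect-probe
if dd if=/dev/zero of="$probe" bs=4096 count=1 oflag=direct 2>/dev/null; then
    echo "O_DIRECT write succeeded"
else
    echo "O_DIRECT write failed"
fi
rm -f "$probe"
```

The same probe against a block device such as /dev/ram0 behaves
differently, since O_DIRECT on a block device bypasses the page cache
at the block layer rather than in a filesystem.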
> Here is a sample of fio output:
>
> =======================================================================
> # fio -name iops -rw=read -size=1G -iodepth 1 -filename /dev/vdb \
>       -ioengine libaio -direct=1 -bs=4096
> iops: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
> Starting 1 process
> Jobs: 1 (f=1): [R] [100.0% done] [21056K/0K /s] [5140/0 iops] [eta 00m:00s]
> iops: (groupid=0, jobs=1): err= 0: pid=1588
>   read : io=1024MB, bw=20101KB/s, iops=5025, runt= 52166msec
>     slat (usec): min=4, max=6461, avg=24.00, stdev=19.75
>     clat (usec): min=0, max=11934, avg=169.49, stdev=113.91
>     bw (KB/s) : min=18200, max=23048, per=100.03%, avg=20106.31, stdev=934.42
>   cpu : usr=5.43%, sys=23.25%, ctx=262363, majf=0, minf=28
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w: total=262144/0, short=0/0
>      lat (usec): 2=0.01%, 4=0.16%, 10=0.03%, 20=0.01%, 50=0.27%
>      lat (usec): 100=4.07%, 250=89.12%, 500=5.76%, 750=0.30%, 1000=0.13%
>      lat (msec): 2=0.12%, 4=0.02%, 10=0.01%, 20=0.01%
>
> Run status group 0 (all jobs):
>    READ: io=1024MB, aggrb=20100KB/s, minb=20583KB/s, maxb=20583KB/s,
>    mint=52166msec, maxt=52166msec
>
> Disk stats (read/write):
>   vdb: ios=261308/0, merge=0/0, ticks=40210/0, in_queue=40110, util=77.14%
> =======================================================================
>
> So I think the coroutine-ucontext patch isn't the bottleneck I'm
> looking for.

Try turning ioeventfd off for the virtio-blk device:

  -device virtio-blk-pci,ioeventfd=off,...

You might see better performance, since ramdisk I/O should be very
low-latency and the overhead of using ioeventfd might not be worth it.
The ioeventfd feature was added post-0.14, IIRC.
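Applied to the command line quoted earlier, that change would look
roughly like the fragment below. This is only a sketch: the other
options stay as they were, and the device id "virtio-disk1" is an
assumption, since the quoted command line was truncated before the
-device option for the ramdisk drive.

```
# Hypothetical fragment of the kvm invocation quoted above, with
# ioeventfd disabled on the ramdisk-backed virtio-blk device only:
/usr/bin/kvm \
    ... \
    -drive file=/dev/ram0,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native \
    -device virtio-blk-pci,ioeventfd=off,drive=drive-virtio-disk1,id=virtio-disk1 \
    ...
```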
Normally ioeventfd helps by avoiding stolen vcpu time and lock
contention inside the guest - but if host I/O latency is extremely low,
it might be faster to issue I/O from the vcpu thread.

Stefan
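As a footnote to the thread: the per-request latencies implied by the
bandwidths above can be reproduced with plain shell arithmetic. The
20101 KB/s and iops=5025 figures are taken from the fio output quoted
earlier; integer division truncates the results.

```shell
# Convert a 4 KB read bandwidth into requests/s and average
# microseconds per request.
bs_kb=4

# qemu-kvm 0.14.1: 67 MByte/s
iops_old=$(( 67 * 1024 / bs_kb ))    # 17152 requests/s
us_old=$(( 1000000 / iops_old ))     # ~58 us per request

# qemu-kvm 1.0: fio reported bw=20101KB/s, iops=5025
iops_new=$(( 20101 / bs_kb ))        # 5025 requests/s
us_new=$(( 1000000 / iops_new ))     # ~199 us per request

echo "$iops_old req/s (~$us_old us) vs $iops_new req/s (~$us_new us)"
```

So each 4 KB request takes roughly 3.4x longer on 1.0 than on 0.14.1,
which is consistent with per-request overhead (rather than bandwidth)
being the bottleneck at iodepth=1.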