On 03/01/2018 15:39, Brian Spraker -- BsnTech wrote:
> On 1/2/2018 6:09 PM, Paolo Bonzini wrote:
>> On 02/01/2018 19:21, Brian Spraker -- BsnTech wrote:
>>> Hello all,
>>>
>>> Just recently purchased a Samsung 960 Evo NVMe drive.  It is on a PCI
>>> Express add-in card in a PCIe 2.0 x4 slot since the motherboard itself
>>> does not have an M.2 slot.
>>>
>>> On the main host machine, I ran "hdparm -Tt --direct /dev/nvme0n1" and
>>> get the following:
>>>
>>>  Timing O_DIRECT cached reads:   2578 MB in  2.00 seconds = 1289.17 MB/sec
>>>  Timing O_DIRECT disk reads:     3962 MB in  3.00 seconds = 1320.20 MB/sec
>>>
>>> The drive has one Ext4 partition on it.  Mounted with "noatime" to the
>>> /VMs mount point.
>>>
>>> Host server is running Ubuntu server 16.04, and running "kvm --version"
>>> shows QEMU emulator version 2.5.0.
>>>
>>> When I run the same "hdparm -Tt --direct /dev/vda" on the guests, it
>>> shows quite a bit less:
>>>
>>>  Timing O_DIRECT cached reads:   1456 MB in  2.00 seconds = 727.51 MB/sec
>>>  Timing O_DIRECT disk reads:     2730 MB in  3.00 seconds = 909.76 MB/sec
>>>
>>> Guest machines are set up with 10 GB of memory, 8 CPUs (host CPU config
>>> copied), virtio for the disk bus, a raw image file, and no cache.  Guest
>>> machines are also Ubuntu 16.04.
>>>
>>> Before the upgrade, I was using an SSD in SATA III mode.  The host and
>>> guest disk reads were only a few MB/sec apart.
>>>
>>> Is there something else I need to look at to see why the guests are only
>>> getting about half the throughput of the drive?
>> It's hard to say this conclusively without knowing the exact I/O pattern
>> that hdparm is using or the results for your SATA SSD.
>>
>> However, I'd guess your new disk is faster and you're now CPU bound.
>> Unfortunately, the cost of an interrupt is roughly doubled by
>> virtualization (because you have to go disk->host->QEMU->guest) and so
>> is the latency.
>>
>> If hdparm is only issuing 1 I/O operation at a time, throughput is the
>> reciprocal of the latency: double the latency, and the throughput is
>> halved.
>>
>> Try using aio=native and adding a dedicated iothread for the disk.  That
>> can give you better throughput, especially if the queue depth (# of I/O
>> operations active at any one time during the benchmark) is >1.
>>
>> Thanks,
>>
>> Paolo
> Thank you Paolo.  I changed aio=native and this did increase
> performance.  Now the results on the guest are:
>
>  Timing O_DIRECT cached reads:   2252 MB in  2.00 seconds = 1125.55 MB/sec
>  Timing O_DIRECT disk reads:     2874 MB in  3.00 seconds = 957.47 MB/sec
>
> The cached reads are quite a bit different, and the disk reads went up by
> about 50 MB/sec - still 400 MB/sec under what the host can do.
>
> The machine has an eight-core AMD FX-8350 processor and on average it is
> only using 5% CPU capacity.  Very underutilized, but I don't know if that
> makes any difference or not.

When I say CPU bound, I mean bound by the latency of the CPU's job.
That is:

- for the host OS, responding to the interrupt and signaling completion
  of the I/O operation to QEMU;

- for QEMU, responding to the completion and forwarding it to the guest OS;

- for the guest OS, responding to the interrupt and signaling completion
  of the I/O operation to hdparm;

- and all the way back from hdparm to the host OS.

> Any guidance on how to make a dedicated iothread for the disk?

Add this inside <domain>:

    <iothreads>1</iothreads>

and add iothread='1' to the disk's <driver> element.
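For reference, a rough sketch of how the two pieces fit together in the
domain XML (cache='none' and io='native' are libvirt's spellings of
cache=none and aio=native; the source path and image name below are only
placeholders for your own setup):

    <domain type='kvm'>
      ...
      <iothreads>1</iothreads>
      <devices>
        <disk type='file' device='disk'>
          <!-- no host page cache, Linux native AIO, served by iothread 1 -->
          <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
          <source file='/VMs/guest.img'/>  <!-- placeholder path -->
          <target dev='vda' bus='virtio'/>
        </disk>
      </devices>
    </domain>

The change only takes effect after the guest is shut down and started
again; "virsh dumpxml <guest>" should then show the iothread attribute
on the driver line.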
Thanks,

Paolo

> For the SATA SSD results, this is what it reads on the host:
>
> /dev/sda:
>  Timing O_DIRECT cached reads:   926 MB in  2.00 seconds = 462.39 MB/sec
>  Timing O_DIRECT disk reads:     1426 MB in  3.00 seconds = 475.31 MB/sec
>
> And on the guest:
>
> /dev/vda:
>  Timing O_DIRECT cached reads:   1166 MB in  2.00 seconds = 582.71 MB/sec
>  Timing O_DIRECT disk reads:     792 MB in  3.02 seconds = 262.52 MB/sec
>
> So I was mistaken - the cached reads are somehow higher on the guest, but
> the disk reads are 200 MB/sec slower.