Re: KVM / Use of NVMe Drives & Throughput

On 1/2/2018 6:09 PM, Paolo Bonzini wrote:
On 02/01/2018 19:21, Brian Spraker -- BsnTech wrote:
Hello all,

Just recently purchased a Samsung 960 Evo NVMe drive.  It is on a PCI
Express add-in card  in a PCIe 2.0 x4 slot since the motherboard itself
does not have an M.2 slot.

On the main host machine, I ran "hdparm -Tt --direct /dev/nvme0n1" and got the following:

Timing O_DIRECT cached reads:   2578 MB in  2.00 seconds = 1289.17 MB/sec
Timing O_DIRECT disk reads: 3962 MB in  3.00 seconds = 1320.20 MB/sec

The drive has one Ext4 partition on it.  Mounted with "noatime" to the
/VMs mount point.

The host server is running Ubuntu Server 16.04, and "kvm --version" shows QEMU emulator version 2.5.0.

When I run the same "hdparm -Tt --direct /dev/vda" in the guests, it shows quite a bit less:

Timing O_DIRECT cached reads:   1456 MB in  2.00 seconds = 727.51 MB/sec
Timing O_DIRECT disk reads: 2730 MB in  3.00 seconds = 909.76 MB/sec

Guest machines are set up with 10 GB of memory, 8 CPUs (host CPU configuration copied), a virtio disk bus, a raw image file, and no cache.  The guest machines are also Ubuntu 16.04.
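
That configuration maps roughly onto a QEMU command line like the one below; the image file name is a placeholder (only the /VMs mount point comes from the setup above), so this is a sketch rather than the exact invocation used here:

  qemu-system-x86_64 -enable-kvm -m 10240 -smp 8 -cpu host \
      -drive file=/VMs/guest.img,format=raw,if=virtio,cache=none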

Before the upgrade, I was using an SSD in SATA III mode.  The host and
guest disk reads were only a few MB/sec apart.

Is there something else I need to look at to see why the guests are only
getting about half the throughput of the drive?

It's hard to say this conclusively without knowing the exact I/O pattern
that hdparm is using or the results for your SATA SSD.

However, I'd guess your new disk is faster and you're now CPU bound.
Unfortunately, the cost of an interrupt is roughly doubled by
virtualization (because you have to go disk->host->QEMU->guest) and so
is the latency.

If hdparm is only issuing 1 I/O operation at a time, throughput is the
reciprocal of the latency: double the latency, and the throughput is halved.
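
As a rough illustration with invented numbers, assuming one outstanding 2 MB sequential read at a time:

  2 MB / 1.5 ms per request  =  ~1330 MB/sec
  2 MB / 3.0 ms per request  =   ~665 MB/sec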

Try using aio=native and adding a dedicated iothread for the disk.  That
can give you better throughput, especially if the queue depth (# of I/O
operations active at any one time during the benchmark) is >1.
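
On QEMU 2.5 that combination looks roughly like the following on the command line (the ids and the image path are placeholders; if the guest is defined through libvirt, the same settings are the io='native' and iothread attributes on the disk's <driver> element plus an <iothreads> count in the domain XML):

  -object iothread,id=iothread0 \
  -drive file=/VMs/guest.img,format=raw,if=none,id=drive0,cache=none,aio=native \
  -device virtio-blk-pci,drive=drive0,iothread=iothread0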

Thanks,

Paolo

Thank you, Paolo.  I changed to aio=native and this did increase performance.  Now the results on the guest are:

Timing O_DIRECT cached reads:   2252 MB in  2.00 seconds = 1125.55 MB/sec
Timing O_DIRECT disk reads: 2874 MB in  3.00 seconds = 957.47 MB/sec

The cached reads improved quite a bit, and the disk reads went up by about 50 MB/sec - but that is still roughly 400 MB/sec under what the host can do.

The machine has an eight-core AMD FX-8350 processor and, on average, it is only using about 5% of its CPU capacity.  It is very underutilized, but I don't know whether that makes any difference.  Any guidance on how to add a dedicated iothread for the disk?

For the SATA SSD results, this is what it reads on the host:

/dev/sda:
 Timing O_DIRECT cached reads:   926 MB in  2.00 seconds = 462.39 MB/sec
 Timing O_DIRECT disk reads: 1426 MB in  3.00 seconds = 475.31 MB/sec

And on the guest:

/dev/vda:
 Timing O_DIRECT cached reads:   1166 MB in  2.00 seconds = 582.71 MB/sec
 Timing O_DIRECT disk reads: 792 MB in  3.02 seconds = 262.52 MB/sec

So I was mistaken - the cached reads are somehow higher on the guest, but the disk reads are about 200 MB/sec slower.


