On 03/01/2018 15:39, Brian Spraker -- BsnTech wrote:
> On 1/2/2018 6:09 PM, Paolo Bonzini wrote:
>> On 02/01/2018 19:21, Brian Spraker -- BsnTech wrote:
>>> Hello all,
>>>
>>> Just recently purchased a Samsung 960 Evo NVMe drive.  It is on a PCI
>>> Express add-in card in a PCIe 2.0 x4 slot since the motherboard itself
>>> does not have an M.2 slot.
>>>
>>> On the main host machine, I ran "hdparm -Tt --direct /dev/nvme0n1" and
>>> get the following:
>>>
>>>  Timing O_DIRECT cached reads:   2578 MB in  2.00 seconds = 1289.17 MB/sec
>>>  Timing O_DIRECT disk reads:     3962 MB in  3.00 seconds = 1320.20 MB/sec
>>>
>>> The drive has one Ext4 partition on it.  Mounted with "noatime" to the
>>> /VMs mount point.
>>>
>>> Host server is running Ubuntu server 16.04, and running "kvm --version"
>>> shows QEMU emulator version 2.5.0.
>>>
>>> When I run the same "hdparm -Tt --direct /dev/vda" on the guests, it
>>> shows quite a bit less:
>>>
>>>  Timing O_DIRECT cached reads:   1456 MB in  2.00 seconds = 727.51 MB/sec
>>>  Timing O_DIRECT disk reads:     2730 MB in  3.00 seconds = 909.76 MB/sec
>>>
>>> Guest machines are set up with 10 GB of memory, 8 CPUs (host CPU config
>>> copied), virtio for the disk bus, a raw image file, and no cache.  Guest
>>> machines are also Ubuntu 16.04.
>>>
>>> Before the upgrade, I was using an SSD in SATA III mode.  The host and
>>> guest disk reads were only a few MB/sec apart.
>>>
>>> Is there something else I need to look at to see why the guests are only
>>> getting about half the throughput of the drive?
>> It's hard to say this conclusively without knowing the exact I/O pattern
>> that hdparm is using or the results for your SATA SSD.
>>
>> However, I'd guess your new disk is faster and you're now CPU bound.
>> Unfortunately, the cost of an interrupt is roughly doubled by
>> virtualization (because you have to go disk->host->QEMU->guest) and so
>> is the latency.
>>
>> If hdparm is only issuing 1 I/O operation at a time, throughput is the
>> reciprocal of the latency: double the latency, and the throughput is
>> halved.
>>
>> Try using aio=native and adding a dedicated iothread for the disk.  That
>> can give you better throughput, especially if the queue depth (# of I/O
>> operations active at any one time during the benchmark) is >1.
>>
>> Thanks,
>>
>> Paolo
> Thank you Paolo.  I changed aio=native and this did increase
> performance.  Now the results on the guest are:
>
>  Timing O_DIRECT cached reads:   2252 MB in  2.00 seconds = 1125.55 MB/sec
>  Timing O_DIRECT disk reads:     2874 MB in  3.00 seconds = 957.47 MB/sec
>
> The cached reads are quite a bit different, and the disk reads went up by
> about 50 MB/sec - still 400 MB/sec under what the host can do.
>
> The machine has an eight-core AMD FX-8350 processor and on average it is
> only using 5% CPU capacity.  Very underutilized, but I don't know if that
> makes any difference or not.

When I say CPU bound, I mean bound by the latency of the CPU's job.
That is:

- for the host OS, responding to the interrupt and signaling completion
  of the I/O operation to QEMU;

- for QEMU, responding to the completion and forwarding it to the guest OS;

- for the guest OS, responding to the interrupt and signaling completion
  of the I/O operation to hdparm;

- and all the way back from hdparm to the host OS.

> Any guidance on how to make a dedicated iothread for the disk?

Add this inside <domain>:

    <iothreads>1</iothreads>

and add iothread='1' to the disk's <driver> element.
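For reference, a rough sketch of how the two pieces fit together in the
domain XML (cache='none' and io='native' are libvirt's spellings of
cache=none and aio=native; the source path and image name below are only
placeholders for your own setup):

    <domain type='kvm'>
      ...
      <iothreads>1</iothreads>
      <devices>
        <disk type='file' device='disk'>
          <!-- no host page cache, Linux native AIO, served by iothread 1 -->
          <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
          <source file='/VMs/guest.img'/>  <!-- placeholder path -->
          <target dev='vda' bus='virtio'/>
        </disk>
      </devices>
    </domain>

The change only takes effect after the guest is shut down and started
again; "virsh dumpxml <guest>" should then show the iothread attribute
on the driver line.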
Thanks,

Paolo

> For the SATA SSD results, this is what it reads on the host:
>
> /dev/sda:
>  Timing O_DIRECT cached reads:   926 MB in  2.00 seconds = 462.39 MB/sec
>  Timing O_DIRECT disk reads:     1426 MB in  3.00 seconds = 475.31 MB/sec
>
> And on the guest:
>
> /dev/vda:
>  Timing O_DIRECT cached reads:   1166 MB in  2.00 seconds = 582.71 MB/sec
>  Timing O_DIRECT disk reads:     792 MB in  3.02 seconds = 262.52 MB/sec
>
> So I was mistaken - the cached reads are somehow higher on the guest, but
> the disk reads are 200 MB/sec slower.