On 10/11/2012 12:57 AM, Andrew Theurer wrote:
On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
On 10/10/2012 07:54 PM, Andrew Theurer wrote:
I ran 'perf sched map' on the dbench workload for medium and large VMs,
and I thought I would share some of the results. I think it helps to
visualize what's going on regarding the yielding.
These files are png bitmaps, generated from processing output from 'perf
sched map' (and perf data generated from 'perf sched record'). The Y
axis is the host cpus, each row being 10 pixels high. For these tests,
there are 80 host cpus, so the total height is 800 pixels. The X axis
is time (in microseconds), with each pixel representing 1 microsecond.
Each bitmap plots 30,000 microseconds. The bitmaps are quite wide
obviously, and zooming in/out while viewing is recommended.
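The layout described above can be sketched as a small coordinate mapping. This is only an illustration of the stated geometry (80 rows of 10 pixels, 1 pixel per microsecond, a 30,000 microsecond window); the function names are hypothetical and this is not the script that produced the bitmaps.

```python
# Layout constants taken from the description above.
HOST_CPUS = 80        # one row per host cpu
ROW_HEIGHT_PX = 10    # each row is 10 pixels high
US_PER_PIXEL = 1      # each pixel covers 1 microsecond
WINDOW_US = 30_000    # each bitmap plots 30,000 microseconds


def bitmap_size():
    """Return (width, height) of one bitmap in pixels."""
    return WINDOW_US // US_PER_PIXEL, HOST_CPUS * ROW_HEIGHT_PX


def pixel_rect(cpu, start_us, end_us):
    """Return (x0, y0, x1, y1), end-exclusive, for one run of a thread on `cpu`.

    Times are microseconds relative to the start of the plotted window;
    anything outside the window is clipped. (Hypothetical helper.)
    """
    if not 0 <= cpu < HOST_CPUS:
        raise ValueError("cpu out of range")
    x0 = max(0, start_us) // US_PER_PIXEL
    x1 = min(WINDOW_US, end_us) // US_PER_PIXEL
    y0 = cpu * ROW_HEIGHT_PX
    return x0, y0, x1, y0 + ROW_HEIGHT_PX
```

With these constants each bitmap is 30,000 pixels wide and 800 pixels high, which is why zooming is needed.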
Each row (each host cpu) is assigned a color based on what thread is
running. vCPUs of the same VM are assigned a common color (like red,
blue, magenta, etc), and each vCPU has a unique brightness for that
color. There are a maximum of 12 assignable colors, so any VMs beyond the
first 12 revert to a vCPU color of gray. I would use more colors, but it becomes
harder to distinguish one color from another. The white color
represents missing data from perf, and black color represents any thread
which is not a vCPU.
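One way to express the coloring rules above, as a minimal sketch. The palette order beyond red, blue and magenta, the exact RGB values, the gray level, and the brightness range (40% to 100%) are all assumptions made for illustration; only the rules themselves (12 colors, per-vCPU brightness, gray for extra VMs, white for missing data, black for non-vCPU threads) come from the description.

```python
MAX_COLORED_VMS = 12

# Hypothetical palette: only red, blue and magenta are named in the text.
BASE_COLORS = [
    (255, 0, 0), (0, 0, 255), (255, 0, 255), (0, 255, 0),
    (0, 255, 255), (255, 255, 0), (255, 128, 0), (128, 0, 255),
    (0, 128, 255), (255, 0, 128), (128, 255, 0), (0, 255, 128),
]
GRAY = (128, 128, 128)    # VMs beyond the first 12 (assumed gray level)
WHITE = (255, 255, 255)   # missing data from perf
BLACK = (0, 0, 0)         # any thread which is not a vCPU


def vcpu_color(vm_index, vcpu_index, vcpus_per_vm):
    """Common color per VM, with a distinct brightness per vCPU."""
    if vm_index >= MAX_COLORED_VMS:
        return GRAY
    if vcpus_per_vm > 1:
        # Assumed brightness ramp from 40% to 100% across the VM's vCPUs.
        scale = 0.4 + 0.6 * vcpu_index / (vcpus_per_vm - 1)
    else:
        scale = 1.0
    return tuple(round(c * scale) for c in BASE_COLORS[vm_index])


def cell_color(sample, vcpus_per_vm):
    """`sample` is None (no data), "other" (non-vCPU), or (vm_index, vcpu_index)."""
    if sample is None:
        return WHITE
    if sample == "other":
        return BLACK
    vm_index, vcpu_index = sample
    return vcpu_color(vm_index, vcpu_index, vcpus_per_vm)
```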
For the following tests, VMs were pinned to host NUMA nodes and to
specific cpus to help with consistency and operate within the
constraints of the last test (gang scheduler).
Here is a good example of PLE. These are 10-way VMs, 16 of them (as
described above, only 12 of the VMs have a color; the rest are gray).
https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
This is a very nice way to visualize what is happening. The beginning of
the graph looks a little messy, but later it is clear.
If you zoom out and look at the whole bitmap, you may notice the 4ms
intervals of the scheduler. They are pretty well aligned across all
cpus. Normally, for cpu bound workloads, we would expect each thread to
run for 4ms, then something else to get to run, and so on.
That is mostly true in this test. We have 2x over-commit and we
generally see the switching of threads at 4ms. One thing to note is
that not all vCPU threads for the same VM run at exactly the same time,
and that is expected and the whole reason for lock-holder preemption.
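The 2x over-commit figure follows from the numbers already given (16 VMs, 10 vCPUs each, 80 host cpus). A tiny sketch of the arithmetic, with a hypothetical helper name and assuming all 80 host cpus are available to the VMs:

```python
def overcommit_ratio(num_vms, vcpus_per_vm, host_cpus):
    """Total vCPUs divided by the host cpus they share."""
    if host_cpus <= 0:
        raise ValueError("host_cpus must be positive")
    return (num_vms * vcpus_per_vm) / host_cpus


# 16 VMs x 10 vCPUs = 160 vCPUs on 80 host cpus
print(overcommit_ratio(16, 10, 80))  # prints 2.0
```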
Now, if you zoom in on the bitmap, you should notice within the 4ms
intervals there is some task switching going on. This is most likely
because of the yield_to initiated by the PLE handler. In this case
there is not that much yielding to do. It's quite clean, and the
performance is quite good.
Below is an example of PLE, but this time with 20-way VMs, 8 of them.
CPU over-commit is still 2x.
https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
I think this link is still the 10x16 one. Could you paste the link again?
Oops
https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ
This one looks quite different. In short, it's a mess. The interval
between task switches can be lower than 10 microseconds. It basically
never recovers. There is constant yielding.
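To put a number on how often a cpu switches tasks this quickly, one could measure the gaps between consecutive context switches on a single cpu. The sketch below assumes the switch timestamps (in microseconds) have already been extracted from the perf data into a sorted list; that extraction step, the function names, and the 10 microsecond default threshold are assumptions for illustration.

```python
def run_lengths(switch_times_us):
    """Durations between consecutive context switches on one cpu.

    `switch_times_us` is a sorted list of timestamps in microseconds.
    """
    return [b - a for a, b in zip(switch_times_us, switch_times_us[1:])]


def short_run_fraction(switch_times_us, threshold_us=10):
    """Fraction of runs shorter than `threshold_us` microseconds."""
    runs = run_lengths(switch_times_us)
    if not runs:
        return 0.0
    return sum(1 for r in runs if r < threshold_us) / len(runs)
```

A cpu that mostly switches on the 4ms scheduler interval would give a fraction near 0, while constant yielding would push it toward 1.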
Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
scheduling patches. While I am not recommending gang scheduling, I
think it's a good data point. The performance is 3.88x the PLE result.
https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
Yes, we see a lot of yields.