Re: [RFC 00/10] KVM: Add TMEM host/guest support

On 06/11/2012 01:26 PM, Sasha Levin wrote:
>> 
>> Strange that system time is lower with cache=writeback.
> 
> Maybe because these pages don't get written out immediately? I don't
> have a better guess.

From the guest point of view, it's the same flow.  btw, this is a read,
so the difference would be readahead, not write-behind, but the
difference in system time is still unexplained.

> 
>> > And finally, KVM TMEM on, caching=none:
>> > 
>> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
>> > 	2048+0 records in
>> > 	2048+0 records out
>> > 	8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
>> > 
>> > 	real    1m59.123s
>> > 	user    0m0.020s
>> > 	sys     0m29.336s
>> > 
>> > 	sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
>> > 	2048+0 records in
>> > 	2048+0 records out
>> > 	8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
>> > 
>> > 	real    0m36.950s
>> > 	user    0m0.005s
>> > 	sys     0m35.308s
>> 
>> So system time more than doubled compared to non-tmem cache=none.  The
>> overhead per page is 17s / (8589934592/4096) = 8.1usec.  Seems quite high.
> 
> Right, but consider it didn't increase real time at all.

Real time is bounded by disk bandwidth.  It's a consideration of course,
and all forms of caching increase cpu utilization for the cache miss
case, but in this case the overhead is excessive due to the lack of
batching and due to compression overhead.

> 
>> 'perf top' while this is running would be interesting.
> 
> I'll update later with this.
> 
>> > This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
>> > 
>> > 	kvm statistics
>> > 
>> > 	 kvm_exit                                   1952342   36037
>> > 	 kvm_entry                                  1952334   36034
>> > 	 kvm_hypercall                              1710568   33948
>> 
>> In that test, 56k pages/sec were transferred.  Why are we seeing only
>> 33k hypercalls/sec?  Shouldn't we have two hypercalls/page (one when
>> evicting a page to make some room, one to read the new page from tmem)?
> 
> The guest doesn't do eviction at all, in fact - it doesn't know how big
> the cache is so even if it wanted to, it couldn't evict pages (the only
> thing it does is invalidate pages which have changed in the guest).

IIUC, when the guest reads a page, it first has to make room in its own
pagecache; before dropping a clean page, it hands the page to
cleancache, which issues a hypercall that compresses and stores it on
the host.  Next, a page is allocated and a second cleancache hypercall
is made to check whether the page is in host tmem.  So two hypercalls
per page, once the guest pagecache is full.
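
Very roughly, I'd expect the guest side to be shaped something like the
sketch below.  This is not the code from the RFC -- the hypercall
numbers and the argument packing are invented for illustration, and
only the cleancache_ops hooks are the stock kernel interface:

/*
 * Sketch only -- the KVM_HC_TMEM_* numbers and the argument packing
 * are invented, not taken from the RFC.
 */
#include <linux/mm.h>
#include <linux/cleancache.h>
#include <asm/kvm_para.h>

#define KVM_HC_TMEM_PUT	100	/* invented hypercall numbers */
#define KVM_HC_TMEM_GET	101

/* Hypercall #1: the guest drops a clean pagecache page to make room,
 * and cleancache hands it to the host, which may compress and keep it. */
static void kvm_tmem_put_page(int pool, struct cleancache_filekey key,
			      pgoff_t index, struct page *page)
{
	kvm_hypercall4(KVM_HC_TMEM_PUT, pool, key.u.ino, index,
		       page_to_pfn(page) << PAGE_SHIFT);
}

/* Hypercall #2: on a pagecache miss, ask the host whether it still has
 * the page before going to disk; 0 means the page was filled in. */
static int kvm_tmem_get_page(int pool, struct cleancache_filekey key,
			     pgoff_t index, struct page *page)
{
	return kvm_hypercall4(KVM_HC_TMEM_GET, pool, key.u.ino, index,
			      page_to_pfn(page) << PAGE_SHIFT);
}

static struct cleancache_ops kvm_tmem_cleancache_ops = {
	.put_page = kvm_tmem_put_page,
	.get_page = kvm_tmem_get_page,
	/* invalidate_page/inode/fs hooks omitted */
};

So once the pagecache is warm, every freshly read page costs at least
one exit on eviction and one on fill, which is where the per-page
overhead comes from.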

> 
> This means it only takes one hypercall/page instead of two.
>> > 
>> > Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
>> > 
>> > Zeros:
>> > 	12800+0 records in
>> > 	12800+0 records out
>> > 	53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
>> > 
>> > 	real    11m44.536s
>> > 	user    0m0.088s
>> > 	sys     2m0.639s
>> > 	12800+0 records in
>> > 	12800+0 records out
>> > 	53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
>> > 
>> > 	real    11m30.561s
>> > 	user    0m0.088s
>> > 	sys     1m57.637s
>> 
>> zcache appears not to be helping at all; it's just adding overhead.  Is
>> even the compressed file too large?
>> 
>> overhead = 1.4 usec/page.
> 
> Correct, I've had to further increase the size of this file so that
> zcache would fail here as well. The good case was tested before, here I
> wanted to see what will happen with files that wouldn't have much
> benefit from both regular caching and zcache.

Well, zeroes are not a good test for this, since they minimize zcache
allocation overhead.

>> 
>> 
>> Overhead = 19 usec/page.
>> 
>> This is pretty steep.  We have flash storage doing a million iops/sec,
>> and here you add 19 microseconds to that.
> 
> Might be interesting to test it with flash storage as well...

Try http://sg.danny.cz/sg/sdebug26.html.  You can use it to emulate a
large fast block device without needing tons of RAM (but you can still
populate it with nonzero data).

If you are using qemu, append ,aio=native to the -drive option to
minimize overhead further.

> 
>> > 
>> > 
>> > This is a snapshot of kvm_stats while this test was running:
>> > 
>> > 	kvm statistics
>> > 
>> > 	 kvm_entry                                   168179   20729
>> > 	 kvm_exit                                    168179   20728
>> > 	 kvm_hypercall                               131808   16409
>> 
>> The last test was running 19k pages/sec, doesn't quite fit with this
>> measurement.  Is the measurement stable or fluctuating?
> 
> It's pretty stable when running the "zero" pages, but when switching to
> random files it somewhat fluctuates.

Well, weird.

> 
>> > 
>> > And finally, KVM TMEM enabled, with caching=writeback:
>> 
>> I'm not sure what the point of this is?  You have two host-caching
>> mechanisms running in parallel, are you trying to increase overhead
>> while reducing effective cache size?
> 
> I thought that you've asked for this test:
> 
> On Wed, 2012-06-06 at 16:24 +0300, Avi Kivity wrote:
>> while cache=writeback with cleancache enabled in the host should
>> give the same effect, but with the extra hypercalls, but with an extra
>> copy to manage the host pagecache.  It would be good to see results for all three settings.
> 

Ah, so it's an even worse worst case.  But somehow it's better than
cache=none?

>> My conclusion is that the overhead is quite high, but please double
>> check my numbers, maybe I missed something obvious.
> 
> I'm not sure what options I have to lower the overhead here, should I be
> using something other than hypercalls to communicate with the host?
> 
> I know that there are several things being worked on from zcache
> perspective (WasActive, batching, etc), but is there something that
> could be done within the scope of kvm-tmem?
> 
> It would be interesting in seeing results for Xen/TMEM and comparing
> them to these results.

Batching will drastically reduce the number of hypercalls.  An
alternative is to use ballooning to feed the guest free memory so that
it doesn't need to hypercall at all.  Deciding how to divide free
memory among the guests is hard (but then so is deciding how to divide
tmem memory among guests), and adding dedup on top of that is also hard
(ksm?  zksm?).  IMO, letting the guest have the memory and manage it on
its own will be much simpler and faster than the constant chatting that
has to go on if the host manages this memory.
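
If you do go the batching route, the guest-side shape could be as
simple as something like this (purely hypothetical -- nothing like this
exists in the RFC, and all the names here are invented):

#include <linux/mm.h>
#include <linux/spinlock.h>
#include <asm/kvm_para.h>

#define KVM_HC_TMEM_PUT_BATCH	102	/* invented hypercall number */
#define TMEM_BATCH_MAX		64	/* tuples per flush, arbitrary */

struct tmem_batch_entry {
	u64 pool, key, index;
	u64 gpa;	/* guest-physical address of the page to put */
};

static struct tmem_batch {
	u32 nr;
	struct tmem_batch_entry entries[TMEM_BATCH_MAX];
} tmem_put_batch;
static DEFINE_SPINLOCK(tmem_batch_lock);

/* One exit amortized over up to TMEM_BATCH_MAX puts. */
static void tmem_batch_flush(void)
{
	if (tmem_put_batch.nr) {
		kvm_hypercall2(KVM_HC_TMEM_PUT_BATCH,
			       __pa(&tmem_put_batch), tmem_put_batch.nr);
		tmem_put_batch.nr = 0;
	}
}

/* Queue a put; only every TMEM_BATCH_MAX-th call actually exits. */
static void tmem_batch_put(u64 pool, u64 key, u64 index, struct page *page)
{
	unsigned long flags;

	spin_lock_irqsave(&tmem_batch_lock, flags);
	tmem_put_batch.entries[tmem_put_batch.nr++] =
		(struct tmem_batch_entry) {
			.pool = pool, .key = key, .index = index,
			.gpa = page_to_pfn(page) << PAGE_SHIFT,
		};
	if (tmem_put_batch.nr == TMEM_BATCH_MAX)
		tmem_batch_flush();
	spin_unlock_irqrestore(&tmem_batch_lock, flags);
}

A real implementation would of course want per-cpu buffers instead of a
global lock, and would have to flush (or search the pending batch)
before servicing a get or an invalidate, but it shows how the put-side
exits can be amortized.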

-- 
error compiling committee.c: too many arguments to function

