Re: [RFC 00/10] KVM: Add TMEM host/guest support

On 06/12/2012 04:18 AM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:avi@xxxxxxxxxx]
>> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>> 
>> On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
>> > > >> This is pretty steep.  We have flash storage doing a million iops/sec,
>> > > >> and here you add 19 microseconds to that.
>> > > >
>> > > > Might be interesting to test it with flash storage as well...
>> >
>> > Well, to be fair, you are comparing a device that costs many
>> > thousands of $US to a software solution that uses idle CPU
>> > cycles and no additional RAM.
>> 
>> You don't know that those cycles are idle.  And when in fact you have no
>> additional RAM, those cycles are wasted to no benefit.
>> 
>> The fact that I/O is being performed doesn't mean that we can waste
>> cpu.  Those cpu cycles can be utilized by other processes on the same
>> guest or by other guests.
> 
> You're right of course, so I apologize for oversimplifying... but
> so are you.  Let's take a step back:
> 
> IMHO, a huge part (majority?) of computer science these days is
> trying to beat Amdahl's law.  On many machines/workloads,
> especially in virtual environments, RAM is the bottleneck.
> Tmem's role is, when RAM is the bottleneck, to increase effective
> RAM size AND, in a multi-tenant environment, flexibility, at the
> cost of CPU cycles.  But tmem is also designed to be very
> dynamically flexible, so that it either has low CPU cost when it
> is not being used OR can be dynamically disabled/re-enabled with
> reasonably low overhead.
> 
> Why I think you are oversimplifying:  "those cpu cycles can be
> utilized by other processes on the same guest or by other
> guests" pre-supposes that cpu availability is the bottleneck.
> It would be interesting, if it were possible, to measure how
> many systems (with modern processors) this is actually true for.
> I'm not arguing that they don't exist, but I suspect they are
> fairly rare these days, even for KVM systems.

On a given host, either cpu or memory is the bottleneck.  If you have
both free memory and free cycles, you pack more guests onto that
machine.  During off-peak hours you may have both, but we need to see
what happens during the peak; off-peak we're doing okay.

So on such a host, during peak, either the cpu is churning away and we
can't spare those cycles for tmem, or memory is packed full of guests
and tmem won't provide much benefit (but will still consume those
cycles).

> 
>> > > Batching will drastically reduce the number of hypercalls.
>> >
>> > For the record, batching CAN be implemented... ramster is essentially
>> > an implementation of batching where the local system is the "guest"
>> > and the remote system is the "host".  But with ramster the
>> > overhead to move the data (whether batched or not) is much MUCH
>> > worse than a hypercall, and ramster still shows a performance advantage.
>> 
>> Sure, you can buffer pages in memory but then you add yet another copy.
>> I know you think copies are cheap but I disagree.
> 
> I only think copies are *relatively* cheap.  Orders of magnitude
> cheaper than some alternatives.  So if it takes two page copies
> or even ten to replace a disk access, yes I think copies are cheap.
> (But I do understand your point.)

The copies are cheaper than a disk access, yes, but you need to factor
in the probability of a disk access actually being saved.  cleancache
already works on the tail end of the LRU; we're dumping those pages
because they have low access frequency, so the probability starts out
low.  If many guests are active (so we need the cpu resources), then
they also compete for tmem resources, and per-guest it becomes less
effective as well.
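To make "factor in the probability" concrete, here is a back-of-envelope
break-even sketch.  Only the 19-microsecond put cost comes from this
thread; the device latencies are round-number assumptions, and the
opportunity cost of the cpu (the point above) is ignored:

/* Rough break-even sketch: a cleancache put only pays off if the page
 * is later read back, so expected benefit = hit_rate * device_latency,
 * while the per-page cost is paid unconditionally on every eviction.
 * The device latencies are assumptions; the 19us is from the thread. */
#include <stdio.h>

int main(void)
{
	double put_cost_us = 19.0;	/* per-page copy + hypercall, quoted above */
	double disk_us = 5000.0;	/* assumed rotating-disk read */
	double flash_us = 80.0;		/* assumed fast flash read */

	printf("break-even hit rate vs disk:  %.1f%%\n",
	       100.0 * put_cost_us / disk_us);
	printf("break-even hit rate vs flash: %.1f%%\n",
	       100.0 * put_cost_us / flash_us);
	return 0;
}

Against a slow disk the required hit rate looks tiny, but against the
flash numbers mentioned earlier in the thread it is not, and either way
the put cost is charged on every page falling off the LRU, not only on
the pages that are eventually read back.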

> 
>> > So, IMHO, one step at a time.  Get the foundation code in
>> > place and tune it later if a batching implementation can
>> > be demonstrated to improve performance sufficiently.
>> 
>> Sorry, no, first demonstrate no performance regressions, then we can
>> talk about performance improvements.
> 
> Well that's an awfully hard bar to clear, even with any of the many
> changes being merged every release into the core Linux mm subsystem.
> Any change to memory management will have some positive impacts on some
> workloads and some negative impacts on others.

Right, that's too harsh.  But these benchmarks show a doubling (or even
more) of cpu overhead, and that holds whether the cache is effective or
not.  That is simply too much to consider.

Look at the block, vfs, and mm layers.  Great pains have been taken to
batch everything and avoid per-page work -- 20 years of never having
enough cycles.  And here you throw all of this out the window with a
per-page crossing of the guest/host boundary.
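For contrast, a batched put path could look something like the sketch
below.  This is purely illustrative -- tmem_put_page(),
tmem_hypercall_put_batch() and the descriptor layout are invented for
the example, not taken from the posted series -- and, per Dan's point
above, buffering like this means either pinning the pages until the
flush or paying an extra copy into the buffer:

/* Hypothetical batching sketch (not from the posted series): buffer put
 * descriptors and cross the guest/host boundary once per batch instead
 * of once per page.  Locking, error handling and a flush-on-idle timer
 * are omitted. */
#include <stddef.h>
#include <stdint.h>

#define TMEM_BATCH_PAGES 32

struct tmem_put_desc {
	uint32_t pool_id;	/* which tmem pool the page belongs to */
	uint64_t object;	/* e.g. inode number */
	uint64_t index;		/* page offset within the object */
	uint64_t pfn;		/* guest frame holding the data */
};

static struct tmem_put_desc tmem_batch[TMEM_BATCH_PAGES];
static size_t tmem_batch_len;

/* Stand-in for the single hypercall that would hand 'n' puts to the host. */
static int tmem_hypercall_put_batch(const struct tmem_put_desc *descs, size_t n)
{
	(void)descs;
	(void)n;
	return 0;		/* the real call would trap to the host here */
}

static void tmem_flush_batch(void)
{
	if (tmem_batch_len) {
		tmem_hypercall_put_batch(tmem_batch, tmem_batch_len);
		tmem_batch_len = 0;
	}
}

static void tmem_put_page(uint32_t pool, uint64_t object, uint64_t index,
			  uint64_t pfn)
{
	tmem_batch[tmem_batch_len++] = (struct tmem_put_desc){
		.pool_id = pool, .object = object, .index = index, .pfn = pfn,
	};
	if (tmem_batch_len == TMEM_BATCH_PAGES)
		tmem_flush_batch();	/* one crossing amortized over 32 pages */
}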

> 
>> > Yes, tmem has
>> > overhead but since the overhead only occurs where pages
>> > would otherwise have to be read/written from disk, the
>> > overhead is well "hidden".
>> 
>> The overhead is NOT hidden.  We spent a lot of effort tuning virtio-blk
>> to reduce its overhead, and now you add 6-20 microseconds per page.  A
>> guest may easily be reading a quarter million pages per second; this
>> adds up very fast - at the upper end you're consuming 5 vcpus just for tmem.
>> 
>> Note that you don't even have to issue I/O to get a tmem hypercall
>> invoked.  Allocate a ton of memory and you get cleancache calls for
>> each page that passes through the tail of the LRU.  Again with the upper
>> end, allocating a gigabyte can now take a few seconds extra.
> 
> Though not precisely so, we are arguing throughput vs. latency here,
> and the two can't always be compared directly.
> 
> And if, in allocating a GB of memory, you are tossing out useful
> pagecache pages, and those pagecache pages could instead be preserved
> by tmem, thus saving N page faults and order(N) disk accesses,
> your savings are a false economy.  I think Sasha's numbers
> demonstrate that nicely.


It depends.  If you have an 8GB guest, then saving the tail end of an
8GB LRU may improve your caching or it may not.  But the impact on that
allocation is certain.  You're trading off a possible marginal
improvement for an unconditional performance degradation.

> 
> Anyway, as I've said all along, let's look at the numbers.
> I've always admitted that tmem on an old uniprocessor should
> be disabled.  If no performance degradation in that environment
> is a requirement for KVM-tmem to be merged, that is certainly
> your choice.  And if "more CPU cycles used" is a metric,
> definitely, tmem is not going to pass because that's exactly
> what it's doing: trading more CPU cycles for better RAM
> efficiency == fewer disk accesses.

Again, the cpu cycles spent are certain, and they double the effort
needed to get those pages in the first place.  The disk accesses saved
will depend on the workload and on host memory availability.  Turning
tmem on will certainly generate performance regressions as well as
improvements.  Maybe on Xen the tradeoff is different (hypercalls ought
to be faster on xenpv), but the numbers I saw on kvm aren't good.
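For reference, the arithmetic behind those per-page figures is simple;
the only inputs below are the 6-20 microseconds per page and the
quarter-million pages per second quoted earlier in this thread:

/* Arithmetic behind the "5 vcpus" figure quoted earlier in the thread:
 * pages/sec * per-page overhead = cpu-seconds consumed per second. */
#include <stdio.h>

int main(void)
{
	const double pages_per_sec = 250000.0;		/* "a quarter million pages per second" */
	const double per_page_us[] = { 6.0, 20.0 };	/* overhead range quoted above */

	for (int i = 0; i < 2; i++) {
		double vcpus = pages_per_sec * per_page_us[i] / 1e6;
		printf("%2.0f us/page -> %.1f cpu-seconds per second (~%.1f vcpus)\n",
		       per_page_us[i], vcpus, vcpus);
	}
	return 0;
}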


-- 
error compiling committee.c: too many arguments to function

