> > Simple policies must exist and must be enforced by the hypervisor
> > to ensure this doesn't happen.  Xen+tmem provides these policies
> > and enforces them.  And it enforces them very _dynamically_ to
> > constantly optimize RAM utilization across multiple guests, each
> > with dynamically varying RAM usage.  Frontswap fits nicely into
> > this framework.
>
> Can you explain what "enforcing" means in this context?  You loaned
> the guest some pages, can you enforce their return?

We're getting into hypervisor policy issues, but given that probably
nobody else is listening by now, I guess that's OK. ;-)

The enforcement is on the "put" side.  The page is not loaned, it is
freely given, but only if the guest is within its contractual
limitations (e.g. within its predefined "maxmem").  If the guest
chooses never to remove the pages from frontswap, that's the guest's
option, but that part of the guest's memory allocation can never be
used for anything else, so it is in the guest's self-interest to
"get" or "flush" the pages from frontswap.

> > Huge performance hits that are completely inexplicable to a user
> > give virtualization a bad reputation.  If the user (i.e. guest,
> > not host, administrator) can at least see "Hmmm... I'm doing a lot
> > of swapping, guess I'd better pay for more (virtual) RAM", then
> > the user objections are greatly reduced.
>
> What you're saying is "don't overcommit".

Not at all.  I am saying "overcommit, but do it intelligently".

> That's a good policy for some scenarios but not for others.  Note it
> applies equally well for cpu as well as memory.

Perhaps, but CPU overcommit has been a well-understood part of
computing for a very long time, and users, admins, and hosting
providers all know how to recognize it and deal with it.  Not so with
overcommitment of memory; the only exposure most users have to memory
limitations is "my disk light is flashing a lot, I'd better buy more
RAM".  Obviously, this doesn't translate to virtualization very well.

And, as for your interrupt latency analogy, let's revisit that if/when
Xen or KVM support CPU overcommitment for real-time-sensitive guests.
Until then, your analogy is misleading.

> frontswap+tmem is not overcommit, it's undercommit.  You have spare
> memory, and you give it away.  It isn't a replacement.  However,
> without the means to reclaim this spare memory, it can result in
> overcommit.

But you are missing part of the magic: once a memory page is no longer
directly addressable (and therefore not directly writable) by the
guest, the hypervisor can do interesting things with it, such as
compression and deduplication.  As a result, the sum of pages used by
all the guests can exceed the total pages of RAM in the system.  Thus,
overcommitment.  I agree that the degree of overcommitment is less
than is possible with host-swapping, but none of the evil issues of
host-swapping arise.  Again, this is "intelligent overcommitment".
Other existing forms are "overcommit and cross your fingers that bad
things don't happen."

> > Xen+tmem uses the SAME internal kernel interface.  The Xen-specific
> > code which performs the Xen-specific stuff (hypercalls) is only in
> > the Xen-specific directory.
>
> This makes it an external interface.
> :
> Something completely internal to the guest can be replaced by
> something completely different.  Something that talks to a hypervisor
> will need those hooks forever to avoid regressions.

Uh, no.  As I've said, everything about frontswap is entirely optional,
both at compile-time and run-time.  A frontswap-enabled guest is fully
compatible with a hypervisor with no frontswap; a frontswap-enabled
hypervisor is fully compatible with a guest with no frontswap.  The
only thing that is reserved forever is a hypervisor-specific "hypercall
number", which is not exposed in the Linux kernel except in
Xen-specific code.  And, for Xen, frontswap shares the same hypercall
number with cleancache.

So, IMHO, you are being alarmist.  This is not an "API maintenance"
problem for Linux.
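To make the "put"-side enforcement described above concrete, here is a
minimal, self-contained sketch.  It is plain userspace C, not the
Xen/tmem or frontswap code, and every name in it (tmem_put, tmem_flush,
GUEST_MAXMEM) is invented for illustration: a put is accepted only
while the guest is within its contractual limit, a refused put simply
means the guest falls back to its ordinary swap device, and a flush
releases the hypervisor's copy, which is why releasing pages is in the
guest's own interest.

/*
 * Toy model (NOT the Xen/tmem code) of the "put"-side enforcement
 * described above.  A put is accepted only while the guest is within
 * its limit ("maxmem"); a refused put means the guest writes the page
 * to its ordinary swap device instead; a flush releases the
 * hypervisor's copy.  All names are invented for this sketch.
 */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE    4096
#define GUEST_MAXMEM 4        /* pages the hypervisor will hold for this guest */

struct guest_account {
	unsigned pages_held;  /* pages currently stored in "tmem" */
};

/* Hypervisor side: accept the page only if the guest is within maxmem. */
static bool tmem_put(struct guest_account *g, const void *page)
{
	if (g->pages_held >= GUEST_MAXMEM)
		return false;  /* refused -- guest must use its real swap device */
	(void)page;            /* a real hypervisor would copy/compress/dedup here */
	g->pages_held++;
	return true;
}

/* Guest side: a get or flush releases the page held by the hypervisor,
 * which is why doing so is in the guest's own self-interest. */
static void tmem_flush(struct guest_account *g)
{
	if (g->pages_held)
		g->pages_held--;
}

int main(void)
{
	struct guest_account g = { 0 };
	char page[PAGE_SIZE];

	memset(page, 0xA5, sizeof(page));

	for (int i = 0; i < 6; i++)
		printf("put %d -> %s\n", i,
		       tmem_put(&g, page) ? "accepted" : "refused (swap to disk)");

	tmem_flush(&g);        /* guest releases one page... */
	printf("after flush, put -> %s\n",
	       tmem_put(&g, page) ? "accepted" : "refused (swap to disk)");
	return 0;
}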
> Exactly as large as the swap space which the guest would have in the
> frontswap+tmem case.
> :
> Not needed, though I expect it is already supported (SAN volumes do
> grow).
> :
> If block layer overhead is a problem, go ahead and optimize it instead
> of adding new interfaces to bypass it.  Though I expect it wouldn't be
> needed, and if any optimization needs to be done it is in the swap
> layer.  Optimizing swap has the additional benefit of improving
> performance on flash-backed swap.
> :
> What happens when no tmem is available?  You swap to a volume.  That's
> the disk size needed.
> :
> Your dynamic swap is limited too.  And no, no guest modifications.

You keep saying you are going to implement all of the dynamic features
of frontswap with no changes to the guest, no copying, and no
host-swapping.  You are being disingenuous.  VMware has had a lot of
people working on virtualization for a lot longer than you or I have;
don't you think they would have done this by now?  Frontswap exists
today and is even shipping in real released products.  If you can work
your magic (in Xen... I am not trying to claim frontswap should work
with KVM), please show us the code.

> So, you take a synchronous copyful interface, add another copy to make
> it into an asynchronous interface, instead of using the original
> asynchronous copyless interface.

"Add another copy" is not required any more than it is with the other
examples you cited.  The "original asynchronous copyless interface"
works because DMA for devices has been around for >40 years and has
been greatly refined.  We're not talking about DMA to a device here,
we're talking about DMA from one place in RAM to another (i.e. from
guest RAM to hypervisor RAM).  Do you have examples of DMA engines
that move page-sized chunks from RAM to RAM more efficiently than the
CPU can copy them?

> The networking stack seems to think 4096 bytes is a good size for dma
> (see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).

Networking is device-to-RAM, not RAM-to-RAM.

> When swapping out, Linux already batches pages in the block device's
> request queue.  Swapping out is inherently asynchronous and batched,
> you're swapping out those pages _because_ you don't need them, and
> you're never interested in swapping out a single page.  Linux already
> reserves memory for use during swapout.  There's no need to re-solve
> solved problems.

Swapping out is inherently asynchronous and batched because it was
designed for swapping to a device, yet you are claiming that the same
_unchanged_ interface is suitable for swap-to-hypervisor-RAM, while at
the same time saying that the block layer might need to be "optimized"
(apparently without code changes).  I'm not trying to re-solve a solved
problem; frontswap solves a NEW problem, with very little impact to
existing code.

> Swapping in is less simple, it is mostly synchronous (in some cases it
> isn't: with many threads, or with the preswap patches (IIRC unmerged)).
> You can always choose to copy if you don't have enough to justify dma.

Do you have a pointer to these preswap patches?
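P.S. On the copy-vs-DMA point above, here is a minimal userspace
sketch of what the synchronous, copy-based put amounts to: one
page-sized RAM-to-RAM copy performed during the call, after which the
guest's page frame is immediately reusable, with no I/O to queue and
no completion to wait for.  All names here are invented for
illustration; this is not the proposed kernel interface.

/* Sketch only: models a synchronous put as a single page-sized
 * RAM-to-RAM copy.  In real tmem the destination would be
 * hypervisor-owned memory reached via a hypercall, not a buffer the
 * guest can address.  All names are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stand-in for hypervisor-owned memory on the far side of the hypercall. */
static char hypervisor_copy[PAGE_SIZE];

/* Synchronous put: copy the page, then return.  No request queue,
 * no interrupt, no completion callback. */
static int sync_put_page(const void *guest_page)
{
	memcpy(hypervisor_copy, guest_page, PAGE_SIZE);
	return 0;   /* success: the guest's page frame may be reused immediately */
}

int main(void)
{
	char *page = malloc(PAGE_SIZE);

	if (!page)
		return 1;
	memset(page, 0xA5, PAGE_SIZE);

	if (sync_put_page(page) == 0)
		printf("page handed to 'hypervisor'; frame reusable immediately\n");

	free(page);
	return 0;
}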