Re: [PATCH V4 0/4] mm: frontswap: overview

Dan,

What is the plan for getting this upstream?  Are there some issues or objections that haven't been addressed?

--
Seth

Dan Magenheimer wrote:
> [PATCH V4 0/4] mm: frontswap: overview
> 
> Changes since V3:
> - Rebased to 2.6.39 (accommodates minor code movement in swapfile.c)
> 
> Changes since V2:
> - Rebased to 2.6.36-rc5 (main change: swap_info is now array of pointers)
> - Added set/end_page_writeback calls around page unlock on successful put
> - Changed frontswap_init to hide frontswap_poolid (which is cleancache/tmem
>   oddity that need not be exposed to frontswap)
> - Document and ensure PageLocked requirements are met (per Andrew Morton
>   feedback in cleancache thread)
> - Remove incorrect flags set/clear around partial swapoff call in
>   frontswap_shrink
> - Clarified code testing if frontswap is enabled
> - Add frontswap_register_ops interface to avoid an unnecessary global (per
>   Christoph Hellwig suggestion in cleancache thread)
> - Use standard success/fail codes (0/<0) (per Nitin Gupta feedback on
>   cleancache patch)
> - Added Documentation/vm/frontswap.txt including a FAQ (per Andrew Morton
>   feedback in cleancache thread)
> - Added Documentation/ABI/testing/sysfs-kernel-mm-frontswap to describe
>   sysfs usage (per Andrew Morton feedback in cleancache thread)
> - Minor static variable naming cleanup (per Jeremy Fitzhardinge feedback
>   in cleancache thread)
> 
> Changes since V1:
> - Rebased to 2.6.34 (no functional changes)
> - Convert to sane types (per Al Viro comment in cleancache thread)
> - Define some raw constants (Konrad Wilk)
> - Performance analysis shows significant advantage for frontswap's
>   synchronous page-at-a-time design (vs batched asynchronous speculated
>   as an alternative design).  See http://lkml.org/lkml/2010/5/20/314
> 
> This "frontswap" patchset provides a clean API to transcendent memory
> for swap pages; via this API, frontswap can provide "swap to RAM"
> functionality for any transcendent memory "driver" such as a Xen tmem,
> or in-kernel compression via zcache; frontswap also provides a nice interface
> for swapping to RAM on a remote system (RAMster) and for building
> pseudo-RAM devices such as on-memory-bus SSD or phase-change memory.
> 
> A more complete description of frontswap can be found in the introductory
> comment in Documentation/vm/frontswap.txt (in PATCH 2/4) which is included
> below for convenience.
> 
> Note that an earlier version of this patch is now shipping in OpenSuSE 11.2
> and will soon ship in a release of Oracle Enterprise Linux.  Underlying
> tmem technology is now shipping in Oracle VM 2.2 and Xen 4.0.
> 
> Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
> Reviewed-by: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
> 
>  Documentation/ABI/testing/sysfs-kernel-mm-frontswap |   16
>  Documentation/vm/frontswap.txt                      |  210 ++++++++++++
>  include/linux/frontswap.h                           |   86 +++++
>  include/linux/swap.h                                |    2
>  include/linux/swapfile.h                            |   13
>  mm/Kconfig                                          |   16
>  mm/Makefile                                         |    1
>  mm/frontswap.c                                      |  331 ++++++++++++++++++++
>  mm/page_io.c                                        |   12
>  mm/swapfile.c                                       |   58 ++-
>  10 files changed, 736 insertions(+), 9 deletions(-)
> 
> (following is a copy of Documentation/vm/frontswap.txt including a FAQ)
> 
> Frontswap provides a "transcendent memory" interface for swap pages.
> In some environments, dramatic performance savings may be obtained because
> swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
> 
> Frontswap is so named because it can be thought of as the opposite of
> a "backing" store for a swap device.  The storage is assumed to be
> a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
> to the requirements of transcendent memory (such as Xen's "tmem", or
> in-kernel compressed memory, aka "zcache", or future RAM-like devices);
> this pseudo-RAM device is not directly accessible or addressable by the
> kernel and is of unknown and possibly time-varying size.  The "device"
> links itself to frontswap by calling frontswap_register_ops to set the
> frontswap_ops funcs appropriately and the functions it provides must
> conform to certain policies as follows:
> 
> An "init" prepares the device to receive frontswap pages associated
> with the specified swap device number (aka "type").  A "put_page" will
> copy the page to transcendent memory and associate it with the type and
> offset associated with the page. A "get_page" will copy the page, if found,
> from transcendent memory into kernel memory, but will NOT remove the page
> from transcendent memory.  A "flush_page" will remove the page from
> transcendent memory and a "flush_area" will remove ALL pages associated
> with the swap type (e.g., like swapoff) and notify the "device" to refuse
> further puts with that swap type.
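
To make the contract above concrete, here is a minimal user-space sketch of a
backend. It is not the code from this patchset. The field names in the ops
table come from the description above; the function signatures, the toy_
prefix, the fixed pool sizes, and the particular errno values are assumptions
made for illustration. It follows the 0/<0 success/fail convention listed in
the V2 changelog.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_MAX_TYPES 2     /* swap devices ("types") this toy can track */
#define TOY_MAX_OFFSETS 8   /* page slots per swap device */

/* Hypothetical ops table: field names follow the text, signatures are assumed. */
struct frontswap_ops {
	void (*init)(unsigned type);
	int (*put_page)(unsigned type, unsigned long offset, const void *page);
	int (*get_page)(unsigned type, unsigned long offset, void *page);
	void (*flush_page)(unsigned type, unsigned long offset);
	void (*flush_area)(unsigned type);
};

static struct {
	bool enabled;                    /* set by init, cleared by flush_area */
	bool present[TOY_MAX_OFFSETS];
	unsigned char data[TOY_MAX_OFFSETS][TOY_PAGE_SIZE];
} pool[TOY_MAX_TYPES];

static void toy_init(unsigned type)
{
	if (type < TOY_MAX_TYPES) {
		memset(&pool[type], 0, sizeof(pool[type]));
		pool[type].enabled = true;
	}
}

static int toy_put_page(unsigned type, unsigned long offset, const void *page)
{
	if (type >= TOY_MAX_TYPES || offset >= TOY_MAX_OFFSETS || !pool[type].enabled)
		return -ENOMEM;          /* rejected: caller uses the real swap device */
	memcpy(pool[type].data[offset], page, TOY_PAGE_SIZE);
	pool[type].present[offset] = true;
	return 0;
}

static int toy_get_page(unsigned type, unsigned long offset, void *page)
{
	if (type >= TOY_MAX_TYPES || offset >= TOY_MAX_OFFSETS ||
	    !pool[type].present[offset])
		return -ENOENT;
	/* Copy out only; the page stays in the pool until it is flushed. */
	memcpy(page, pool[type].data[offset], TOY_PAGE_SIZE);
	return 0;
}

static void toy_flush_page(unsigned type, unsigned long offset)
{
	if (type < TOY_MAX_TYPES && offset < TOY_MAX_OFFSETS)
		pool[type].present[offset] = false;
}

static void toy_flush_area(unsigned type)
{
	/* Drops every page and clears "enabled", so later puts are refused. */
	if (type < TOY_MAX_TYPES)
		memset(&pool[type], 0, sizeof(pool[type]));
}

static const struct frontswap_ops toy_ops = {
	.init = toy_init,
	.put_page = toy_put_page,
	.get_page = toy_get_page,
	.flush_page = toy_flush_page,
	.flush_area = toy_flush_area,
};
```

A caller would go through the table, e.g. toy_ops.init(0) followed by
toy_ops.put_page(0, 3, buf). A real backend would hand such a table to
frontswap_register_ops, whose exact signature is not shown in this mail.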
> 
> Once a page is successfully put, a matching get on the page will always
> succeed.  So when the kernel finds itself in a situation where it needs
> to swap out a page, it first attempts to use frontswap.  If the put
> succeeds (returns 0), the data has been saved to transcendent memory, so
> a disk write and, if the data is later read back, a disk read are avoided.
> If a put fails (returns a negative value), transcendent memory has rejected
> the data, and the page can be written to swap as usual.
> 
> Note that if a page is put and the page already exists in transcendent memory
> (a "duplicate" put), either the put succeeds and the data is overwritten,
> or the put fails AND the page is flushed.  This ensures stale data may
> never be obtained from pseudo-RAM.
> 
> Monitoring and control of frontswap is done by sysfs files in the
> /sys/kernel/mm/frontswap directory.  The effectiveness of frontswap can
> be measured (across all swap devices) with:
> 
> curr_pages	- number of pages currently contained in frontswap
> failed_puts	- how many put attempts have failed
> gets		- how many gets were attempted (all should succeed)
> succ_puts	- how many put attempts have succeeded
> flushes		- how many flushes were attempted
> 
> The number can be reduced by root by writing an integer target to curr_pages,
> which results in a "partial swapoff", thus reducing the number of frontswap
> pages to that target if memory constraints permit.
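
The sysfs files above could be read and written from userland roughly as
follows. This is a sketch under assumptions: it supposes each file holds a
single decimal integer, and the helper names read_counter, write_target and
put_success_percent are hypothetical, not part of the patch.

```c
#include <stdio.h>

/*
 * Read one unsigned counter from a file such as
 * /sys/kernel/mm/frontswap/curr_pages.  Returns 0 on success, -1 on error.
 */
static int read_counter(const char *path, unsigned long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%lu", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Write an integer target to a file; writing to curr_pages is what the text
 * describes as requesting a "partial swapoff".  Needs root on a real system.
 */
static int write_target(const char *path, unsigned long target)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%lu\n", target) > 0);
	return (fclose(f) == 0 && ok) ? 0 : -1;
}

/* Percentage of put attempts that were accepted; 0 if none were made. */
static double put_success_percent(unsigned long succ_puts,
				  unsigned long failed_puts)
{
	unsigned long total = succ_puts + failed_puts;

	return total ? 100.0 * (double)succ_puts / (double)total : 0.0;
}
```

For example, reading succ_puts and failed_puts with read_counter and passing
both to put_success_percent gives a rough measure of how often the backend is
accepting pages.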
> 
> FAQ
> 
> 1) Where's the value?
> 
> When a workload starts swapping, performance falls through the floor.
> Frontswap significantly increases performance in many such workloads by
> providing a clean, dynamic interface to read and write swap pages to
> "transcendent" memory that is otherwise not directly addressable to the kernel.
> This interface is ideal when data is transformed to a different form
> and size (such as with compression) or secretly moved (as might be
> useful for write-balancing for some RAM-like devices).  Swap pages (and
> evicted page-cache pages) are a great use for this kind of slower-than-RAM-
> but-much-faster-than-disk "pseudo-RAM device" and the frontswap (and
> cleancache) interface to transcendent memory provides a nice way to read
> and write -- and indirectly "name" -- the pages.
> 
> In the virtual case, the whole point of virtualization is to statistically
> multiplex physical resources across the varying demands of multiple
> virtual machines.  This is really hard to do with RAM and efforts to do
> it well with no kernel changes have essentially failed (except in some
> well-publicized special-case workloads).  Frontswap -- and cleancache --
> with a fairly small impact on the kernel, provides a huge amount
> of flexibility for more dynamic, flexible RAM multiplexing.
> Specifically, the Xen Transcendent Memory backend allows otherwise
> "fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
> virtual machines, but the pages can be compressed and deduplicated to
> optimize RAM utilization.  And when guest OS's are induced to surrender
> underutilized RAM (e.g. with "self-ballooning"), sudden unexpected
> memory pressure may result in swapping; frontswap allows those pages
> to be swapped to and from hypervisor RAM if overall host system memory
> conditions allow.
> 
> 2) Sure there may be performance advantages in some situations, but
>    what's the space/time overhead of frontswap?
> 
> If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
> nothingness and the only overhead is a few extra bytes per swapon'ed
> swap device.  If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
> registers, there is one extra comparison of a global variable against
> zero for every swap page read or written.  If CONFIG_FRONTSWAP is enabled
> AND a frontswap backend registers AND the backend fails every "put"
> request (i.e. provides no memory despite claiming it might),
> CPU overhead is still negligible -- and since every frontswap fail
> precedes a swap page write-to-disk, the system is highly likely
> to be I/O bound and using a small fraction of a percent of a CPU
> will be irrelevant anyway.
> 
> As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
> registers, one bit is allocated for every swap page for every swap
> device that is swapon'd.  This is added to the EIGHT bits (which
> was sixteen until about 2.6.34) that the kernel already allocates
> for every swap page for every swap device that is swapon'd.  (Hugh
> Dickins has observed that frontswap could probably steal one of
> the existing eight bits, but let's worry about that minor optimization
> later.)  For very large swap disks (which are rare) on a standard
> 4K pagesize, this is 1MB per 32GB swap.
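
The 1MB figure can be checked with a few lines of arithmetic. The helper name
frontswap_map_bytes below is hypothetical; the only inputs are the one bit per
swap page and the 4K page size stated above.

```c
#include <stdint.h>

/* Bytes needed for a one-bit-per-page map covering swap_bytes of swap. */
static uint64_t frontswap_map_bytes(uint64_t swap_bytes, uint64_t page_size)
{
	uint64_t pages = swap_bytes / page_size;  /* one bit per swap page */

	return (pages + 7) / 8;                   /* round up to whole bytes */
}
```

For 32GB of swap and 4K pages this is 2^35 / 2^12 = 2^23 pages, hence 2^23
bits, or 2^20 bytes = 1MB. The existing eight bits per page cost eight times
that, 8MB, for the same device.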
> 
> 3) OK, how about a quick overview of what this frontswap patch does
>    in terms that a kernel hacker can grok?
> 
> Let's assume that a frontswap "backend" has registered during
> kernel initialization; this registration indicates that this
> frontswap backend has access to some "memory" that is not directly
> accessible by the kernel.  Exactly how much memory it provides is
> entirely dynamic and random.
> 
> Whenever a swap-device is swapon'd, frontswap_init() is called,
> passing the swap device number (aka "type") as a parameter.
> This notifies frontswap to expect attempts to "put" swap pages
> associated with that number.
> 
> Whenever the swap subsystem is readying a page to write to a swap
> device (cf. swap_writepage()), frontswap_put_page is called.  Frontswap
> consults with the frontswap backend and if the backend says
> it does NOT have room, frontswap_put_page fails (returns <0) and the page is
> swapped as normal.  Note that the response from the frontswap
> backend is essentially random; it may choose to never accept a
> page, it could accept every ninth page, or it might accept every
> page.  But if the backend does accept a page, the data from the page
> has already been copied and associated with the type and offset,
> and the backend guarantees the persistence of the data.  In this case,
> frontswap sets a bit in the "frontswap_map" for the swap device
> corresponding to the page offset on the swap device to which it would
> otherwise have written the data.
> 
> When the swap subsystem needs to swap-in a page (swap_readpage()),
> it first calls frontswap_get_page() which checks the frontswap_map to
> see if the page was earlier accepted by the frontswap backend.  If
> it was, the page of data is filled from the frontswap backend and
> the swap-in is complete.  If not, the normal swap-in code is
> executed to obtain the page of data from the real swap device.
> 
> So every time the frontswap backend accepts a page, a swap device write
> and (potentially) a swap device read are replaced by a "frontswap backend
> put" and (possibly) a "frontswap backend get", which are presumably much
> faster.
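
The swap-out and swap-in flow described in this answer can be modelled in user
space as below. This is a sketch, not the patch: only the name frontswap_map
comes from the text. The arrays standing in for the backend and the disk, the
helper names, and the "accept even offsets only" policy are all assumptions
chosen to keep the example small.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096
#define NR_SLOTS 16   /* slots on one toy swap device */

static unsigned char backend_mem[NR_SLOTS][SKETCH_PAGE_SIZE]; /* "transcendent" memory */
static unsigned char disk[NR_SLOTS][SKETCH_PAGE_SIZE];        /* the real swap device */
static unsigned char frontswap_map[NR_SLOTS / 8];             /* one bit per slot */
static unsigned long disk_writes, disk_reads;

/* Stand-in backend policy: accepts only even offsets (purely illustrative). */
static bool backend_put(unsigned long offset, const void *page)
{
	if (offset % 2)
		return false;
	memcpy(backend_mem[offset], page, SKETCH_PAGE_SIZE);
	return true;
}

static bool map_test(unsigned long offset)
{
	return (frontswap_map[offset / 8] & (1u << (offset % 8))) != 0;
}

/* Swap-out: try the backend first, fall back to the disk slot on rejection. */
static void swap_out(unsigned long offset, const void *page)
{
	if (backend_put(offset, page)) {
		frontswap_map[offset / 8] |= 1u << (offset % 8);
		return;                  /* disk write avoided */
	}
	/* Rejected: make sure the map does not point at an older copy. */
	frontswap_map[offset / 8] &= ~(1u << (offset % 8));
	memcpy(disk[offset], page, SKETCH_PAGE_SIZE);
	disk_writes++;
}

/* Swap-in: the map bit says whether the backend or the disk holds the page. */
static void swap_in(unsigned long offset, void *page)
{
	if (map_test(offset)) {
		memcpy(page, backend_mem[offset], SKETCH_PAGE_SIZE);
		return;                  /* disk read avoided */
	}
	memcpy(page, disk[offset], SKETCH_PAGE_SIZE);
	disk_reads++;
}
```

Swapping out offsets 4 and 5 in this model leaves one page in the backend and
one on the disk, so only the odd offset costs a disk write and, later, a disk
read.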
> 
> 4) Can't frontswap be configured as a "special" swap device that is
>    just higher priority than any real swap device (e.g. like zswap)?
> 
> No.  Recall that acceptance of any swap page by the frontswap
> backend is entirely unpredictable. This is critical to the definition
> of frontswap because it grants completely dynamic discretion to the
> backend.  But since any "put" might fail, there must always be a real
> slot on a real swap device to swap the page.  Thus frontswap must be
> implemented as a "shadow" to every swapon'd device with the potential
> capability of holding every page that the swap device might have held
> and the possibility that it might hold no pages at all.
> On the downside, this also means that frontswap cannot contain more
> pages than the total of swapon'd swap devices.  For example, if NO
> swap device is configured on some installation, frontswap is useless.
> 
> Further, frontswap is entirely synchronous whereas a real swap
> device is, by definition, asynchronous and uses block I/O.  The
> block I/O layer is not only unnecessary, but may perform "optimizations"
> that are inappropriate for a RAM-oriented device including delaying
> the write of some pages for a significant amount of time.
> Synchrony is required to ensure the dynamicity of the backend.
> 
> In a virtualized environment, the dynamicity allows the hypervisor
> (or host OS) to do "intelligent overcommit".  For example, it can
> choose to accept pages only until host-swapping might be imminent,
> then force guests to do their own swapping.
> 
> 5) Why this weird definition about "duplicate puts"?  If a page
>    has been previously successfully put, can't it always be
>    successfully overwritten?
> 
> Nearly always it can, but no, sometimes it cannot.  Consider an example
> where data is compressed and the original 4K page has been compressed
> to 1K.  Now an attempt is made to overwrite the page with data that
> is non-compressible and so would take the entire 4K.  But the backend
> has no more space.  In this case, the put must be rejected.  Whenever
> frontswap rejects a put that would overwrite, it also must flush
> the old data and ensure that it is no longer accessible.  Since the
> swap subsystem then writes the new data to the real swap device,
> this is the correct course of action to ensure coherency.
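
The compression example above can be sketched with a byte-budgeted pool. This
is an illustration only: the pool size, the entry layout and the names
put_compressed, flush_entry and lookup are hypothetical, and the "compressed
size" is passed in directly rather than computed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ENTRIES 4
#define POOL_BYTES 5120   /* total space the toy backend can hold */

struct entry {
	bool valid;
	size_t stored_len;    /* size after (imagined) compression */
};

static struct entry entries[NR_ENTRIES];
static size_t pool_used;

/* Drop an entry, if present, and return its space to the pool. */
static void flush_entry(unsigned long offset)
{
	if (entries[offset].valid) {
		pool_used -= entries[offset].stored_len;
		entries[offset].valid = false;
	}
}

/*
 * Put a page whose compressed size is stored_len.  On a duplicate put the
 * old copy is dropped first, so a failed put can never leave stale data.
 */
static int put_compressed(unsigned long offset, size_t stored_len)
{
	flush_entry(offset);                 /* old data is gone either way */
	if (pool_used + stored_len > POOL_BYTES)
		return -ENOMEM;              /* caller writes to the real swap device */
	entries[offset].valid = true;
	entries[offset].stored_len = stored_len;
	pool_used += stored_len;
	return 0;
}

static bool lookup(unsigned long offset)
{
	return entries[offset].valid;
}
```

With a 1K entry at offset 0 and a 4K entry at offset 1 the pool is full.
Overwriting offset 0 with 4K of incompressible data then fails, and a
subsequent lookup(0) finds nothing, which is the behaviour the answer above
requires.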
> 
> 6) What is frontswap_shrink for?
> 
> When the (non-frontswap) swap subsystem swaps out a page to a real
> swap device, that page is only taking up low-value pre-allocated disk
> space.  But if frontswap has placed a page in transcendent memory, that
> page may be taking up valuable real estate.  The frontswap_shrink
> routine allows a process outside of the swap subsystem (such as
> a userland service via the sysfs interface, or a kernel thread)
> to force pages out of the memory managed by frontswap and back into
> kernel-addressable memory.
> 
> 7) Why does the frontswap patch create the new include file swapfile.h?
> 
> The frontswap code depends on some swap-subsystem-internal data
> structures that have, over the years, moved back and forth between
> static and global.  This seemed a reasonable compromise:  Define
> them as global but declare them in a new include file that isn't
> included by the large number of source files that include swap.h.
> 
> Dan Magenheimer, last updated May 27, 2011
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
> 
> 