Re: [PATCHv5 1/8] zsmalloc: add to mm/

On Tue, Feb 19, 2013 at 9:54 AM, Seth Jennings <sjenning@xxxxxxxxxxxxxxxxxx> wrote:
On 02/19/2013 03:18 AM, Joonsoo Kim wrote:
> Hello, Seth.
> I'm not sure this is the right time to review, because I have already
> seen many efforts by various people to promote the zxxx series. I don't
> want to be a blocker to promoting these. :)

Any time is good review time :)  Thanks for your review!

>
> But I have read the code now, so here are some comments below.
>
> On Wed, Feb 13, 2013 at 12:38:44PM -0600, Seth Jennings wrote:
>> =========
>> DO NOT MERGE, FOR REVIEW ONLY
>> This patch introduces zsmalloc as new code, however, it already
>> exists in drivers/staging.  In order to build successfully, you
>> must select EITHER the drivers/staging version OR this version.
>> Once zsmalloc is reviewed in this format (and hopefully accepted),
>> I will create a new patchset that properly promotes zsmalloc from
>> staging.
>> =========
>>
>> This patchset introduces a new slab-based memory allocator,
>> zsmalloc, for storing compressed pages.  It is designed for
>> low fragmentation and a high allocation success rate for
>> large, but <= PAGE_SIZE, allocations.
>>
>> zsmalloc differs from the kernel slab allocator in two primary
>> ways to achieve these design goals.
>>
>> zsmalloc never requires high order page allocations to back
>> slabs, or "size classes" in zsmalloc terms. Instead it allows
>> multiple single-order pages to be stitched together into a
>> "zspage" which backs the slab.  This allows for higher allocation
>> success rate under memory pressure.
>>
>> Also, zsmalloc allows objects to span page boundaries within the
>> zspage.  This allows for lower fragmentation than could be had
>> with the kernel slab allocator for objects between PAGE_SIZE/2
>> and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
>> to 60% of its original size, the memory savings gained through
>> compression are lost to fragmentation because another object of
>> the same size can't be stored in the leftover space.
>>
>> This ability to span pages results in zsmalloc allocations not being
>> directly addressable by the user.  The user is given a
>> non-dereferenceable handle in response to an allocation request.
>> That handle must be mapped, using zs_map_object(), which returns
>> a pointer to the mapped region that can be used.  The mapping is
>> necessary since the object data may reside in two different
>> non-contiguous pages.
>>
>> zsmalloc fulfills the allocation needs for zram and zswap.
>>
>> Acked-by: Nitin Gupta <ngupta@xxxxxxxxxx>
>> Acked-by: Minchan Kim <minchan@xxxxxxxxxx>
>> Signed-off-by: Seth Jennings <sjenning@xxxxxxxxxxxxxxxxxx>
>> ---
>>  include/linux/zsmalloc.h |   49 ++
>>  mm/Kconfig               |   24 +
>>  mm/Makefile              |    1 +
>>  mm/zsmalloc.c            | 1124 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 1198 insertions(+)
>>  create mode 100644 include/linux/zsmalloc.h
>>  create mode 100644 mm/zsmalloc.c
>>
>> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
>> new file mode 100644
>> index 0000000..eb6efb6
>> --- /dev/null
>> +++ b/include/linux/zsmalloc.h
>> @@ -0,0 +1,49 @@
>> +/*
>> + * zsmalloc memory allocator
>> + *
>> + * Copyright (C) 2011  Nitin Gupta
>> + *
>> + * This code is released using a dual license strategy: BSD/GPL
>> + * You can choose the license that better fits your requirements.
>> + *
>> + * Released under the terms of 3-clause BSD License
>> + * Released under the terms of GNU General Public License Version 2.0
>> + */
>> +
>> +#ifndef _ZS_MALLOC_H_
>> +#define _ZS_MALLOC_H_
>> +
>> +#include <linux/types.h>
>> +#include <linux/mm_types.h>
>> +
>> +/*
>> + * zsmalloc mapping modes
>> + *
>> + * NOTE: These only make a difference when a mapped object spans pages
>> +*/
>> +enum zs_mapmode {
>> +    ZS_MM_RW, /* normal read-write mapping */
>> +    ZS_MM_RO, /* read-only (no copy-out at unmap time) */
>> +    ZS_MM_WO /* write-only (no copy-in at map time) */
>> +};
>
>
> These make no difference for PGTABLE_MAPPING.
> Please add a comment about this.

Yes. Will do.
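
Something along these lines, perhaps (just a rough sketch of the wording,
not the final comment):

/*
 * NOTE: ZS_MM_RO and ZS_MM_WO only matter on the copy-based mapping
 * path, where they skip the copy-out at unmap time and the copy-in at
 * map time, respectively.  With CONFIG_PGTABLE_MAPPING the object is
 * mapped through page tables and nothing is copied, so all three modes
 * behave like ZS_MM_RW.
 */
enum zs_mapmode {
	ZS_MM_RW, /* normal read-write mapping */
	ZS_MM_RO, /* read-only (no copy-out at unmap time) */
	ZS_MM_WO  /* write-only (no copy-in at map time) */
};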

>
>> +struct zs_ops {
>> +    struct page * (*alloc)(gfp_t);
>> +    void (*free)(struct page *);
>> +};
>> +
>> +struct zs_pool;
>> +
>> +struct zs_pool *zs_create_pool(gfp_t flags, struct zs_ops *ops);
>> +void zs_destroy_pool(struct zs_pool *pool);
>> +
>> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
>> +void zs_free(struct zs_pool *pool, unsigned long obj);
>> +
>> +void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>> +                    enum zs_mapmode mm);
>> +void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
>> +
>> +u64 zs_get_total_size_bytes(struct zs_pool *pool);
>> +
>> +#endif
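
For anyone reading along, here is a minimal (untested) sketch of how a
user such as zram or zswap would drive the API declared above.  The
helper and function names are made up for illustration and error
handling is trimmed:

#include <linux/zsmalloc.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/errno.h>

/* illustrative ops: back zspages with plain 0-order page allocations */
static struct page *example_zs_alloc(gfp_t flags)
{
	return alloc_page(flags);
}

static void example_zs_free(struct page *page)
{
	__free_page(page);
}

static struct zs_ops example_zs_ops = {
	.alloc = example_zs_alloc,
	.free  = example_zs_free,
};

static int example_store(void *src, size_t len)
{
	struct zs_pool *pool;
	unsigned long handle;
	void *dst;

	pool = zs_create_pool(GFP_KERNEL, &example_zs_ops);
	if (!pool)
		return -ENOMEM;

	/* returns an opaque handle, not a dereferenceable pointer */
	handle = zs_malloc(pool, len, GFP_KERNEL);
	if (!handle) {
		zs_destroy_pool(pool);
		return -ENOMEM;
	}

	/* the object may span two pages, so map it before touching it */
	dst = zs_map_object(pool, handle, ZS_MM_WO);
	memcpy(dst, src, len);
	zs_unmap_object(pool, handle);

	zs_free(pool, handle);
	zs_destroy_pool(pool);
	return 0;
}
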
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 278e3ab..25b8f38 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -446,3 +446,27 @@ config FRONTSWAP
>>        and swap data is stored as normal on the matching swap device.
>>
>>        If unsure, say Y to enable frontswap.
>> +
>> +config ZSMALLOC
>> +    tristate "Memory allocator for compressed pages"
>> +    default n
>> +    help
>> +      zsmalloc is a slab-based memory allocator designed to store
>> +      compressed RAM pages.  zsmalloc uses virtual memory mapping
>> +      in order to reduce fragmentation.  However, this results in a
>> +      non-standard allocator interface where a handle, not a pointer, is
>> +      returned by an alloc().  This handle must be mapped in order to
>> +      access the allocated space.
>> +
>> +config PGTABLE_MAPPING
>> +    bool "Use page table mapping to access object in zsmalloc"
>> +    depends on ZSMALLOC
>> +    help
>> +      By default, zsmalloc uses a copy-based object mapping method to
>> +      access allocations that span two pages. However, if a particular
>> +      architecture (e.g., ARM) performs VM mapping faster than copying,
>> +      then you should select this. This causes zsmalloc to use page table
>> +      mapping rather than copying for object mapping.
>> +
>> +      You can check speed with zsmalloc benchmark[1].
>> +      [1] https://github.com/spartacus06/zsmalloc
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 3a46287..0f6ef0a 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
>>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>> +obj-$(CONFIG_ZSMALLOC)      += zsmalloc.o
>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> new file mode 100644
>> index 0000000..34378ef
>> --- /dev/null
>> +++ b/mm/zsmalloc.c
>> @@ -0,0 +1,1124 @@
>> +/*
>> + * zsmalloc memory allocator
>> + *
>> + * Copyright (C) 2011  Nitin Gupta
>> + *
>> + * This code is released using a dual license strategy: BSD/GPL
>> + * You can choose the license that better fits your requirements.
>> + *
>> + * Released under the terms of 3-clause BSD License
>> + * Released under the terms of GNU General Public License Version 2.0
>> + */
>> +
>> +
>> +/*
>> + * This allocator is designed for use with zcache and zram. Thus, the
>> + * allocator is supposed to work well under low memory conditions. In
>> + * particular, it never attempts higher order page allocation which is
>> + * very likely to fail under memory pressure. On the other hand, if we
>> + * just use single (0-order) pages, it would suffer from very high
>> + * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy
>> + * an entire page. This was one of the major issues with its predecessor
>> + * (xvmalloc).
>> + *
>> + * To overcome these issues, zsmalloc allocates a bunch of 0-order pages
>> + * and links them together using various 'struct page' fields. These linked
>> + * pages act as a single higher-order page i.e. an object can span 0-order
>> + * page boundaries. The code refers to these linked pages as a single entity
>> + * called zspage.
>> + *
>> + * For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE
>> + * since this satisfies the requirements of all its current users (in the
>> + * worst case, page is incompressible and is thus stored "as-is" i.e. in
>> + * uncompressed form). For allocation requests larger than this size, failure
>> + * is returned (see zs_malloc).
>> + *
>> + * Additionally, zs_malloc() does not return a dereferenceable pointer.
>> + * Instead, it returns an opaque handle (unsigned long) which encodes actual
>> + * location of the allocated object. The reason for this indirection is that
>> + * zsmalloc does not keep zspages permanently mapped since that would cause
>> + * issues on 32-bit systems where the VA region for kernel space mappings
>> + * is very small. So, before using the allocated memory, the object has to
>> + * be mapped using zs_map_object() to get a usable pointer and subsequently
>> + * unmapped using zs_unmap_object().
>> + *
>> + * Following is how we use various fields and flags of underlying
>> + * struct page(s) to form a zspage.
>> + *
>> + * Usage of struct page fields:
>> + *  page->first_page: points to the first component (0-order) page
>> + *  page->index (union with page->freelist): offset of the first object
>> + *          starting in this page. For the first page, this is
>> + *          always 0, so we use this field (aka freelist) to point
>> + *          to the first free object in zspage.
>> + *  page->lru: links together all component pages (except the first page)
>> + *          of a zspage
>> + *
>> + *  For _first_ page only:
>> + *
>> + *  page->private (union with page->first_page): refers to the
>> + *          component page after the first page
>> + *  page->freelist: points to the first free object in zspage.
>> + *          Free objects are linked together using in-place
>> + *          metadata.
>> + *  page->objects: maximum number of objects we can store in this
>> + *          zspage (class->zspage_order * PAGE_SIZE / class->size)
>
> How about just embedding the maximum number of objects in size_class?
> For SLUB, each slab can have a different number of objects.
> But for zsmalloc that is not possible, so there is no reason to
> maintain it within the metadata of the zspage. Embedding it in
> size_class is sufficient.
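
For illustration, roughly what that would look like (the field name is
made up here):

struct size_class {
	int size;
	unsigned int index;

	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
	int pages_per_zspage;

	/*
	 * Maximum number of objects a zspage of this class can hold;
	 * computed once at pool creation instead of being stored in
	 * page->objects for every zspage:
	 *   max_objects = pages_per_zspage * PAGE_SIZE / size
	 */
	unsigned int max_objects;

	spinlock_t lock;

	/* stats */
	u64 pages_allocated;

	struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
};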

Yes, a little code massaging and this can go away.

However, there might be some value in having variable sized zspages in
the same size_class.  It could improve allocation success rate at the
expense of efficiency by not failing in alloc_zspage() if we can't
allocate the optimal number of pages.  As long as we can allocate the
first page, then we can proceed.
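
Purely to illustrate the idea (not what the patch does today -- the
function name and the bookkeeping are made up, and ZS_MAX_PAGES_PER_ZSPAGE
is the internal limit zsmalloc.c already uses):

static struct page *alloc_zspage_tolerant(struct size_class *class,
					  struct zs_ops *ops, gfp_t flags)
{
	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
	int i, nr = 0;

	for (i = 0; i < class->pages_per_zspage; i++) {
		struct page *page = ops->alloc(flags);

		if (!page)
			break;
		pages[nr++] = page;
	}

	/* only the first page is strictly required */
	if (!nr)
		return NULL;

	/*
	 * Link pages[0..nr-1] into a zspage as alloc_zspage() does now,
	 * and record 'nr' (e.g. in the first page) so the per-zspage
	 * object capacity is computed from the actual number of pages,
	 * not from class->pages_per_zspage.
	 */
	return pages[0];
}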


Yes, I remember trying to allow partial failures and thus variable-sized
zspages within the same size class, but I skipped that since it seemed
non-trivial to do and the value wasn't clear: if the system cannot give
us 4 (non-contiguous) pages, then it's probably going to fail allocation
requests for single pages too in the not-so-distant future. Also, for
objects > PAGE_SIZE/2, falling back to a zspage of just PAGE_SIZE is not
going to help either.
 
 
>
>
>> + *  page->lru: links together first pages of various zspages.
>> + *          Basically forming list of zspages in a fullness group.
>> + *  page->mapping: class index and fullness group of the zspage
>> + *
>> + * Usage of struct page flags:
>> + *  PG_private: identifies the first component page
>> + *  PG_private2: identifies the last component page
>> + *
<snip>
 
>> + */
>> +struct size_class {
>> +    /*
>> +     * Size of objects stored in this class. Must be multiple
>> +     * of ZS_ALIGN.
>> +     */
>> +    int size;
>> +    unsigned int index;
>> +
>> +    /* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
>> +    int pages_per_zspage;
>> +
>> +    spinlock_t lock;
>> +
>> +    /* stats */
>> +    u64 pages_allocated;
>> +
>> +    struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
>> +};
>
> Instead of a simple pointer, how about using a list_head?
> With this, fullness_list management is easily consolidated into
> set_zspage_mapping(), and we can remove remove_zspage() and insert_zspage().
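
Roughly, the suggestion would look something like this (the helper name is
made up, and it assumes the first page's lru is initialised when the zspage
is created):

struct size_class {
	int size;
	unsigned int index;
	int pages_per_zspage;
	spinlock_t lock;
	u64 pages_allocated;

	/* one list per fullness group instead of bare first-page pointers */
	struct list_head fullness_list[_ZS_NR_FULLNESS_GROUPS];
};

/* moving a zspage between fullness groups becomes plain list surgery */
static void move_zspage(struct size_class *class, struct page *first_page,
			enum fullness_group fullness)
{
	list_del_init(&first_page->lru);
	list_add(&first_page->lru, &class->fullness_list[fullness]);
}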

Yes, this looks like a nice cleanup.
Still, I would hold such changes till we get out of staging.


Thanks for your comments, Kim.


Nitin

