Re: [PATCH v2] mm: SLAB freelist randomization

Thomas Garnier <thgarnie@xxxxxxxxxx> · Wed, 20 Apr 2016 07:47:44 -0700

On Wed, Apr 20, 2016 at 1:08 AM, Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> wrote:
> On Tue, Apr 19, 2016 at 09:44:54AM -0700, Thomas Garnier wrote:
>> On Tue, Apr 19, 2016 at 12:15 AM, Joonsoo Kim <iamjoonsoo.kim@xxxxxxx> wrote:
>> > On Mon, Apr 18, 2016 at 10:14:39AM -0700, Thomas Garnier wrote:
>> >> Provides an optional config (CONFIG_FREELIST_RANDOM) to randomize the
>> >> SLAB freelist. The list is randomized during initialization of a new set
>> >> of pages. The order on different freelist sizes is pre-computed at boot
>> >> for performance. This security feature reduces the predictability of the
>> >> kernel SLAB allocator against heap overflows rendering attacks much less
>> >> stable.
>> >
>> > I'm not familiar on security but it doesn't look much secure than
>> > before. Is there any other way to generate different sequence of freelist
>> > for each new set of pages? Current approach using pre-computed array will
>> > generate same sequence of freelist for all new set of pages having same size
>> > class. Is it sufficient?
>> >
>>
>> I think it is sufficient. There is a tradeoff for performance. We could randomly
>> pick an object from the freelist every time (on slab_get_obj) but I
>> think it will
>> have significant impact (at least 3%).
>>
>> >> For example this attack against SLUB (also applicable against SLAB)
>> >> would be affected:
>> >> https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/
>> >>
>> >> Also, since v4.6 the freelist was moved at the end of the SLAB. It means
>> >> a controllable heap is opened to new attacks not yet publicly discussed.
>> >> A kernel heap overflow can be transformed to multiple use-after-free.
>> >> This feature makes this type of attack harder too.
>> >>
>> >> To generate entropy, we use get_random_bytes_arch because 0 bits of
>> >> entropy is available at that boot stage. In the worse case this function
>> >> will fallback to the get_random_bytes sub API.
>> >>
>> >> The config option name is not specific to the SLAB as this approach will
>> >> be extended to other allocators like SLUB.
>> >
>> > If this feature will be applied to the SLUB, it's better to put common
>> > code to mm/slab_common.c.
>> >
>>
>> I think it might be moved there once we implement the SLUB counterpart
>> but it is too early to define which part will be common.
>>
>> >>
>> >> Performance results highlighted no major changes:
>> >>
>> >> Netperf average on 10 runs:
>> >>
>> >> threads,base,change
>> >> 16,576943.10,585905.90 (101.55%)
>> >> 32,564082.00,569741.20 (101.00%)
>> >> 48,558334.30,561851.20 (100.63%)
>> >> 64,552025.20,556448.30 (100.80%)
>> >> 80,552294.40,551743.10 (99.90%)
>> >> 96,552435.30,547529.20 (99.11%)
>> >> 112,551320.60,550183.20 (99.79%)
>> >> 128,549138.30,550542.70 (100.26%)
>> >> 144,549344.50,544529.10 (99.12%)
>> >> 160,550360.80,539929.30 (98.10%)
>> >>
>> >> slab_test 1 run on boot. After is faster except for odd result on size
>> >> 2048.
>> >
>> > Hmm... It's odd result. It adds more logic and it should
>> > decrease performance. I guess it would be experimental error but
>> > do you have any analysis about this result?
>> >
>>
>> I don't. I am glad to redo the test. I found that slab_test has very different
>> result based on the heap state at the time of the test. If I run the
>> test multiple
>> times, I have really various results on with or without the mitigation (on
>> dedicated hardware).
>>
>> >>
>> >> Before:
>> >>
>> >> Single thread testing
>> >> =====================
>> >> 1. Kmalloc: Repeatedly allocate then free test
>> >> 10000 times kmalloc(8) -> 137 cycles kfree -> 126 cycles
>> >> 10000 times kmalloc(16) -> 118 cycles kfree -> 119 cycles
>> >> 10000 times kmalloc(32) -> 112 cycles kfree -> 119 cycles
>> >> 10000 times kmalloc(64) -> 126 cycles kfree -> 123 cycles
>> >> 10000 times kmalloc(128) -> 135 cycles kfree -> 131 cycles
>> >> 10000 times kmalloc(256) -> 165 cycles kfree -> 104 cycles
>> >> 10000 times kmalloc(512) -> 174 cycles kfree -> 126 cycles
>> >> 10000 times kmalloc(1024) -> 242 cycles kfree -> 160 cycles
>> >> 10000 times kmalloc(2048) -> 478 cycles kfree -> 239 cycles
>> >> 10000 times kmalloc(4096) -> 747 cycles kfree -> 364 cycles
>> >> 10000 times kmalloc(8192) -> 774 cycles kfree -> 404 cycles
>> >> 10000 times kmalloc(16384) -> 849 cycles kfree -> 430 cycles
>> >> 2. Kmalloc: alloc/free test
>> >> 10000 times kmalloc(8)/kfree -> 118 cycles
>> >> 10000 times kmalloc(16)/kfree -> 118 cycles
>> >> 10000 times kmalloc(32)/kfree -> 118 cycles
>> >> 10000 times kmalloc(64)/kfree -> 121 cycles
>> >> 10000 times kmalloc(128)/kfree -> 118 cycles
>> >> 10000 times kmalloc(256)/kfree -> 115 cycles
>> >> 10000 times kmalloc(512)/kfree -> 115 cycles
>> >> 10000 times kmalloc(1024)/kfree -> 115 cycles
>> >> 10000 times kmalloc(2048)/kfree -> 115 cycles
>> >> 10000 times kmalloc(4096)/kfree -> 115 cycles
>> >> 10000 times kmalloc(8192)/kfree -> 115 cycles
>> >> 10000 times kmalloc(16384)/kfree -> 115 cycles
>> >>
>> >> After:
>> >>
>> >> Single thread testing
>> >> =====================
>> >> 1. Kmalloc: Repeatedly allocate then free test
>> >> 10000 times kmalloc(8) -> 99 cycles kfree -> 84 cycles
>> >> 10000 times kmalloc(16) -> 88 cycles kfree -> 83 cycles
>> >> 10000 times kmalloc(32) -> 90 cycles kfree -> 81 cycles
>> >> 10000 times kmalloc(64) -> 107 cycles kfree -> 97 cycles
>> >> 10000 times kmalloc(128) -> 134 cycles kfree -> 89 cycles
>> >> 10000 times kmalloc(256) -> 145 cycles kfree -> 97 cycles
>> >> 10000 times kmalloc(512) -> 177 cycles kfree -> 116 cycles
>> >> 10000 times kmalloc(1024) -> 223 cycles kfree -> 151 cycles
>> >> 10000 times kmalloc(2048) -> 1429 cycles kfree -> 221 cycles
>> >> 10000 times kmalloc(4096) -> 720 cycles kfree -> 348 cycles
>> >> 10000 times kmalloc(8192) -> 788 cycles kfree -> 393 cycles
>> >> 10000 times kmalloc(16384) -> 867 cycles kfree -> 433 cycles
>> >> 2. Kmalloc: alloc/free test
>> >> 10000 times kmalloc(8)/kfree -> 115 cycles
>> >> 10000 times kmalloc(16)/kfree -> 115 cycles
>> >> 10000 times kmalloc(32)/kfree -> 115 cycles
>> >> 10000 times kmalloc(64)/kfree -> 120 cycles
>> >> 10000 times kmalloc(128)/kfree -> 127 cycles
>> >> 10000 times kmalloc(256)/kfree -> 119 cycles
>> >> 10000 times kmalloc(512)/kfree -> 112 cycles
>> >> 10000 times kmalloc(1024)/kfree -> 112 cycles
>> >> 10000 times kmalloc(2048)/kfree -> 112 cycles
>> >> 10000 times kmalloc(4096)/kfree -> 112 cycles
>> >> 10000 times kmalloc(8192)/kfree -> 112 cycles
>> >> 10000 times kmalloc(16384)/kfree -> 112 cycles
>> >>
>> >> Signed-off-by: Thomas Garnier <thgarnie@xxxxxxxxxx>
>> >> ---
>> >> Based on next-20160418
>> >> ---
>> >>  init/Kconfig |   9 ++++
>> >>  mm/slab.c    | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> >>  2 files changed, 174 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/init/Kconfig b/init/Kconfig
>> >> index 0dfd09d..ee35418 100644
>> >> --- a/init/Kconfig
>> >> +++ b/init/Kconfig
>> >> @@ -1742,6 +1742,15 @@ config SLOB
>> >>
>> >>  endchoice
>> >>
>> >> +config FREELIST_RANDOM
>> >> +     default n
>> >> +     depends on SLAB
>> >> +     bool "SLAB freelist randomization"
>> >> +     help
>> >> +       Randomizes the freelist order used on creating new SLABs. This
>> >> +       security feature reduces the predictability of the kernel slab
>> >> +       allocator against heap overflows.
>> >> +
>> >>  config SLUB_CPU_PARTIAL
>> >>       default y
>> >>       depends on SLUB && SMP
>> >> diff --git a/mm/slab.c b/mm/slab.c
>> >> index b70aabf..8371d80 100644
>> >> --- a/mm/slab.c
>> >> +++ b/mm/slab.c
>> >> @@ -116,6 +116,7 @@
>> >>  #include     <linux/kmemcheck.h>
>> >>  #include     <linux/memory.h>
>> >>  #include     <linux/prefetch.h>
>> >> +#include     <linux/log2.h>
>> >>
>> >>  #include     <net/sock.h>
>> >>
>> >> @@ -1229,6 +1230,62 @@ static void __init set_up_node(struct kmem_cache *cachep, int index)
>> >>       }
>> >>  }
>> >>
>> >> +#ifdef CONFIG_FREELIST_RANDOM
>> >> +/*
>> >> + * Master lists are pre-computed random lists
>> >> + * Lists of different sizes are used to optimize performance on SLABS with
>> >> + * different object counts.
>> >> + */
>> >
>> > If it is for optimization, it would be one option to have separate
>> > random list for each kmem_cache. It would consume more memory but it
>> > would be marginal. And, it provides more un-predictability and it can
>> > give better performance because we don't need state->type (more, less)
>> > and special handling related for it.
>> >
>>
>> I am not sur because major caches are created early at boot time. We still have
>> the same entropy problem and we are wasting a bit more memory. It will be faster
>
> I think that entropy problem is another issue. It should be considered
> separately. If it is solved, making per-computed array for each
> kmem_cache will provide more un-predictability. If someone who succeed to
> exploit some kmem_cache with 128 object per slab want to exploit
> another kmem_cache with 128 object per slab, this separate pre-computed array
> will be helpful.
>
>> on usage though but not sure it will be significant.
>
> I also think it's not significant. But, besides performance effect,
> code doesn't look very attractive and extendable. In case of SLUB,
> there is setup_slub_max_order option and object per slab could be larger
> than 256. To deal with it, we need to add many more static definition
> and it looks not good to me. Please use dynamic allocated memory
> instead of static array definition.
>

You don't need to. We wrap the list used (if you look at get_next_entry
we reset at pos 0 when we arrive to the list size).

I do think that the design will be better with a dedicated list per cache. Given
you seem fine with the memory differences, performance can only get better...

I will refactor for that on the next iteration.

>>
>> >> +static freelist_idx_t master_list_2[2];
>> >> +static freelist_idx_t master_list_4[4];
>> >> +static freelist_idx_t master_list_8[8];
>> >> +static freelist_idx_t master_list_16[16];
>> >> +static freelist_idx_t master_list_32[32];
>> >> +static freelist_idx_t master_list_64[64];
>> >> +static freelist_idx_t master_list_128[128];
>> >> +static freelist_idx_t master_list_256[256];
>> >> +const static struct m_list {
>> >> +     size_t count;
>> >> +     freelist_idx_t *list;
>> >> +} master_lists[] = {
>> >> +     { ARRAY_SIZE(master_list_2), master_list_2 },
>> >> +     { ARRAY_SIZE(master_list_4), master_list_4 },
>> >> +     { ARRAY_SIZE(master_list_8), master_list_8 },
>> >> +     { ARRAY_SIZE(master_list_16), master_list_16 },
>> >> +     { ARRAY_SIZE(master_list_32), master_list_32 },
>> >> +     { ARRAY_SIZE(master_list_64), master_list_64 },
>> >> +     { ARRAY_SIZE(master_list_128), master_list_128 },
>> >> +     { ARRAY_SIZE(master_list_256), master_list_256 },
>> >> +};
>> >> +
>> >> +/* Pre-compute the Freelist master lists at boot */
>> >> +static void __init freelist_random_init(void)
>> >> +{
>> >> +     unsigned int seed;
>> >> +     size_t z, i, rand;
>> >> +     struct rnd_state slab_rand;
>> >> +
>> >> +     get_random_bytes_arch(&seed, sizeof(seed));
>> >> +     prandom_seed_state(&slab_rand, seed);
>> >> +
>> >> +     for (z = 0; z < ARRAY_SIZE(master_lists); z++) {
>> >> +             for (i = 0; i < master_lists[z].count; i++)
>> >> +                     master_lists[z].list[i] = i;
>> >> +
>> >> +             /* Fisher-Yates shuffle */
>> >> +             for (i = master_lists[z].count - 1; i > 0; i--) {
>> >> +                     rand = prandom_u32_state(&slab_rand);
>> >> +                     rand %= (i + 1);
>> >> +                     swap(master_lists[z].list[i],
>> >> +                             master_lists[z].list[rand]);
>> >> +             }
>> >> +     }
>> >> +}
>> >> +#else
>> >> +static inline void __init freelist_random_init(void) { }
>> >> +#endif /* CONFIG_FREELIST_RANDOM */
>> >> +
>> >> +
>> >>  /*
>> >>   * Initialisation.  Called after the page allocator have been initialised and
>> >>   * before smp_init().
>> >> @@ -1255,6 +1312,8 @@ void __init kmem_cache_init(void)
>> >>       if (!slab_max_order_set && totalram_pages > (32 << 20) >> PAGE_SHIFT)
>> >>               slab_max_order = SLAB_MAX_ORDER_HI;
>> >>
>> >> +     freelist_random_init();
>> >> +
>> >>       /* Bootstrap is tricky, because several objects are allocated
>> >>        * from caches that do not exist yet:
>> >>        * 1) initialize the kmem_cache cache: it contains the struct
>> >> @@ -2442,6 +2501,107 @@ static void cache_init_objs_debug(struct kmem_cache *cachep, struct page *page)
>> >>  #endif
>> >>  }
>> >>
>> >> +#ifdef CONFIG_FREELIST_RANDOM
>> >> +/* Identify if the target freelist matches the pre-computed list */
>> >> +enum master_type {
>> >> +     match,
>> >> +     less,
>> >> +     more
>> >> +};
>> >> +
>> >> +/* Hold information during a freelist initialization */
>> >> +struct freelist_init_state {
>> >> +     unsigned int padding;
>> >> +     unsigned int pos;
>> >> +     unsigned int count;
>> >> +     struct m_list master_list;
>> >> +     unsigned int master_count;
>> >> +     enum master_type type;
>> >> +};
>> >> +
>> >> +/* Select the right pre-computed master list and initialize state */
>> >> +static void freelist_state_initialize(struct freelist_init_state *state,
>> >> +                                   unsigned int count)
>> >> +{
>> >> +     unsigned int idx;
>> >> +     const unsigned int last_idx = ARRAY_SIZE(master_lists) - 1;
>> >> +
>> >> +     memset(state, 0, sizeof(*state));
>> >> +     state->count = count;
>> >> +     state->pos = 0;
>> >
>> > Using pos = 0 here looks not good in terms of security. In this case,
>> > every new page having same size class have same sequence of freelist since boot.
>> >
>> > How about using random value to set pos? It provides some more randomness
>> > with minimal overhead.
>> >
>>
>> I think it is a good idea. I will add that for the next iteration.
>>
>> >> +     /* count is always >= 2 */
>> >> +     idx = ilog2(count) - 1;
>> >> +     if (idx >= last_idx)
>> >> +             idx = last_idx;
>> >> +     else if (roundup_pow_of_two(idx + 1) != count)
>> >> +             idx++;
>> >> +     state->master_list = master_lists[idx];
>> >> +     if (state->master_list.count == state->count)
>> >> +             state->type = match;
>> >> +     else if (state->master_list.count > state->count)
>> >> +             state->type = more;
>> >> +     else
>> >> +             state->type = less;
>> >> +}
>> >> +
>> >> +/* Get the next entry on the master list depending on the target list size */
>> >> +static freelist_idx_t get_next_entry(struct freelist_init_state *state)
>> >> +{
>> >> +     if (state->type == less && state->pos == state->master_list.count) {
>> >> +             state->padding += state->pos;
>> >> +             state->pos = 0;
>> >> +     }
>> >> +     BUG_ON(state->pos >= state->master_list.count);
>> >> +     return state->master_list.list[state->pos++];
>> >> +}
>> >> +
>> >> +static freelist_idx_t next_random_slot(struct freelist_init_state *state)
>> >> +{
>> >> +     freelist_idx_t cur, entry;
>> >> +
>> >> +     entry = get_next_entry(state);
>> >> +
>> >> +     if (state->type != match) {
>> >> +             while ((entry + state->padding) >= state->count)
>> >> +                     entry = get_next_entry(state);
>> >> +             cur = entry + state->padding;
>> >> +             BUG_ON(cur >= state->count);
>> >> +     } else {
>> >> +             cur = entry;
>> >> +     }
>> >> +
>> >> +     return cur;
>> >> +}
>> >> +
>> >> +/* Shuffle the freelist initialization state based on pre-computed lists */
>> >> +static void shuffle_freelist(struct kmem_cache *cachep, struct page *page,
>> >> +                          unsigned int count)
>> >> +{
>> >> +     unsigned int i;
>> >> +     struct freelist_init_state state;
>> >> +
>> >> +     if (count < 2) {
>> >> +             for (i = 0; i < count; i++)
>> >> +                     set_free_obj(page, i, i);
>> >> +             return;
>> >> +     }
>> >> +
>> >> +     /* Last chunk is used already in this case */
>> >> +     if (OBJFREELIST_SLAB(cachep))
>> >> +             count--;
>> >> +
>> >> +     freelist_state_initialize(&state, count);
>> >> +     for (i = 0; i < count; i++)
>> >> +             set_free_obj(page, i, next_random_slot(&state));
>> >> +
>> >> +     if (OBJFREELIST_SLAB(cachep))
>> >> +             set_free_obj(page, i, i);
>> >
>> > Please consider last object of OBJFREELIST_SLAB cache, too.
>> >
>> > freelist_state_init()
>> > last_obj = next_randome_slot()
>> > page->freelist = XXX
>> > for (i = 0; i < count - 1; i++)
>> >         set_free_obj()
>> > set_free_obj(last_obj);
>> >
>> > Thanks.
>> >
>>
>> The current implementation take the last chunk by default before the
>> freelist is initialized. Do you want it to be randomized as well?
>
> Yes.
>
> Thanks.
>
>>
>> >> +}
>> >> +#else
>> >> +static inline void shuffle_freelist(struct kmem_cache *cachep,
>> >> +                                 struct page *page, unsigned int count) { }
>> >> +#endif /* CONFIG_FREELIST_RANDOM */
>> >> +
>> >>  static void cache_init_objs(struct kmem_cache *cachep,
>> >>                           struct page *page)
>> >>  {
>> >> @@ -2464,8 +2624,12 @@ static void cache_init_objs(struct kmem_cache *cachep,
>> >>                       kasan_poison_object_data(cachep, objp);
>> >>               }
>> >>
>> >> -             set_free_obj(page, i, i);
>> >> +             /* If enabled, initialization is done in shuffle_freelist */
>> >> +             if (!config_enabled(CONFIG_FREELIST_RANDOM))
>> >> +                     set_free_obj(page, i, i);
>> >>       }
>> >> +
>> >> +     shuffle_freelist(cachep, page, cachep->num);
>> >>  }
>> >>
>> >>  static void kmem_flagcheck(struct kmem_cache *cachep, gfp_t flags)
>> >> --
>> >> 2.8.0.rc3.226.g39d4020
>> >>
>> >> --
>> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> >> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
>> >> see: http://www.linux-mm.org/ .
>> >> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>