On Mon, Oct 18, 2021 at 10:06 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > > > In commit fda31c50292a ("signal: avoid double atomic counter > increments for user accounting") Linus made a clever optimization to > how rlimits and the struct user_struct. Unfortunately that > optimization does not work in the obvious way when moved to nested > rlimits. The problem is that the last decrement of the per user > namespace per user sigpending counter might also be the last decrement > of the sigpending counter in the parent user namespace as well. Which > means that simply freeing the leaf ucount in __free_sigqueue is not > enough. > > Maintain the optimization and handle the tricky cases by introducing > inc_rlimit_get_ucounts and dec_rlimit_put_ucounts. > > By moving the entire optimization into functions that perform all of > the work it becomes possible to ensure that every level is handled > properly. > > The new function inc_rlimit_get_ucounts returns 0 on failure to > increment the ucount. This is different than inc_rlimit_ucounts which > increments the ucounts and returns LONG_MAX if the ucount counter has > exceeded it's maximum or it wrapped (to indicate the counter needs to > decremented). > > I wish we had a single user to account all pending signals to across > all of the threads of a process so this complexity was not necessary > > Cc: stable@xxxxxxxxxxxxxxx > Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") > v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133 > Tested-by: Rune Kleveland <rune.kleveland@xxxxxxxxxxxx> > Tested-by: Yu Zhao <yuzhao@xxxxxxxxxx> > Tested-by: Jordan Glover <Golden_Miller83@xxxxxxxxxxxxx> > Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> > --- > > This is my version of the fix with all of the feedback rolled in. > I have tested it and believe this is ready to send out. > > If folks code take a once over and see if I have spotted things. > > For the people who are testing or have tested this I have added your > tested-by's please let me know if you mind. > > Eric Retested on the latest 5.15-rc6. This patch fixes the following crash: [ 3307.621443] ================================================================== [ 3307.628890] BUG: KASAN: use-after-free in dec_ucount+0x50/0xd8 [ 3307.634903] Write of size 8 at addr ffffff80a5e4c520 by task kworker/7:3/201 [ 3307.642149] [ 3307.643695] CPU: 7 PID: 201 Comm: kworker/7:3 Not tainted 5.15.0-rc6-lockdep+ #7 [ 3307.651291] Hardware name: Google Lazor (rev3+) with KB Backlight (DT) [ 3307.658001] Workqueue: events free_user_ns [ 3307.662231] Call trace: [ 3307.664750] dump_backtrace+0x0/0x42c [ 3307.668522] show_stack+0x24/0x30 [ 3307.671935] dump_stack_lvl+0xd0/0x100 [ 3307.675797] print_address_description+0x30/0x304 [ 3307.680646] kasan_report+0x190/0x1d8 [ 3307.684419] kasan_check_range+0x1ac/0x1bc [ 3307.688630] __kasan_check_write+0x44/0x54 [ 3307.692852] dec_ucount+0x50/0xd8 [ 3307.696266] free_user_ns+0x1b0/0x288 [ 3307.700036] process_one_work+0x7b4/0x1130 [ 3307.704251] worker_thread+0x800/0xcf4 [ 3307.708111] kthread+0x2a8/0x358 [ 3307.711437] ret_from_fork+0x10/0x20 [ 3307.715121] [ 3307.716664] Allocated by task 6564: [ 3307.720253] kasan_save_stack+0x38/0x68 [ 3307.724206] __kasan_kmalloc+0x9c/0xb8 [ 3307.728068] kmem_cache_alloc_trace+0x260/0x32c [ 3307.732729] alloc_ucounts+0x150/0x374 [ 3307.736589] set_cred_ucounts+0x178/0x240 [ 3307.740714] __sys_setresuid+0x31c/0x4f8 [ 3307.744754] __arm64_sys_setresuid+0x84/0x98 [ 3307.749153] invoke_syscall+0xcc/0x2bc [ 3307.753012] el0_svc_common+0x118/0x1ec [ 3307.756961] do_el0_svc_compat+0x50/0x60 [ 3307.761005] el0_svc_compat+0x5c/0xec [ 3307.764774] el0t_32_sync_handler+0xc0/0xf0 [ 3307.769083] el0t_32_sync+0x1a4/0x1a8 [ 3307.772852] [ 3307.774399] The buggy address belongs to the object at ffffff80a5e4c500 [ 3307.774399] which belongs to the cache kmalloc-256 of size 256 [ 3307.787250] The buggy address is located 32 bytes inside of [ 3307.787250] 256-byte region [ffffff80a5e4c500, ffffff80a5e4c600) [ 3307.799216] The buggy address belongs to the page: [ 3307.804141] page:fffffffe02979200 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffff80a5e4c100 pfn:0x125e48 [ 3307.815127] head:fffffffe02979200 order:3 compound_mapcount:0 compound_pincount:0 [ 3307.822808] flags: 0x8000000000010200(slab|head|zone=2) [ 3307.828187] raw: 8000000000010200 fffffffe0250ba08 fffffffe00f04a08 ffffff808000c980 [ 3307.836148] raw: ffffff80a5e4c100 0000000000200001 00000001ffffffff 0000000000000000 [ 3307.844104] page dumped because: kasan: bad access detected [ 3307.849837] [ 3307.851381] Memory state around the buggy address: [ 3307.856307] ffffff80a5e4c400: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 3307.863729] ffffff80a5e4c480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 3307.871146] >ffffff80a5e4c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 3307.878562] ^ [ 3307.882956] ffffff80a5e4c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 3307.890375] ffffff80a5e4c600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 3307.897800] ================================================================== > include/linux/user_namespace.h | 2 ++ > kernel/signal.c | 25 +++++------------ > kernel/ucount.c | 49 ++++++++++++++++++++++++++++++++++ > 3 files changed, 57 insertions(+), 19 deletions(-) > > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h > index eb70cabe6e7f..33a4240e6a6f 100644 > --- a/include/linux/user_namespace.h > +++ b/include/linux/user_namespace.h > @@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t > > long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v); > bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v); > +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type); > +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type); > bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max); > > static inline void set_rlimit_ucount_max(struct user_namespace *ns, > diff --git a/kernel/signal.c b/kernel/signal.c > index a3229add4455..13d2505a14a0 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -425,22 +425,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, > */ > rcu_read_lock(); > ucounts = task_ucounts(t); > - sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1); > - switch (sigpending) { > - case 1: > - if (likely(get_ucounts(ucounts))) > - break; > - fallthrough; > - case LONG_MAX: > - /* > - * we need to decrease the ucount in the userns tree on any > - * failure to avoid counts leaking. > - */ > - dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1); > - rcu_read_unlock(); > - return NULL; > - } > + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING); > rcu_read_unlock(); > + if (!sigpending) > + return NULL; > > if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) { > q = kmem_cache_alloc(sigqueue_cachep, gfp_flags); > @@ -449,8 +437,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, > } > > if (unlikely(q == NULL)) { > - if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) > - put_ucounts(ucounts); > + dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING); > } else { > INIT_LIST_HEAD(&q->list); > q->flags = sigqueue_flags; > @@ -463,8 +450,8 @@ static void __sigqueue_free(struct sigqueue *q) > { > if (q->flags & SIGQUEUE_PREALLOC) > return; > - if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) { > - put_ucounts(q->ucounts); > + if (q->ucounts) { > + dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING); > q->ucounts = NULL; > } > kmem_cache_free(sigqueue_cachep, q); > diff --git a/kernel/ucount.c b/kernel/ucount.c > index bb51849e6375..eb03f3c68375 100644 > --- a/kernel/ucount.c > +++ b/kernel/ucount.c > @@ -284,6 +284,55 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v) > return (new == 0); > } > > +static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts, > + struct ucounts *last, enum ucount_type type) > +{ > + struct ucounts *iter, *next; > + for (iter = ucounts; iter != last; iter = next) { > + long dec = atomic_long_add_return(-1, &iter->ucount[type]); > + WARN_ON_ONCE(dec < 0); > + next = iter->ns->ucounts; > + if (dec == 0) > + put_ucounts(iter); > + } > +} > + > +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type) > +{ > + do_dec_rlimit_put_ucounts(ucounts, NULL, type); > +} > + > +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type) > +{ > + /* Caller must hold a reference to ucounts */ > + struct ucounts *iter; > + long dec, ret = 0; > + > + for (iter = ucounts; iter; iter = iter->ns->ucounts) { > + long max = READ_ONCE(iter->ns->ucount_max[type]); > + long new = atomic_long_add_return(1, &iter->ucount[type]); > + if (new < 0 || new > max) > + goto unwind; > + if (iter == ucounts) > + ret = new; > + /* > + * Grab an extra ucount reference for the caller when > + * the rlimit count was previously 0. > + */ > + if (new != 1) > + continue; > + if (!get_ucounts(iter)) > + goto dec_unwind; > + } > + return ret; > +dec_unwind: > + dec = atomic_long_add_return(-1, &iter->ucount[type]); > + WARN_ON_ONCE(dec < 0); > +unwind: > + do_dec_rlimit_put_ucounts(ucounts, iter, type); > + return 0; > +} > + > bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max) > { > struct ucounts *iter; > -- > 2.20.1 >