On Thu, Apr 4, 2024 at 4:44 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires > > > <benjamin.tissoires@xxxxxxxxxx> wrote: > > > > > > goto out; > > > > > > } > > > > > > + spin_lock(&t->sleepable_lock); > > > > > > drop_prog_refcnt(t); > > > > > > + spin_unlock(&t->sleepable_lock); > > > > > > > > > > this also looks odd. > > > > > > > > I basically need to protect "t->prog = NULL;" from happening while > > > > bpf_timer_work_cb is setting up the bpf program to be run. > > > > > > Ok. I think I understand the race you're trying to fix. > > > The bpf_timer_cancel_and_free() is doing > > > cancel_work() > > > and proceeds with > > > kfree_rcu(t, rcu); > > > > > > That's the only race and these extra locks don't help. Thanks a lot for pinpointing the location of the race. Indeed, when I read your email this morning I said "of course, this was obvious" :( > > > > > > > The t->prog = NULL is nothing to worry about. > > > The bpf_timer_work_cb() might still see callback_fn == NULL > > > "when it's being setup" and it's ok. > > > These locks don't help that. > > > > > > I suggest to drop sleepable_lock everywhere. > > > READ_ONCE of callback_fn in bpf_timer_work_cb() is enough. > > > Add rcu_read_lock_trace() before calling bpf prog. > > > > > > The race to fix is above 'cancel_work + kfree_rcu' > > > since kfree_rcu might free 'struct bpf_hrtimer *t' > > > while the work is pending and work_queue internal > > > logic might UAF struct work_struct work. > > > By the time it may luckily enter bpf_timer_work_cb() it's too late. > > > The argument 'struct work_struct *work' might already be freed. > > > > > > To fix this problem, how about the following: > > > don't call kfree_rcu and instead queue the work to free it. > > > After cancel_work(&t->work); the work_struct can be reused. > > > So set it up to call "freeing callback" and do > > > schedule_work(&t->work); > > > > > > There is a big assumption here that new work won't be > > > executed before cancelled work completes. > > > Need to check with wq experts. > > > > > > Another approach is to do something smart with > > > cancel_work() return code. > > > If it returns true set a flag inside bpf_hrtimer and > > > make bpf_timer_work_cb() free(t) after bpf prog finishes. > > > > Looking through wq code... I think I have to correct myself. > > cancel_work and immediate free is probably fine from wq pov. > > It has this comment: > > worker->current_func(work); > > /* > > * While we must be careful to not use "work" after this, the trace > > * point will only record its address. > > */ > > trace_workqueue_execute_end(work, worker->current_func); > > > > the bpf_timer_work_cb() might still be running bpf prog. > > So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns, > > since kfree_rcu(t, rcu); could have freed it by then. > > There is also this code in net/rxrpc/rxperf.c > > cancel_work(&call->work); > > kfree(call); > > Correction to correction. > Above piece in rxrpc is buggy. > The following race is possible: > cpu 0 > process_one_work() > set_work_pool_and_clear_pending(work, pool->id, 0); > > cpu 1 > cancel_work() > kfree_rcu(work) > > worker->current_func(work); > > Here 'work' is a pointer to freed memory. > Though wq code will not be touching it, callback will UAF. > > Also what I proposed earlier as: > INIT_WORK(A); schedule_work(); cancel_work(); INIT_WORK(B); schedule_work(); > won't guarantee the ordering. > Since the callback function is different, > find_worker_executing_work() will consider it a separate work item. > > Another option is to to keep bpf_timer_work_cb callback > and add a 'bool free_me;' to struct bpf_hrtimer > and let the callback free it. > But it's also racy. > cancel_work() may return false, though worker->current_func(work) > wasn't called yet. > So we cannot set 'free_me' in bpf_timer_cancel_and_free() > in race free maner. > > After brainstorming with Tejun it seems the best is to use > another work_struct to call a different callback and do > cancel_work_sync() there. Works for me. I should be able to spina v6 soon enough, but I have a couple of remaining questions below: > > So we need something like: > > struct bpf_hrtimer { > union { > struct hrtimer timer; > + struct work_struct work; > }; > struct bpf_map *map; > struct bpf_prog *prog; > void __rcu *callback_fn; > void *value; > union { Are you sure we need an union here? If we get to call kfree_rcu() we need to have both struct rcu_head and sync_work, not one or the other. > struct rcu_head rcu; > + struct work_struct sync_work; > }; > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init() (like in my v2 or v3 IIRC). But that means that once a timer is initialized it needs to be of one or the other type (especially true with the first union in this struct). So should we reject during run time bpf_timer_set_callback() for sleepable timers and only allow bpf_timer_set_sleepable_cb() for those? (and the invert in the other case). This version of the patch allows for one timer to be used as softIRQ or WQ, depending on the timer_set_callback that is used. But it might be simpler for the kfree_rcu race to define the bpf_timer to be of one kind, so we are sure to call the correct kfree method. > }; > > 'work' will be used to call bpf_timer_work_cb. > 'sync_work' will be used to call cancel_work_sync() + kfree_rcu(). > > And, of course, > schedule_work(&t->sync_work); from bpf_timer_cancel_and_free() > instead of kfree_rcu. > Cheers, Benjamin