Re: [PATCH v2 bpf-next 5/5] bpf: trampoline: support FTRACE_OPS_FL_SHARE_IPMODIFY

Song Liu <songliubraving@xxxxxx> · Thu, 7 Jul 2022 02:11:46 +0000

> On Jul 6, 2022, at 6:18 PM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> 
> On Thu, 7 Jul 2022 00:19:07 +0000
> Song Liu <songliubraving@xxxxxx> wrote:
> 
>>>> In this specific race condition, register_bpf() will succeed, as it already
>>>> got tr->mutex. But the IPMODIFY (livepatch) side will fail and retry.   
>>> 
>>> What else takes the tr->mutex ?  
>> 
>> tr->mutex is the local mutex for a single BPF trampoline, we only need to take
>> it when we make changes to the trampoline (add/remove fentry/fexit programs). 
>> 
>>> 
>>> If it preempts anything else taking that mutex, when this runs, then it
>>> needs to be careful.
>>> 
>>> You said this can happen when the live patch came first. This isn't racing
>>> against live patch, it's racing against anything that takes the tr->mutex
>>> and then adds a bpf trampoline to a location that has a live patch.  
>> 
>> There are a few scenarios here:
>> 1. Live patch is already applied, then a BPF trampoline is being registered 
>> to the same function. In bpf_trampoline_update(), register_fentry returns
>> -EAGAIN, and this will be resolved. 
> 
> Where will it be resolved?

bpf_trampoline_update() will set tr->indirect_call and goto again. Then the 
trampoline is re-prepared to be able to share with the IPMODIFY functions 
and register_fentry will succeed. 

> 
>> 
>> 2. BPF trampoline is already registered, then a live patch is being applied 
>> to the same function. The live patch code need to ask the bpf trampoline to
>> prepare the trampoline before live patch. This is done by calling 
>> bpf_tramp_ftrace_ops_func. 
>> 
>> 2.1 If nothing else is modifying the trampoline at the same time, 
>> bpf_tramp_ftrace_ops_func will succeed. 
>> 
>> 2.2 In rare cases, if something else is holding tr->mutex to make changes to 
>> the trampoline (add/remove fentry functions, etc.), mutex_trylock in 
>> bpf_tramp_ftrace_ops_func will fail, and live patch will fail. However, the 
>> change to BPF trampoline will still succeed. It is common for live patch to
>> retry, so we just need to try live patch again when no one is making changes 
>> to the BPF trampoline in parallel. 
> 
> If the live patch is going to try again, and the task doing the live
> patch is SCHED_FIFO, and the task holding the tr->mutex is SCHED_OTHER
> (or just a lower priority), then there is a chance that the live patch
> task preempted the tr->mutex owner, and let's say the tr->mutex owner
> is pinned to the CPU (by the user or whatever), then because the live
> patch task is in a loop trying to take that mutex, it will never let
> the owner continue.

Yeah, I got this scenario. I just don't think we should run live patch
with high priority. Well, maybe we shouldn't make such assumptions.  

> 
> Yes, this is a real scenario with trylock on mutexes. We hit it all the
> time in RT.
> 
>> 
>>> 
>>>> 
>>>> Since both livepatch and bpf trampoline changes are rare operations, I think 
>>>> the chance of the race condition is low enough. 
> 
> 
> A low race condition in a world that does this a billion times a day,
> ends up being not so rare.
> 
> I like to say, "I live in a world where the unlikely is very much likely!"
> 
> 
>>>> 
>>>> Does this make sense?
>>>> 
>>> 
>>> It's low, and if it is also a privileged operation then there's less to be
>>> concern about.  
>> 
>> Both live patch and BPF trampoline are privileged operations. 
> 
> This makes the issue less of an issue, but if there's an application
> that does this with setuid or something, there's a chance that it can
> be used by an attacker as well.
> 
>> 
>>> As if it is not, then we could have a way to deadlock the
>>> system. I'm more concerned that this will lead to a CVE than it just
>>> happening randomly. In other words, it only takes something that can run at
>>> a real-time priority to connect to a live patch location, and something
>>> that runs at a low priority to take a tr->mutex. If an attacker has both,
>>> then it can pin both to a CPU and then cause the deadlock to the system.
>>> 
>>> One hack to fix this is to add a msleep(1) in the failed case of the
>>> trylock. This will at least give the owner of the lock a millisecond to
>>> release it. This was what the RT patch use to do with spin_trylock() that
>>> was converted to a mutex (and we worked hard to remove all of them).  
>> 
>> The fix is really simple. But I still think we don't need it. We only hit
>> the trylock case for something with IPMODIFY. The non-privileged user 
>> should not be able to do that, right?
> 
> For now, perhaps. But what useful applications are there going to be in
> the future that performs these privileged operations on behalf of a
> non-privileged user?
> 
> In other words, if we can fix it now, we should, and avoid debugging
> this issue 5 years from now where it may take months to figure out.

Fair enough.. I guess we will just add the msleep(1) in the -EAGAIN
path. If this sounds good, I will send v3 with this change and more
comments. 

Thanks,
Song