Re: [RFC bpf-next v4 4/6] bpf: pin, translate, and unpin __uptr from syscalls.

On 9/6/24 6:32 PM, Martin KaFai Lau wrote:
On 9/6/24 4:44 PM, Alexei Starovoitov wrote:
On Fri, Sep 6, 2024 at 1:11 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:

On 9/4/24 3:21 PM, Martin KaFai Lau wrote:
On 8/28/24 4:24 PM, Alexei Starovoitov wrote:
@@ -714,6 +869,11 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
                                  field->kptr.dtor(xchgd_field);
                          }
                          break;
+               case BPF_UPTR:
+                       if (*(void **)field_ptr)
+                               bpf_obj_unpin_uptr(field, *(void **)field_ptr);
+                       *(void **)field_ptr = NULL;
This one will be called from
   task_storage_delete->bpf_selem_free->bpf_obj_free_fields

and even if unpin was safe to do from that context
we cannot just do:
*(void **)field_ptr = NULL;

since a bpf prog might be running in parallel;
it could have just read that addr and now be using it.

The first thought of a way to fix this was to split
bpf_obj_free_fields() into the current one plus
bpf_obj_free_fields_after_gp()
that would do the above unpin bit,
and call the latter from bpf_selem_free_rcu(),
while bpf_obj_free_fields() called from bpf_selem_free()
would not touch uptr.
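(For illustration, a rough sketch of what that split could have looked like; this is not an actual patch. BPF_UPTR and bpf_obj_unpin_uptr() are from the patch under review, the rest is assumed:)

/* Rough sketch only: walk the btf_record again after the gp and unpin
 * just the uptr fields; everything else stays in bpf_obj_free_fields().
 */
void bpf_obj_free_fields_after_gp(const struct btf_record *rec, void *obj)
{
	const struct btf_field *field;
	void *field_ptr;
	int i;

	if (IS_ERR_OR_NULL(rec))
		return;

	for (i = 0; i < rec->cnt; i++) {
		field = &rec->fields[i];
		field_ptr = obj + field->offset;

		if (field->type != BPF_UPTR)
			continue;
		if (*(void **)field_ptr)
			bpf_obj_unpin_uptr(field, *(void **)field_ptr);
		*(void **)field_ptr = NULL;
	}
}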

But after digging further I realized that task_storage
already switched to use bpf_ma, so the above won't work.

So we need something similar to BPF_KPTR_REF logic:
xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
and then delay the uptr unpin for that address into call_rcu.

Any better ideas?


I think the existing reuse_now arg in bpf_selem_free() can be used. reuse_now
(renamed from the earlier use_trace_rcu) was added to avoid call_rcu_tasks_trace
for the common case.

selem (of type "struct bpf_local_storage_elem") is the one exposed to the bpf prog.

bpf_selem_free knows whether a selem can be reused immediately based on the
caller. It is currently flagged in the reuse_now arg: "bpf_selem_free(...., bool
reuse_now)".

If a selem cannot be reuse_now (i.e. == false), it is currently going through
"call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu)". We can do
unpin_user_page() in the rcu callback.

A selem can be reuse_now (i.e. == true) if the selem is no longer needed because
either its owner (i.e. the task_struct here) is going away in free_task() or the
bpf map is being destructed in bpf_local_storage_map_free(). No bpf prog should
have a hold on the selem at this point. I think for these two cases, the
unpin_user_page() can be directly called in bpf_selem_free().

but there is also this path:
bpf_task_storage_delete -> task_storage_delete -> bpf_selem_free
  -> bpf_obj_free_fields

In this case the bpf prog may still be looking at the uptr address,
and we cannot do the unpin right away in bpf_obj_free_fields().

Understood that we cannot unpin immediately in the bpf_task_storage_delete() path. task_storage can be used in sleepable progs, so it needs to wait for the tasks_trace and the regular rcu gp before unpinning.

I forgot to mention earlier that bpf_task_storage_delete() will call bpf_selem_free(..., reuse_now == false), which will then do "call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu);". The unpin could happen in bpf_selem_free_trace_rcu() in this case. I am suggesting to unpin in bpf_selem_free_trace_rcu() together with the selem free.

I just noticed the map and its btf_record are gone in bpf_selem_free_trace_rcu()... so won't work. :(

Thought about it more. Adding a rcu_barrier() to bpf_local_storage_map_free() may be enough. Then bpf_selem_free_rcu() will have the map->record to unpin.
Will need to think about it more.
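(Rough sketch of that ordering, for illustration only: bpf_obj_free_uptrs() is a hypothetical helper, the smap/record accesses are simplified, and an rcu_barrier() is assumed to run in bpf_local_storage_map_free() before the map is freed.)

static void bpf_selem_free_rcu(struct rcu_head *rcu)
{
	struct bpf_local_storage_elem *selem;
	struct bpf_local_storage_map *smap;

	selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
	smap = rcu_dereference_protected(SDATA(selem)->smap, true);

	/* smap->map.record is only still valid here if
	 * bpf_local_storage_map_free() does an rcu_barrier() before freeing
	 * the map, so no pending bpf_selem_free_rcu() can outlive it.
	 */
	bpf_obj_free_uptrs(smap->map.record, SDATA(selem)->data);
	bpf_mem_cache_raw_free(selem);
}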


All other special fields in the map value are ok,
since they either rely on bpf_mem_alloc and
have an rcu/rcu_tasks_trace gp,
or have extra indirection like timer/wq.

One existing bug: from looking at patch 6, I don't think the free_task() case
can be "reuse_now == true" anymore, because the bpf_task_release kfunc does not
mark the previously obtained task storage as invalid:

data_task = bpf_task_from_pid(parent_pid);
ptr = bpf_task_storage_get(&datamap, data_task, 0, ...);
bpf_task_release(data_task);
if (!ptr)
         return 0;
/* The prog still holds a valid task storage ptr. */
udata = ptr->udata;

It can be fixed by marking the ref_obj_id of the "ptr". Although it is more
correct to make the task storage "ptr" invalid after task_release, it may break
the existing progs.

Are you suggesting that bpf_task_release should invalidate all pointers
fetched from map value?

I was thinking at least the map value ptr itself needs to be invalidated.

That will work, but it's not an issue for the other special fields in there,
like kptr.
So this invalidation would be needed only for uptr, which feels
weird to special-case and will probably be confusing to users writing
such programs.

hmm... I haven't thought about the other pointer fields that were read before the task_release().

Agreed, it is hard to use if it only marks uptr invalid. Thinking about it, even marking the map value ptr invalid while other previously read fields keep working has the same weirdness.

The above bpf prog example should be ok to use.
We only need to delay the unpin until after an rcu/rcu_tasks_trace gp.
Hence my proposal that bpf_obj_free_fields() do:
  case BPF_UPTR:
    xchgd_field = (void *)xchg((unsigned long *)field_ptr, 0);
    call_rcu(...) to unpin.
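(Purely for illustration, a rough sketch of what that could look like. The wrapper struct, the helper name, and the extra allocation in the free path are all assumptions, not the actual patch.)

/* Hypothetical wrapper carrying the rcu_head; the map value itself cannot
 * hold it since the value may be reused before the gp elapses.
 */
struct uptr_unpin_rcu {
	struct rcu_head rcu;
	struct page *page;	/* assumes the pinned page is known here */
};

static void bpf_uptr_unpin_rcu_cb(struct rcu_head *rcu)
{
	struct uptr_unpin_rcu *u = container_of(rcu, struct uptr_unpin_rcu, rcu);

	unpin_user_page(u->page);
	kfree(u);
}

The BPF_UPTR case above would then allocate one of these and queue it with call_rcu() (or call_rcu_tasks_trace(), see below) instead of unpinning inline.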

Agree that call_rcu() here is the only option. It probably needs to go through the tasks_trace gp also.

Can the page->rcu_head be used here?
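(struct page does have an rcu_head in its union, so the shape would be something like the sketch below; whether that union member is actually free to use for a pinned user page is the question.)

static void bpf_uptr_unpin_page_rcu(struct rcu_head *rcu)
{
	struct page *page = container_of(rcu, struct page, rcu_head);

	unpin_user_page(page);
}

static void bpf_uptr_defer_unpin(struct page *page)
{
	/* called after xchg()ing the uptr out of the map value */
	call_rcu(&page->rcu_head, bpf_uptr_unpin_page_rcu);
}

That would avoid the extra allocation from the earlier sketch, at the cost of tying up the page's rcu_head until the callback runs.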





