Re: [PATCH bpf-next v2 2/6] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs

David Vernet <void@xxxxxxxxxxxxx> · Fri, 21 Oct 2022 22:02:05 -0500

On Fri, Oct 21, 2022 at 03:57:15PM -0700, Yonghong Song wrote:

[...]

> > > > > +	 * could be modifying the local_storage->list now.
> > > > > +	 * Thus, no elem can be added-to or deleted-from the
> > > > > +	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> > > > > +	 *
> > > > > +	 * It is racing with bpf_local_storage_map_free() alone
> > > > > +	 * when unlinking elem from the local_storage->list and
> > > > > +	 * the map's bucket->list.
> > > > > +	 */
> > > > > +	bpf_cgrp_storage_lock();
> > > > > +	raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > > > +	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > > > +		bpf_selem_unlink_map(selem);
> > > > > +		free_cgroup_storage =
> > > > > +			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > > 
> > > > This still requires a comment explaining why it's OK to overwrite
> > > > free_cgroup_storage with a previous value from calling
> > > > bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks like
> > > > a pretty weird programming pattern, and IMO doing this feels more
> > > > intentional and future-proof:
> > > > 
> > > > if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
> > > > 	free_cgroup_storage = true;
> > > 
> > > We have a comment a few lines below.
> > >    /* free_cgroup_storage should always be true as long as
> > >     * local_storage->list was non-empty.
> > >     */
> > >    if (free_cgroup_storage)
> > > 	kfree_rcu(local_storage, rcu);
> > 
> > IMO that comment doesn't provide much useful information -- it states an
> > assumption, but doesn't give a reason for it.
> > 
> > > I will add more explanation in the above code like
> > > 
> > > 	bpf_selem_unlink_map(selem);
> > > 	/* If local_storage list only have one element, the
> > > 	 * bpf_selem_unlink_storage_nolock() will return true.
> > > 	 * Otherwise, it will return false. The current loop iteration
> > > 	 * intends to remove all local storage. So the last iteration
> > > 	 * of the loop will set the free_cgroup_storage to true.
> > > 	 */
> > > 	free_cgroup_storage =
> > > 		bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > 
> > Thanks, this is the type of comment I was looking for.
> > 
> > Also, I realize this was copy-pasted from a number of other possible
> > locations in the codebase which are doing the same thing, but I still
> > think this pattern is an odd and brittle way to do this. We're relying
> > on an abstracted implementation detail of
> > bpf_selem_unlink_storage_nolock() for correctness, which IMO is a signal
> > that bpf_selem_unlink_storage_nolock() should probably be the one
> > invoking kfree_rcu() on behalf of callers in the first place.  It looks
> > like all of the callers end up calling kfree_rcu() on the struct
> > bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns true,
> > so can we just move the responsibility of freeing the local storage
> > object down into bpf_selem_unlink_storage_nolock() where it's unlinked?
> 
> We probably cannot do this. bpf_selem_unlink_storage_nolock()
> is inside the rcu_read_lock() region. We do kfree_rcu() outside
> the rcu_read_lock() region.

kfree_rcu() is non-blocking and is safe to invoke from within an RCU
read region. If you invoke it within an RCU read region, the object will
not be kfree'd until (at least) you exit the current read region, so I
believe that the net effect here should be the same whether it's done in
bpf_selem_unlink_storage_nolock(), or in the caller after the RCU read
region is exited.

> > IMO this can be done in a separate patch set, if we decide it's worth
> > doing at all.
> > 
> > > > 
> > > > > +	}
> > > > > +	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
> > > > > +	bpf_cgrp_storage_unlock();
> > > > > +	rcu_read_unlock();
> > > > > +
> > > > > +	/* free_cgroup_storage should always be true as long as
> > > > > +	 * local_storage->list was non-empty.
> > > > > +	 */
> > > > > +	if (free_cgroup_storage)
> > > > > +		kfree_rcu(local_storage, rcu);
> > > > > +}
> > > > > +
> > > > > +static struct bpf_local_storage_data *
> > > > > +cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
> > > > > +{
> > > > > +	struct bpf_local_storage *cgroup_storage;
> > > > > +	struct bpf_local_storage_map *smap;
> > > > > +
> > > > > +	cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
> > > > > +					       bpf_rcu_lock_held());
> > > > > +	if (!cgroup_storage)
> > > > > +		return NULL;
> > > > > +
> > > > > +	smap = (struct bpf_local_storage_map *)map;
> > > > > +	return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
> > > > > +}
> > > > > +
> > > > > +static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
> > > > > +{
> > > > > +	struct bpf_local_storage_data *sdata;
> > > > > +	struct cgroup *cgroup;
> > > > > +	int fd;
> > > > > +
> > > > > +	fd = *(int *)key;
> > > > > +	cgroup = cgroup_get_from_fd(fd);
> > > > > +	if (IS_ERR(cgroup))
> > > > > +		return ERR_CAST(cgroup);
> > > > > +
> > > > > +	bpf_cgrp_storage_lock();
> > > > > +	sdata = cgroup_storage_lookup(cgroup, map, true);
> > > > > +	bpf_cgrp_storage_unlock();
> > > > > +	cgroup_put(cgroup);
> > > > > +	return sdata ? sdata->data : NULL;
> > > > > +}
> > > > 
> > > > Stanislav pointed out in the v1 revision that there's a lot of very
> > > > similar logic in task storage, and I think you'd mentioned that you were
> > > > going to think about generalizing some of that. Have you had a chance to
> > > > consider?
> > > 
> > > It is hard to have a common function for
> > > lookup_elem/update_elem/delete_elem(). They are quite different as each
> > > heavily involves
> > > task/cgroup-specific functions.
> > 
> > Yes agreed, each implementation is acquiring their own references, and
> > finding the backing element in whatever way it was implemented, etc.
> > 
> > > but map_alloc and map_free could have common helpers.
> > 
> > Agreed, and many of the static functions that are invoked on those paths
> > such as bpf_cgrp_storage_free(), bpf_cgrp_storage_lock(), etc possibly
> > as well. In general this feels like something we could pretty easily
> > simplify using something like a structure with callbacks to implement
> > the pieces of logic that are specific to each local storage type, such
> > as getting the struct bpf_local_storage __rcu
> > * pointer from some context (e.g.  cgroup_storage_ptr()). It doesn't
> > necessarily need to block this change, but IMO we should clean this up
> > soon because a lot of this is nearly a 100% copy-paste of other local
> > storage implementations.
> 
> Further refactoring is possible. Martin is working to simplify the
> locking mechanism. We can wait for that done before doing refactoring.

Sounds great, thanks!

- David