Re: [PATCH bpf-next v2 2/6] bpf: Implement cgroup storage available to non-cgroup-attached bpf progs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 10/21/22 8:02 PM, David Vernet wrote:
On Fri, Oct 21, 2022 at 03:57:15PM -0700, Yonghong Song wrote:

[...]

+	 * could be modifying the local_storage->list now.
+	 * Thus, no elem can be added-to or deleted-from the
+	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
+	 *
+	 * It is racing with bpf_local_storage_map_free() alone
+	 * when unlinking elem from the local_storage->list and
+	 * the map's bucket->list.
+	 */
+	bpf_cgrp_storage_lock();
+	raw_spin_lock_irqsave(&local_storage->lock, flags);
+	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
+		bpf_selem_unlink_map(selem);
+		free_cgroup_storage =
+			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);

This still requires a comment explaining why it's OK to overwrite
free_cgroup_storage with a previous value from calling
bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks like
a pretty weird programming pattern, and IMO doing this feels more
intentional and future-proof:

if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
	free_cgroup_storage = true;

We have a comment a few lines below.
    /* free_cgroup_storage should always be true as long as
     * local_storage->list was non-empty.
     */
    if (free_cgroup_storage)
	kfree_rcu(local_storage, rcu);

IMO that comment doesn't provide much useful information -- it states an
assumption, but doesn't give a reason for it.

I will add more explanation in the above code like

	bpf_selem_unlink_map(selem);
	/* If local_storage list only have one element, the
	 * bpf_selem_unlink_storage_nolock() will return true.
	 * Otherwise, it will return false. The current loop iteration
	 * intends to remove all local storage. So the last iteration
	 * of the loop will set the free_cgroup_storage to true.
	 */
	free_cgroup_storage =
		bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);

Thanks, this is the type of comment I was looking for.

Also, I realize this was copy-pasted from a number of other possible
locations in the codebase which are doing the same thing, but I still
think this pattern is an odd and brittle way to do this. We're relying
on an abstracted implementation detail of
bpf_selem_unlink_storage_nolock() for correctness, which IMO is a signal
that bpf_selem_unlink_storage_nolock() should probably be the one
invoking kfree_rcu() on behalf of callers in the first place.  It looks
like all of the callers end up calling kfree_rcu() on the struct
bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns true,
so can we just move the responsibility of freeing the local storage
object down into bpf_selem_unlink_storage_nolock() where it's unlinked?

We probably cannot do this. bpf_selem_unlink_storage_nolock()
is inside the rcu_read_lock() region. We do kfree_rcu() outside
the rcu_read_lock() region.

kfree_rcu() is non-blocking and is safe to invoke from within an RCU
read region. If you invoke it within an RCU read region, the object will
not be kfree'd until (at least) you exit the current read region, so I
believe that the net effect here should be the same whether it's done in
bpf_selem_unlink_storage_nolock(), or in the caller after the RCU read
region is exited.

Okay. we probably still want to do kfree_rcu outside
bpf_selem_unlink_storage_nolock() as the function is to unlink storage
for a particular selem.

We could move
	if (free_cgroup_storage)
		kfree_rcu(local_storage, rcu);
immediately after hlist_for_each_entry_safe() loop.
But I think putting that 'if' statement after rcu_read_unlock() is
slightly better as it will not increase the code inside the lock region.


IMO this can be done in a separate patch set, if we decide it's worth
doing at all.


+	}
+	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+	bpf_cgrp_storage_unlock();
+	rcu_read_unlock();
+
+	/* free_cgroup_storage should always be true as long as
+	 * local_storage->list was non-empty.
+	 */
+	if (free_cgroup_storage)
+		kfree_rcu(local_storage, rcu);
+}
+
+static struct bpf_local_storage_data *
+cgroup_storage_lookup(struct cgroup *cgroup, struct bpf_map *map, bool cacheit_lockit)
+{
+	struct bpf_local_storage *cgroup_storage;
+	struct bpf_local_storage_map *smap;
+
+	cgroup_storage = rcu_dereference_check(cgroup->bpf_cgrp_storage,
+					       bpf_rcu_lock_held());
+	if (!cgroup_storage)
+		return NULL;
+
+	smap = (struct bpf_local_storage_map *)map;
+	return bpf_local_storage_lookup(cgroup_storage, smap, cacheit_lockit);
+}
+
+static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_local_storage_data *sdata;
+	struct cgroup *cgroup;
+	int fd;
+
+	fd = *(int *)key;
+	cgroup = cgroup_get_from_fd(fd);
+	if (IS_ERR(cgroup))
+		return ERR_CAST(cgroup);
+
+	bpf_cgrp_storage_lock();
+	sdata = cgroup_storage_lookup(cgroup, map, true);
+	bpf_cgrp_storage_unlock();
+	cgroup_put(cgroup);
+	return sdata ? sdata->data : NULL;
+}

Stanislav pointed out in the v1 revision that there's a lot of very
similar logic in task storage, and I think you'd mentioned that you were
going to think about generalizing some of that. Have you had a chance to
consider?

It is hard to have a common function for
lookup_elem/update_elem/delete_elem(). They are quite different as each
heavily involves
task/cgroup-specific functions.

Yes agreed, each implementation is acquiring their own references, and
finding the backing element in whatever way it was implemented, etc.

but map_alloc and map_free could have common helpers.

Agreed, and many of the static functions that are invoked on those paths
such as bpf_cgrp_storage_free(), bpf_cgrp_storage_lock(), etc possibly
as well. In general this feels like something we could pretty easily
simplify using something like a structure with callbacks to implement
the pieces of logic that are specific to each local storage type, such
as getting the struct bpf_local_storage __rcu
* pointer from some context (e.g.  cgroup_storage_ptr()). It doesn't
necessarily need to block this change, but IMO we should clean this up
soon because a lot of this is nearly a 100% copy-paste of other local
storage implementations.

Further refactoring is possible. Martin is working to simplify the
locking mechanism. We can wait for that done before doing refactoring.

Sounds great, thanks!

- David



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux