On Sun, Oct 23, 2022 at 09:45:35AM -0700, Yonghong Song wrote:
> > > > > > > +	 * could be modifying the local_storage->list now.
> > > > > > > +	 * Thus, no elem can be added-to or deleted-from the
> > > > > > > +	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
> > > > > > > +	 *
> > > > > > > +	 * It is racing with bpf_local_storage_map_free() alone
> > > > > > > +	 * when unlinking elem from the local_storage->list and
> > > > > > > +	 * the map's bucket->list.
> > > > > > > +	 */
> > > > > > > +	bpf_cgrp_storage_lock();
> > > > > > > +	raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > > > > > +	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
> > > > > > > +		bpf_selem_unlink_map(selem);
> > > > > > > +		free_cgroup_storage =
> > > > > > > +			bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > > > >
> > > > > > This still requires a comment explaining why it's OK to overwrite
> > > > > > free_cgroup_storage with a previous value from calling
> > > > > > bpf_selem_unlink_storage_nolock(). Even if that is safe, this looks
> > > > > > like a pretty weird programming pattern, and IMO doing this feels
> > > > > > more intentional and future-proof:
> > > > > >
> > > > > > if (bpf_selem_unlink_storage_nolock(local_storage, selem, false, false))
> > > > > > 	free_cgroup_storage = true;
> > > > >
> > > > > We have a comment a few lines below.
> > > > >	/* free_cgroup_storage should always be true as long as
> > > > >	 * local_storage->list was non-empty.
> > > > >	 */
> > > > >	if (free_cgroup_storage)
> > > > >		kfree_rcu(local_storage, rcu);
> > > >
> > > > IMO that comment doesn't provide much useful information -- it states an
> > > > assumption, but doesn't give a reason for it.
> > > >
> > > > > I will add more explanation in the above code like
> > > > >
> > > > >	bpf_selem_unlink_map(selem);
> > > > >	/* If local_storage list only have one element, the
> > > > >	 * bpf_selem_unlink_storage_nolock() will return true.
> > > > >	 * Otherwise, it will return false. The current loop iteration
> > > > >	 * intends to remove all local storage. So the last iteration
> > > > >	 * of the loop will set the free_cgroup_storage to true.
> > > > >	 */
> > > > >	free_cgroup_storage =
> > > > >		bpf_selem_unlink_storage_nolock(local_storage, selem, false, false);
> > > >
> > > > Thanks, this is the type of comment I was looking for.
> > > >
> > > > Also, I realize this was copy-pasted from a number of other possible
> > > > locations in the codebase which are doing the same thing, but I still
> > > > think this pattern is an odd and brittle way to do this. We're relying
> > > > on an abstracted implementation detail of
> > > > bpf_selem_unlink_storage_nolock() for correctness, which IMO is a signal
> > > > that bpf_selem_unlink_storage_nolock() should probably be the one
> > > > invoking kfree_rcu() on behalf of callers in the first place. It looks
> > > > like all of the callers end up calling kfree_rcu() on the struct
> > > > bpf_local_storage * if bpf_selem_unlink_storage_nolock() returns true,
> > > > so can we just move the responsibility of freeing the local storage
> > > > object down into bpf_selem_unlink_storage_nolock() where it's unlinked?
> > >
> > > We probably cannot do this. bpf_selem_unlink_storage_nolock()
> > > is inside the rcu_read_lock() region. We do kfree_rcu() outside
> > > the rcu_read_lock() region.
> >
> > kfree_rcu() is non-blocking and is safe to invoke from within an RCU
> > read region.
> > If you invoke it within an RCU read region, the object will
> > not be kfree'd until (at least) you exit the current read region, so I
> > believe that the net effect here should be the same whether it's done in
> > bpf_selem_unlink_storage_nolock(), or in the caller after the RCU read
> > region is exited.
>
> Okay. we probably still want to do kfree_rcu outside
> bpf_selem_unlink_storage_nolock() as the function is to unlink storage
> for a particular selem.

Meaning, it's for unlinking a specific element rather than the whole
list, so it's not the right place to free the larger struct
bpf_local_storage * container? If that's your point (and please clarify
if it's not and I'm misunderstanding) then I agree that's true, but
unfortunately, whether the API likes it or not, it's tied itself to the
lifetime of the larger struct bpf_local_storage * by returning a bool
that says whether the caller needs to free that local storage pointer.
AFAICT, with the current API / implementation, if the caller drops this
value on the floor, the struct bpf_local_storage * is leaked, which
means that it's a leaky API.

That being said, I think I agree with you that just moving kfree_rcu()
into bpf_selem_unlink_storage_nolock() may not be appropriate, but
overall it feels like this pattern / API has room for improvement. The
fact that the (now) only three callers of this function have
copy-pasted code that's doing the exact same thing to free the local
storage object is, in my opinion, a testament to that.

Anyways, none of that needs to block this patch set. I acked this in
your latest version, but I think this should be cleaned up by someone
in the near future; certainly before we add another local storage
variant.

> We could move
>	if (free_cgroup_storage)
>		kfree_rcu(local_storage, rcu);
> immediately after hlist_for_each_entry_safe() loop.
> But I think putting that 'if' statement after rcu_read_unlock() is
> slightly better as it will not increase the code inside the lock region.
Yeah, if it's not abstracted by the bpf_local_storage APIs, it might as
well just be freed outside of the critical section.

Thanks,
David