On 4/26/19 11:04 AM, Martin Lau wrote:
> On Fri, Apr 26, 2019 at 10:27:16AM -0700, Yonghong Song wrote:
>>
>>
>> On 4/25/19 5:10 PM, Martin KaFai Lau wrote:
>>> After allowing a bpf prog to
>>> - directly read the skb->sk ptr
>>> - get the fullsock bpf_sock by "bpf_sk_fullsock()"
>>> - get the bpf_tcp_sock by "bpf_tcp_sock()"
>>> - get the listener sock by "bpf_get_listener_sock()"
>>> - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
>>>   into different bpf running contexts,
>>>
>>> this patch is another effort to make bpf's network programming
>>> more intuitive (together with memory and performance benefits).
>>>
>>> When a bpf prog needs to store data for a sk, the current practice is to
>>> define a map with the usual 4-tuple (src/dst ip/port) as the key.
>>> If multiple bpf progs need to store different sk data, multiple maps
>>> have to be defined.  Hence, memory is wasted storing the duplicated
>>> keys (i.e. the 4-tuple here) in each of the bpf maps.
>>> [ The smallest key could be the sk pointer itself, which requires
>>>   some enhancement in the verifier and is a separate topic. ]
>>>
>>> Also, the bpf prog needs to clean up the elem when the sk is freed.
>>> Otherwise, the bpf map will become full and unusable quickly.
>>> The sk-free tracking can currently be done during sk state
>>> transitions (e.g. BPF_SOCK_OPS_STATE_CB).
>>>
>>> The size of the map needs to be predefined, which usually ends up
>>> as an over-provisioned map in production.  Even if the map were
>>> re-sizable, since the sk naturally comes and goes away already, this
>>> potential re-size operation is arguably redundant if the data can be
>>> directly connected to the sk itself instead of proxy-ing through a
>>> bpf map.
>>>
>>> This patch introduces sk->sk_bpf_storage to provide local storage space
>>> at the sk for a bpf prog to use.  The space will be allocated when the
>>> first bpf prog creates data for this particular sk.
>>>
>>> The design optimizes the bpf prog's lookup (optionally followed by
>>> an inline update).  bpf_spin_lock should be used if the inline update
>>> needs to be protected.
>>>
>>> BPF_MAP_TYPE_SK_STORAGE:
>>> -----------------------
>>> To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
>>> this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
>>> be created to fit different bpf progs' needs.  The map enforces
>>> BTF to allow printing the sk-local-storage during a system-wide
>>> sk dump (e.g. "ss -ta") in the future.
>>>
>>> The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete
>>> a "sk-local-storage" data from a particular sk.
>>> Think of the map as the meta-data (or "type") of a "sk-local-storage".  This
>>> particular "type" of "sk-local-storage" data can then be stored in any sk.
>>>
>>> The main purposes of this map are:
>>> 1. Define the size of a "sk-local-storage" type.
>>> 2. Provide a similar syscall userspace API as other maps (e.g. lookup/update,
>>>    map-id, map-btf...etc.)
>>> 3. Keep track of all sk's storages of this "type" and clean them up
>>>    when the map is freed.
>>>
>>> sk->sk_bpf_storage:
>>> ------------------
>>> The main lookup/update/delete is done on sk->sk_bpf_storage (which
>>> is a "struct bpf_sk_storage").  When doing a lookup,
>>> the "map" pointer is now used as the "key" to search in the
>>> sk_storage->list.  The "map" pointer is actually serving
>>> as the "type" of the "sk-local-storage" that is being
>>> requested.
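
To make sure I am reading the "map as type" model correctly: on the bpf prog
side, defining one such "sk-local-storage" type would look roughly like the
sketch below.  (The names are mine, and I am using libbpf's BTF-style map
definition for brevity, so treat it purely as an illustration of the
constraints this map type enforces: a 32-bit int key, max_entries left at 0,
BPF_F_NO_PREALLOC, and BTF for the value.)

  /* One "sk-local-storage" type: a per-sk u32 packet counter. */
  struct {
          __uint(type, BPF_MAP_TYPE_SK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, __u32);
  } sk_pkt_cnt SEC(".maps");
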
>>>
>>> To allow very fast lookup, it should be as fast as looking up an
>>> array at a stable offset.  At the same time, it is not ideal to
>>> set a hard limit on the number of sk-local-storage "types" that the
>>> system can have.  Hence, this patch takes a cache approach.
>>> The last search result from sk_storage->list is cached in
>>> sk_storage->cache[], which is a fixed-size array.  Each
>>> "sk-local-storage" type has a stable offset into the cache[] array.
>>> In the future, a map flag could be introduced to do cache
>>> opt-out/enforcement if it becomes necessary.
>>>
>>> All data will be removed from sk->sk_bpf_storage during
>>> sk destruction.
>>>
>>> bpf_sk_storage_get() and bpf_sk_storage_delete():
>>> ------------------------------------------------
>>> Instead of using bpf_map_(lookup|update|delete)_elem(),
>>> the bpf prog needs to use the new helpers bpf_sk_storage_get() and
>>> bpf_sk_storage_delete().  The verifier can then enforce the
>>> ARG_PTR_TO_SOCKET argument.  bpf_sk_storage_get() also allows
>>> "creating" a new elem if one does not exist in the sk.  It is done by
>>> the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
>>> provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
>>> The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
>>> this has eliminated the potential use cases for an equivalent
>>> bpf_map_update_elem() API (for bpf progs) in this patch.
>>>
>>> Misc notes:
>>> ----------
>>> 1. map_get_next_key is not supported.  From the userspace syscall
>>>    perspective, the map has the socket fd as the key while the map
>>>    can be shared by pinned-file or map-id.
>>>
>>>    Since btf is enforced, the existing "ss" could be enhanced to pretty
>>>    print the local-storage.
>>>
>>>    Supporting a kernel defined btf with the 4-tuple as the returned key
>>>    could also be explored later.
>>>
>>> 2. The sk->sk_lock cannot be acquired.  Atomic operations are used
>>>    instead, e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
>>>    Please refer to the source code comments for the details on the
>>>    synchronization cases and considerations.
>>>
>>> 3. The mem is charged to sk->sk_omem_alloc as the sk filter does.
>>>
>>> Benchmark:
>>> ---------
>>> Here is the benchmark data collected by turning on
>>> the "kernel.bpf_stats_enabled" sysctl.
>>> Two bpf progs are tested:
>>>
>>> One bpf prog uses the usual bpf hashmap (max_entries = 8192) with the
>>> sk ptr as the key.  (The verifier is modified to support the sk ptr as
>>> the key, which should have shortened the key lookup time.)
>>>
>>> The other bpf prog uses the new BPF_MAP_TYPE_SK_STORAGE.
>>>
>>> Both store a "u32 cnt", do a lookup on "egress_skb/cgroup" for
>>> each egress skb and then bump the cnt.  netperf is used to drive
>>> data over 4096 connected UDP sockets.
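
Just to double-check the helper usage before the numbers: I imagine the
sk-storage flavor of the benchmark prog looks roughly like below.  (A minimal
sketch in my own words; "sk_pkt_cnt" and "bump_cnt" are made-up names, and it
assumes the bpf_sk_storage_get(map, sk, value, flags) signature described
above, with no initial value supplied.)

  SEC("cgroup_skb/egress")
  int bump_cnt(struct __sk_buff *skb)
  {
          struct bpf_sock *sk;
          __u32 *cnt;

          sk = skb->sk;
          if (!sk)
                  return 1;

          /* The helpers take a fullsock (ARG_PTR_TO_SOCKET). */
          sk = bpf_sk_fullsock(sk);
          if (!sk)
                  return 1;

          /* Create the per-sk u32 on first use; NULL initial value. */
          cnt = bpf_sk_storage_get(&sk_pkt_cnt, sk, 0,
                                   BPF_SK_STORAGE_GET_F_CREATE);
          if (cnt)
                  __sync_fetch_and_add(cnt, 1);

          return 1;       /* allow the skb */
  }
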
>>>
>>> BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
>>> 27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b  run_time_ns 58280107540 run_cnt 381347633
>>>     loaded_at 2019-04-15T13:46:39-0700  uid 0
>>>     xlated 344B  jited 258B  memlock 4096B  map_ids 16
>>>     btf_id 5
>>>
>>> BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
>>> 30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6  run_time_ns 25617093319 run_cnt 390989739
>>>     loaded_at 2019-04-15T13:47:54-0700  uid 0
>>>     xlated 168B  jited 156B  memlock 4096B  map_ids 17
>>>     btf_id 6
>>>
>>> Here is a high-level picture of how the objects are organized:
>>>
>>>     sk
>>>  +------+
>>>  |      |
>>>  |      |
>>>  |      |
>>>  |*sk_bpf_storage-----> bpf_sk_storage
>>>  +------+                  +-------+
>>>               +------------+ list  |
>>>               |            |       |
>>>               |            |       |
>>>               |            |       |
>>>               |            +-------+
>>>               |
>>>               |      elem
>>>               |   +--------+
>>>               +-> | snode  |
>>>               |   +--------+
>>>               |   |  data  |           bpf_map
>>>               |   +--------+         +---------+
>>>               |   |map_node| <-+-----+  list   |
>>>               |   +--------+   |     |         |
>>>               |                |     |         |
>>>               |      elem      |     |         |
>>>               |   +--------+   |     +---------+
>>>               +-> | snode  |   |
>>>                   +--------+   |
>>>     bpf_map       |  data  |   |
>>>  +---------+      +--------+   |
>>>  |  list   +----->|map_node|   |
>>>  |         |      +--------+   |
>>>  |         |                   |
>>>  |         |        elem       |
>>>  +---------+      +--------+   |
>>>               +-> | snode  |   |
>>>               |   +--------+   |
>>>               |   |  data  |   |
>>>               |   +--------+   |
>>>               |   |map_node| <-+
>>>               |   +--------+
>>>               |
>>>               |
>>>               |   +-------+
>>>     sk        +---+ list  |
>>>  +------+         |       |
>>>  |      |         |       |
>>>  |      |         |       |
>>>  |      |         +-------+
>>>  |*sk_bpf_storage------->bpf_sk_storage
>>>  +------+
>>>
>>> Signed-off-by: Martin KaFai Lau <kafai@xxxxxx>
>>
>> Thanks. Generally looks good. I have a few comments below.
>>
>>> ---
>>>  include/linux/bpf.h          |   2 +
>>>  include/linux/bpf_types.h    |   1 +
>>>  include/net/bpf_sk_storage.h |  13 +
>>>  include/net/sock.h           |   5 +
>>>  include/uapi/linux/bpf.h     |  44 +-
>>>  kernel/bpf/syscall.c         |   3 +-
>>>  kernel/bpf/verifier.c        |  27 +-
>>>  net/bpf/test_run.c           |   2 +
>>>  net/core/Makefile            |   1 +
>>>  net/core/bpf_sk_storage.c    | 796 +++++++++++++++++++++++++++++++++++
>>>  net/core/filter.c            |  12 +
>>>  net/core/sock.c              |   5 +
>>>  12 files changed, 906 insertions(+), 5 deletions(-)
>>>  create mode 100644 include/net/bpf_sk_storage.h
>>>  create mode 100644 net/core/bpf_sk_storage.c
>>>
>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>> index f15432d90728..0d2788beacd2 100644
>>> --- a/include/linux/bpf.h
>>> +++ b/include/linux/bpf.h
>>> @@ -184,6 +184,7 @@ enum bpf_arg_type {
>>>  	ARG_PTR_TO_MAP_KEY,	/* pointer to stack used as map key */
>>>  	ARG_PTR_TO_MAP_VALUE,	/* pointer to stack used as map value */
>>>  	ARG_PTR_TO_UNINIT_MAP_VALUE,	/* pointer to valid memory used to store a map value */
>>> +	ARG_PTR_TO_MAP_VALUE_OR_NULL,	/* pointer to stack used as map value or NULL */
>>>
>>>  	/* the following constraints used to prototype bpf_memcmp() and other
>>>  	 * functions that access data on eBPF program stack
>>> @@ -204,6 +205,7 @@ enum bpf_arg_type {
>>>  	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */
>>>  	ARG_PTR_TO_INT,		/* pointer to int */
>>>  	ARG_PTR_TO_LONG,	/* pointer to long */
>>> +	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock (fullsock) */
>>>  };
>>>
>>>  /* type of values returned from helper functions */
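
(Mostly for my own understanding of the two new arg types above: I assume the
get helper's proto later in the patch wires them up roughly as below, i.e. the
sk must be a fullsock and the initial value is optional.  Just guessing the
shape from this hunk; please correct me if the actual proto differs.)

  const struct bpf_func_proto bpf_sk_storage_get_proto = {
          .func           = bpf_sk_storage_get,
          .gpl_only       = false,
          .ret_type       = RET_PTR_TO_MAP_VALUE_OR_NULL,
          .arg1_type      = ARG_CONST_MAP_PTR,
          .arg2_type      = ARG_PTR_TO_SOCKET,
          .arg3_type      = ARG_PTR_TO_MAP_VALUE_OR_NULL,
          .arg4_type      = ARG_ANYTHING,
  };
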
>>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>>> index d26991a16894..eb1e4f867a5e 100644
>>> --- a/include/linux/bpf_types.h
>>> +++ b/include/linux/bpf_types.h
>>> @@ -60,6 +60,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
>>>  BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
>>>  #ifdef CONFIG_NET
>>>  BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
>>> +BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops)
>>>  #if defined(CONFIG_BPF_STREAM_PARSER)
>>>  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
>>>  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
>>> diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h
>>> new file mode 100644
>>> index 000000000000..b9dcb02e756b
>>> --- /dev/null
>>> +++ b/include/net/bpf_sk_storage.h
>>> @@ -0,0 +1,13 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/* Copyright (c) 2019 Facebook */
>>> +#ifndef _BPF_SK_STORAGE_H
>>> +#define _BPF_SK_STORAGE_H
>>> +
>>> +struct sock;
>>> +
>>> +void bpf_sk_storage_free(struct sock *sk);
>>> +
>>> +extern const struct bpf_func_proto bpf_sk_storage_get_proto;
>>> +extern const struct bpf_func_proto bpf_sk_storage_delete_proto;
>>> +
>>> +#endif /* _BPF_SK_STORAGE_H */
>> [...]
>>> +
>>> +static struct bpf_sk_storage_data *
>>> +__sk_storage_lookup(struct bpf_sk_storage *sk_storage,
>>> +		    struct bpf_sk_storage_map *smap,
>>> +		    bool cacheit_lockit)
>>> +{
>>> +	struct bpf_sk_storage_data *sdata;
>>> +	struct bpf_sk_storage_elem *selem;
>>> +
>>> +	/* Fast path (cache hit) */
>>> +	sdata = rcu_dereference(sk_storage->cache[smap->cache_idx]);
>>> +	if (sdata && rcu_access_pointer(sdata->smap) == smap)
>>> +		return sdata;
>>> +
>>> +	/* Slow path (cache miss) */
>>> +	hlist_for_each_entry_rcu(selem, &sk_storage->list, snode)
>>> +		if (rcu_access_pointer(SDATA(selem)->smap) == smap)
>>> +			break;
>>> +
>>> +	if (!selem)
>>> +		return NULL;
>>> +
>>> +	sdata = SDATA(selem);
>>> +	if (cacheit_lockit) {
>>> +		/* spinlock is needed to avoid racing with the
>>> +		 * parallel delete.  Otherwise, publishing an already
>>> +		 * deleted sdata to the cache will become a use-after-free
>>> +		 * problem in the next __sk_storage_lookup().
>>> +		 */
>>> +		raw_spin_lock_bh(&sk_storage->lock);
>>> +		if (selem_linked_to_sk(selem))
>>> +			rcu_assign_pointer(sk_storage->cache[smap->cache_idx],
>>> +					   sdata);
>>> +		raw_spin_unlock_bh(&sk_storage->lock);
>>> +	}
>>> +
>>> +	return sdata;
>>> +}
>>> +
>>> +static struct bpf_sk_storage_data *
>>> +sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit)
>>> +{
>>> +	struct bpf_sk_storage *sk_storage;
>>> +	struct bpf_sk_storage_map *smap;
>>> +
>>> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
>>> +	if (!sk_storage)
>>> +		return NULL;
>>> +
>>> +	smap = (struct bpf_sk_storage_map *)map;
>>> +	return __sk_storage_lookup(sk_storage, smap, cacheit_lockit);
>>> +}
>>> +
>>> +static int check_flags(const struct bpf_sk_storage_data *old_sdata,
>>> +		       u64 map_flags)
>>> +{
>>> +	if (old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
>>> +		/* elem already exists */
>>> +		return -EEXIST;
>>> +
>>> +	if (!old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
>>> +		/* elem doesn't exist, cannot update it */
>>> +		return -ENOENT;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int sk_storage_alloc(struct sock *sk,
>>> +			    struct bpf_sk_storage_map *smap,
>>> +			    struct bpf_sk_storage_elem *first_selem)
>>> +{
>>> +	struct bpf_sk_storage *prev_sk_storage, *sk_storage;
>>> +	int err;
>>> +
>>> +	err = omem_charge(sk, sizeof(*sk_storage));
>>> +	if (err)
>>> +		return err;
>>> +
>>> +	sk_storage = kzalloc(sizeof(*sk_storage), GFP_ATOMIC | __GFP_NOWARN);
>>> +	if (!sk_storage) {
>>> +		err = -ENOMEM;
>>> +		goto uncharge;
>>> +	}
>>> +	INIT_HLIST_HEAD(&sk_storage->list);
>>> +	raw_spin_lock_init(&sk_storage->lock);
>>> +	sk_storage->sk = sk;
>>> +
>>> +	__selem_link_sk(sk_storage, first_selem);
>>> +	selem_link_map(smap, first_selem);
>>> +	/* Publish sk_storage to sk.  sk->sk_lock cannot be acquired.
>>> +	 * Hence, atomic ops is used to set sk->sk_bpf_storage
>>> +	 * from NULL to the newly allocated sk_storage ptr.
>>> +	 *
>>> +	 * From now on, the sk->sk_bpf_storage pointer is protected
>>> +	 * by the sk_storage->lock.  Hence, when freeing
>>> +	 * the sk->sk_bpf_storage, the sk_storage->lock must
>>> +	 * be held before setting sk->sk_bpf_storage to NULL.
>>> +	 */
>>> +	prev_sk_storage = cmpxchg((struct bpf_sk_storage **)&sk->sk_bpf_storage,
>>> +				  NULL, sk_storage);
>>> +	if (unlikely(prev_sk_storage)) {
>>> +		selem_unlink_map(first_selem);
>>> +		err = -EAGAIN;
>>> +		goto uncharge;
>>> +
>>> +		/* Note that even first_selem was linked to smap's
>>> +		 * bucket->list, first_selem can be freed immediately
>>> +		 * (instead of kfree_rcu) because
>>> +		 * bpf_sk_storage_map_free() does a
>>> +		 * synchronize_rcu() before walking the bucket->list.
>>> +		 * Hence, no one is accessing selem from the
>>> +		 * bucket->list under rcu_read_lock().
>>> +		 */
>>> +	}
>>> +
>>> +	return 0;
>>> +
>>> +uncharge:
>>> +	kfree(sk_storage);
>>> +	atomic_sub(sizeof(*sk_storage), &sk->sk_omem_alloc);
>>> +	return err;
>>> +}
>>> +
>>> +/* sk cannot be going away because it is linking new elem
>>> + * to sk->sk_bpf_storage (i.e. sk->sk_refcnt cannot be 0).
>>> + * Otherwise, it will become a leak (and other memory issues
>>> + * during map destruction).
>>> + */
>>> +static struct bpf_sk_storage_data *sk_storage_update(struct sock *sk,
>>> +						      struct bpf_map *map,
>>> +						      void *value,
>>> +						      u64 map_flags)
>>> +{
>>> +	struct bpf_sk_storage_data *old_sdata = NULL;
>>> +	struct bpf_sk_storage_elem *selem;
>>> +	struct bpf_sk_storage *sk_storage;
>>> +	struct bpf_sk_storage_map *smap;
>>> +	int err;
>>> +
>>> +	/* BPF_EXIST and BPF_NOEXIST cannot be both set */
>>> +	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) ||
>>> +	    /* BPF_F_LOCK can only be used in a value with spin_lock */
>>> +	    unlikely((map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map)))
>>> +		return ERR_PTR(-EINVAL);
>>> +
>>> +	smap = (struct bpf_sk_storage_map *)map;
>>> +	sk_storage = rcu_dereference(sk->sk_bpf_storage);
>>> +	if (!sk_storage || hlist_empty(&sk_storage->list)) {
>>
>> Is it possible that sk_storage is not NULL and hlist_empty(&sk_storage->list)
>> is true?
> Very unlikely but possible.  Note that the sk->sk_bpf_storage ptr is
> protected by atomic cmpxchg().
>
>> In this case, should we avoid allocation of a new sk_storage at all?
> It should still try; there is a good chance that the cmpxchg(..., NULL, ...)
> in sk_storage_alloc() will succeed.

Right. This is a rare situation. I guess there is no need to optimize for it.

>
>>
>>> +		/* Very first elem for this sk */
>>> +		err = check_flags(NULL, map_flags);
>>> +		if (err)
>>> +			return ERR_PTR(err);
>>> +
>>> +		selem = selem_alloc(smap, sk, value, true);
>>> +		if (!selem)
>>> +			return ERR_PTR(-ENOMEM);
>>> +
>>> +		err = sk_storage_alloc(sk, smap, selem);
>>> +		if (err) {
>>> +			kfree(selem);
>>> +			atomic_sub(smap->elem_size, &sk->sk_omem_alloc);
>>
>> Uncharging sk->sk_omem_alloc already happens in sk_storage_alloc?
> They are uncharging two different things.
> sk_storage_alloc() uncharges the sk_storage.
> Here it uncharges the selem.

Yes, makes sense.

>
>>
>>> +			return ERR_PTR(err);
>>> +		}
>>> +
>>> +		return SDATA(selem);
>>> +	}
>>> +
>>> +	if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) {
>>> +		/* Hoping to find an old_sdata to do inline update
>>> +		 * such that it can avoid taking the sk_storage->lock
>>> +		 * and changing the lists.
>>> +		 */
>>> +		old_sdata = __sk_storage_lookup(sk_storage, smap, false);
>>> +		err = check_flags(old_sdata, map_flags);
>>> +		if (err)
>>> +			return ERR_PTR(err);
>>> +		if (old_sdata && selem_linked_to_sk(SELEM(old_sdata))) {
>>> +			copy_map_value_locked(map, old_sdata->data,
>>> +					      value, false);
>>> +			return old_sdata;
>>> +		}
>>> +	}
>>> +
>>> +	raw_spin_lock_bh(&sk_storage->lock);
>>> +
>>> +	/* Recheck sk_storage->list under sk_storage->lock */
>>> +	if (unlikely(hlist_empty(&sk_storage->list))) {
>>> +		/* A parallel del is happening and sk_storage is going
>>> +		 * away.  It has just been checked before, so very
>>> +		 * unlikely.  Return instead of retry to keep things
>>> +		 * simple.
>>> +		 */
>>> +		err = -EAGAIN;
>>> +		goto unlock_err;
>>> +	}
>>> +
>>> +	old_sdata = __sk_storage_lookup(sk_storage, smap, false);
>>> +	err = check_flags(old_sdata, map_flags);
>>> +	if (err)
>>> +		goto unlock_err;
>>> +
>>> +	if (old_sdata && (map_flags & BPF_F_LOCK)) {
>>> +		copy_map_value_locked(map, old_sdata->data, value, false);
>>> +		selem = SELEM(old_sdata);
>>> +		goto unlock;
>>> +	}
>>> +
>>> +	/* sk_storage->lock is held.  Hence, we are sure
>>> +	 * we can unlink and uncharge the old_sdata successfully
>>> +	 * later.
>>> +	 * Hence, instead of charging the new selem now
>>> +	 * and then uncharging the old selem later (which may cause
>>> +	 * a potential but unnecessary charge failure), avoid taking
>>> +	 * a charge at all here (the "!old_sdata" check) and the
>>> +	 * old_sdata will not be uncharged later during __selem_unlink_sk().
>>> +	 */
>>> +	selem = selem_alloc(smap, sk, value, !old_sdata);
>>> +	if (!selem) {
>>> +		err = -ENOMEM;
>>> +		goto unlock_err;
>>> +	}
>>> +
>>> +	/* First, link the new selem to the map */
>>> +	selem_link_map(smap, selem);
>>> +
>>> +	/* Second, link (and publish) the new selem to sk_storage */
>>> +	__selem_link_sk(sk_storage, selem);
>>> +
>>> +	/* Third, remove old selem, SELEM(old_sdata) */
>>> +	if (old_sdata) {
>>> +		selem_unlink_map(SELEM(old_sdata));
>>> +		__selem_unlink_sk(sk_storage, SELEM(old_sdata), false);
>>> +	}
>>> +
>>> +unlock:
>>> +	raw_spin_unlock_bh(&sk_storage->lock);
>>> +	return SDATA(selem);
>>> +
>>> +unlock_err:
>>> +	raw_spin_unlock_bh(&sk_storage->lock);
>>> +	return ERR_PTR(err);
>>> +}
>>> +
>>> +static int sk_storage_delete(struct sock *sk, struct bpf_map *map)
>>> +{
>>> +	struct bpf_sk_storage_data *sdata;
>>> +
>>> +	sdata = sk_storage_lookup(sk, map, false);
>>> +	if (!sdata)
>>> +		return -ENOENT;
>>> +
>>> +	selem_unlink(SELEM(sdata));
>>> +
>>> +	return 0;
>>> +}
>>> +
>> [...]
>>> +static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
>>> +{
>>> +	struct bpf_sk_storage_map *smap;
>>> +	unsigned int i;
>>> +	u32 nbuckets;
>>> +	u64 cost;
>>> +
>>> +	if (attr->map_flags != BPF_F_NO_PREALLOC || attr->max_entries ||
>>> +	    attr->key_size != sizeof(int) || !attr->value_size ||
>>> +	    /* Enforce BTF for userspace sk dumping */
>>> +	    !attr->btf_key_type_id || !attr->btf_value_type_id)
>>> +		return ERR_PTR(-EINVAL);
>>> +
>>> +	if (!capable(CAP_SYS_ADMIN))
>>> +		return ERR_PTR(-EPERM);
>>> +
>>> +	if (attr->value_size >= KMALLOC_MAX_SIZE -
>>> +	    MAX_BPF_STACK - sizeof(struct bpf_sk_storage_elem) ||
>>> +	    /* selem->elem_size is a u16 */
>>> +	    attr->value_size > U16_MAX - sizeof(struct bpf_sk_storage_elem))
>>> +		return ERR_PTR(-E2BIG);
>>> +
>>> +	smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN);
>>> +	if (!smap)
>>> +		return ERR_PTR(-ENOMEM);
>>> +	bpf_map_init_from_attr(&smap->map, attr);
>>> +
>>> +	smap->bucket_log = ilog2(roundup_pow_of_two(num_possible_cpus()));
>>> +	nbuckets = 1U << smap->bucket_log;
>>> +	smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets,
>>> +				 GFP_USER | __GFP_NOWARN);
>>> +	if (!smap->buckets) {
>>> +		kfree(smap);
>>> +		return ERR_PTR(-ENOMEM);
>>> +	}
>>> +	cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap);
>>> +
>>> +	for (i = 0; i < nbuckets; i++) {
>>> +		INIT_HLIST_HEAD(&smap->buckets[i].list);
>>> +		raw_spin_lock_init(&smap->buckets[i].lock);
>>> +	}
>>> +
>>> +	smap->elem_size = sizeof(struct bpf_sk_storage_elem) + attr->value_size;
>>> +	smap->cache_idx = (unsigned int)atomic_inc_return(&cache_idx) %
>>> +		BPF_SK_STORAGE_CACHE_SIZE;
>>> +	smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
>>
>> Should we test bpf_map_precharge_memlock() here?
> I did not do it here because the size of this map does not depend on
> "map.max_entries".  Meaning, this map merely needs the memory for the
> buckets, which is less likely to fail (and destroying it in case of a final
> charge failure is not that expensive either), which makes this precheck not
> as useful as for other maps.

I think it is okay as is, as you described above.
Later on, a real charging will happen, and if we are indeed running out of
locked memory, it will fail then. But this should be a rare thing.

>
> Thanks for reviewing!
>
>>
>>> +
>>> +	return &smap->map;
>>> +}
>>> +
>>> +static int notsupp_get_next_key(struct bpf_map *map, void *key,
>>> +				void *next_key)
>>> +{
>>> +	return -ENOTSUPP;
>>> +}
>>> +
>>> +static int bpf_sk_storage_map_check_btf(const struct bpf_map *map,
>>> +					const struct btf *btf,
>>> +					const struct btf_type *key_type,
>>> +					const struct btf_type *value_type)
>>> +{
>>> +	u32 int_data;
>>> +
>>> +	if (BTF_INFO_KIND(key_type->info) != BTF_KIND_INT)
>>> +		return -EINVAL;
>>> +
>>> +	int_data = *(u32 *)(key_type + 1);
>>> +	if (BTF_INT_BITS(int_data) != 32 || BTF_INT_OFFSET(int_data))
>>> +		return -EINVAL;
>>> +
>>> +	return 0;
>>> +}
>>> +
>> [...]
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index 443b98d05f1e..9773be724aa9 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -137,6 +137,7 @@
>>>
>>>  #include <linux/filter.h>
>>>  #include <net/sock_reuseport.h>
>>> +#include <net/bpf_sk_storage.h>
>>>
>>>  #include <trace/events/sock.h>
>>>
>>> @@ -1709,6 +1710,10 @@ static void __sk_destruct(struct rcu_head *head)
>>>
>>>  	sock_disable_timestamp(sk, SK_FLAGS_TIMESTAMP);
>>>
>>> +#ifdef CONFIG_BPF_SYSCALL
>>> +	bpf_sk_storage_free(sk);
>>> +#endif
>>> +
>>>  	if (atomic_read(&sk->sk_omem_alloc))
>>>  		pr_debug("%s: optmem leakage (%d bytes) detected\n",
>>>  			 __func__, atomic_read(&sk->sk_omem_alloc));
>>>
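
One last note on misc note 1, to confirm my understanding of the userspace
side: since the socket fd acts as the key there, reading a per-sk value from
user space should be as simple as the sketch below (an illustration only,
assuming libbpf's bpf_map_lookup_elem() wrapper; map_fd would come from a
pinned file or map-id, and sock_fd is the socket being inspected):

  #include <stdio.h>
  #include <linux/types.h>
  #include <bpf/bpf.h>    /* bpf_map_lookup_elem() wrapper from libbpf */

  /* map_fd: fd of the BPF_MAP_TYPE_SK_STORAGE map.
   * sock_fd: fd of the socket whose "sk-local-storage" is read.
   */
  static void dump_sk_cnt(int map_fd, int sock_fd)
  {
          __u32 cnt;

          /* From the syscall side, the socket fd is the key. */
          if (!bpf_map_lookup_elem(map_fd, &sock_fd, &cnt))
                  printf("sk cnt: %u\n", cnt);
  }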