Re: [Patch bpf-next v5 1/3] bpf: introduce timeout hash map

On 1/22/21 9:54 PM, Cong Wang wrote:
From: Cong Wang <cong.wang@xxxxxxxxxxxxx>

This borrows the idea from conntrack and will be used for conntrack in
ebpf too. Each element in a timeout map has a user-specified timeout
in msecs; after it expires, the element is automatically removed from
the map. Cilium already does something similar: it uses a regular map or
LRU map to track connections and runs its own GC in user space. This
does not scale well when we have millions of connections, as each
removal needs a syscall. Even if we could batch the operations, it would
still copy a lot of data between kernel and user space.

There are two cases to consider here:

1. When the timeout map is idle, i.e. no one updates or accesses it,
    we rely on the delayed work to scan the whole hash table and remove
    the expired elements. The delayed work is scheduled every 1 sec when
    idle, which is also what conntrack uses. It is fine to scan the
    whole table as we do not actually remove elements during this scan;
    instead we simply queue them on a lockless list and defer all the
    removals to the next schedule.

2. When the timeout map is actively accessed, we may reach expired
    elements before the idle work scans them. In that case we simply
    skip them and schedule the delayed work immediately to do the
    actual removals, since we have to avoid taking locks on the fast
    path (a rough sketch of this follows below).
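
Not part of the patch, just to make case 2 concrete: a rough sketch of
the lockless fast-path skip, assuming expires is compared against
jiffies and reusing the htab_gc_elem()/gc_work names that appear in the
diff further down.

/* Sketch only: treat an expired element as a miss on the fast path,
 * queue it for the GC work and kick that work immediately, all without
 * taking the bucket lock.
 */
static struct htab_elem *elem_or_expired(struct bpf_htab *htab,
					 struct htab_elem *l)
{
	if (l && time_after_eq(jiffies, (unsigned long)l->expires)) {
		htab_gc_elem(htab, l);		/* put on the lockless GC list */
		mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
		return NULL;			/* skip the expired element */
	}
	return l;
}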

The timeout of an element can be set or updated via bpf_map_update_elem(),
and we reuse the upper 32 bits of the 64-bit flags argument for the
timeout value, as only a few of those bits are currently used. Note that
a zero timeout means the element expires immediately.
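
Just a sketch, not from the patch: how a user-space caller might encode
the per-element timeout in the upper 32 bits of the flags as described
above. The MAP_UPDATE_TIMEOUT_MS() macro name is invented for this
example.

#include <bpf/bpf.h>

/* Hypothetical helper: shift the timeout (msecs) into the upper 32 bits,
 * leaving the lower bits for the usual BPF_ANY/BPF_NOEXIST/BPF_EXIST.
 */
#define MAP_UPDATE_TIMEOUT_MS(ms)	((__u64)(ms) << 32)

static int update_with_timeout(int map_fd, const void *key,
			       const void *value, __u32 timeout_ms)
{
	return bpf_map_update_elem(map_fd, key, value,
				   BPF_ANY | MAP_UPDATE_TIMEOUT_MS(timeout_ms));
}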

To avoid adding memory overhead to regular maps, we have to reuse a
field in struct htab_elem, namely lru_node. Otherwise we would have
to rewrite a lot of code.
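
Rough idea only (the real layout in the patch may differ, and gc_node is
a made-up name here): since a timeout map is never an LRU map, the space
taken by lru_node can hold the GC bookkeeping instead, along these lines:

struct htab_elem {
	/* ... existing members elided ... */
	union {
		struct bpf_lru_node lru_node;		/* LRU maps */
		struct {				/* timeout maps */
			struct llist_node gc_node;	/* lockless GC queue */
			u64 expires;			/* set via msecs_to_expire() */
		};
	};
	u32 hash;
	char key[] __aligned(8);
};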

For now, batch ops are not supported; we can add them later if needed.

Back in the earlier conversation [0], I also mentioned the LRU map flavors and
suggested looking into adding a flag, so we wouldn't need new
BPF_MAP_TYPE_TIMEOUT_HASH/*LRU types that replicate the existing types once
again, just with the timeout added; UAPI-wise, a new map type is not great.

Given you mention Cilium above: only on kernels without an LRU hash map, that is
< 4.10, do we rely on the plain hash map; everywhere else we use LRU + prealloc to
mitigate DDoS by refusing to add new entries when full, while less active ones get
purged instead. A timeout /only/ for the plain hash map is therefore less useful
overall. Did you sketch a more generic approach in the meantime that would work for
all the htab/lru flavors (and ideally one not based on delayed_work)?

  [0] https://lore.kernel.org/bpf/20201214201118.148126-1-xiyou.wangcong@xxxxxxxxx/
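
To make the flag idea concrete, a sketch of what such a UAPI could look
like from a BPF program, composing with both hash and LRU hash.
BPF_F_TIMEOUT (and its value) as well as the key/value types are made up
here and do not exist upstream:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define BPF_F_TIMEOUT	(1U << 15)	/* hypothetical flag for this sketch */

struct conn_key { __u32 saddr, daddr; __u16 sport, dport; __u8 proto; };
struct conn_val { __u64 pkts, bytes; };

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);	/* or BPF_MAP_TYPE_HASH */
	__uint(max_entries, 1 << 20);
	__uint(map_flags, BPF_F_TIMEOUT);
	__type(key, struct conn_key);
	__type(value, struct conn_val);
} conn_map SEC(".maps");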

[...]
@@ -1012,6 +1081,8 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
  			copy_map_value_locked(map,
  					      l_old->key + round_up(key_size, 8),
  					      value, false);
+			if (timeout_map)
+				l_old->expires = msecs_to_expire(timeout);
  			return 0;
  		}
  		/* fall through, grab the bucket lock and lookup again.
@@ -1020,6 +1091,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
  		 */
  	}
+again:
  	ret = htab_lock_bucket(htab, b, hash, &flags);
  	if (ret)
  		return ret;
@@ -1040,26 +1112,41 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
  		copy_map_value_locked(map,
  				      l_old->key + round_up(key_size, 8),
  				      value, false);
+		if (timeout_map)
+			l_old->expires = msecs_to_expire(timeout);
  		ret = 0;
  		goto err;
  	}
  	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
-				l_old);
+				timeout_map, l_old);
  	if (IS_ERR(l_new)) {
-		/* all pre-allocated elements are in use or memory exhausted */
  		ret = PTR_ERR(l_new);
+		if (ret == -EAGAIN) {
+			htab_unlock_bucket(htab, b, hash, flags);
+			htab_gc_elem(htab, l_old);
+			mod_delayed_work(system_unbound_wq, &htab->gc_work, 0);
+			goto again;

Also, this one looks rather worrying: the BPF prog is stalled here, loop-waiting
in the (e.g. XDP) hot path for system_unbound_wq to kick in before it can make
forward progress?

+		}
+		/* all pre-allocated elements are in use or memory exhausted */
  		goto err;
  	}
+	if (timeout_map)
+		l_new->expires = msecs_to_expire(timeout);
+


