On Tue, Jul 23, 2024 at 02:57:07PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Jul 23, 2024 at 02:19:25PM +0200, Pablo Neira Ayuso wrote:
> > On Tue, Jul 23, 2024 at 01:56:46PM +0200, Phil Sutter wrote:
> > > Some digging and lots of printf's later:
> > >
> > > On Mon, Jul 22, 2024 at 11:34:01PM +0200, Pablo Neira Ayuso wrote:
> > > [...]
> > > > I can reproduce it:
> > > >
> > > > # nft -i
> > > > nft> add table inet foo
> > > > nft> add chain inet foo bar { type filter hook input priority filter; }
> > > > nft> add rule inet foo bar accept
> > >
> > > This bumps cache->flags from 0 to 0x1f (no cache -> NFT_CACHE_OBJECT).
> > >
> > > > nft> insert rule inet foo bar index 0 accept
> > >
> > > This adds NFT_CACHE_RULE_BIT and NFT_CACHE_UPDATE, cache is updated (to
> > > fetch rules).
> > >
> > > > nft> add rule inet foo bar index 0 accept
> > >
> > > No new flags for this one, so the code hits the 'genid == cache->genid +
> > > 1' case in nft_cache_is_updated() which bumps the local genid and skips
> > > a cache update. The new rule then references the cached copy of the
> > > previously committed one which still does not have a handle. Therefore
> > > link_rules() does its thing for references to uncommitted rules which
> > > later fails.
> > >
> > > Pablo: Could you please explain the logic around this cache->genid
> > > increment? Commit e791dbe109b6d ("cache: recycle existing cache with
> > > incremental updates") is not clear to me in this regard. How can the
> > > local process know it doesn't need whatever has changed in the kernel?
> >
> > The idea is to use the ruleset generation ID as a hint to infer if the
> > existing cache can be recycled, to speed up incremental updates. This
> > is not sufficient for the index cache, see below.
>
> I have to revisit e791dbe109b6d, another process could race to bump
> the generation ID incrementally and I incorrectly assumed cache is
> consistent.

It might be fine, because cache->genid != 0 means we have fetched from the
kernel previously and thus also committed a change (list commands set
CACHE_REFRESH). The kernel genid is then expected to be cache->genid + 1,
and a concurrent commit would bump it again.

I don't like the commit because it breaks with the assumption that kernel
genid matching cache genid means the cache is up to date. It may indeed be,
but I think it's thin ice and the caching code is pretty complex as-is. :/

Cheers, Phil
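
For reference, the shortcut under discussion can be modelled with a toy sketch. This is not the nftables source: every toy_* name and struct below is hypothetical, the cache holds a single rule for brevity, and a handle of 0 is assumed to mean "no handle known yet". It only mirrors the behaviour described in this thread: a kernel genid equal to cache genid + 1 bumps the local genid and skips the refresh, while any larger gap forces one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins (hypothetical), not the real nftables structures. */
struct toy_rule {
	uint64_t handle;	/* assumption: 0 means "no handle known yet" */
};

struct toy_cache {
	uint32_t genid;		/* generation the cache was last synced to */
	struct toy_rule rule;	/* single cached rule, for brevity */
};

struct toy_kernel {
	uint32_t genid;
	uint64_t next_handle;
	struct toy_rule rule;
};

/* Any commit bumps the kernel generation ID by one and assigns a handle. */
static void toy_kernel_commit(struct toy_kernel *k)
{
	k->rule.handle = k->next_handle++;
	k->genid++;
}

/* Full refresh: copy kernel state, including the handle, into the cache. */
static void toy_cache_refresh(struct toy_cache *c, const struct toy_kernel *k)
{
	c->rule = k->rule;
	c->genid = k->genid;
}

/* Models the shortcut: returning true means "skip the refresh". */
static bool toy_cache_is_updated(struct toy_cache *c, uint32_t kernel_genid)
{
	if (kernel_genid == c->genid)
		return true;
	if (kernel_genid == c->genid + 1) {
		c->genid++;	/* assumes the bump came from our own commit */
		return true;
	}
	return false;
}

/* One command cycle; returns the handle the cache ends up with. */
static uint64_t toy_sync(struct toy_cache *c, const struct toy_kernel *k)
{
	if (!toy_cache_is_updated(c, k->genid))
		toy_cache_refresh(c, k);
	return c->rule.handle;
}
```

With a single commit between two syncs the shortcut is taken and the cached rule keeps handle 0, which is the state the third command in the reproducer runs into. With two commits in between (ours plus a concurrent one), the gap is two and the cache is refreshed.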