Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> On Sat, Sep 30, 2023 at 10:10:38AM +0200, Florian Westphal wrote:
> > It does. If lockless doesn't return a match it falls back to readlock.
>
> And it will in case of the update to remove the expired element, right?

Yes, but I don't think it's a huge problem; this is short and we do not
grab the wrlock until we actually unlink.

> > > it might be possible to simply grab the spinlock in write mode and
> > > release entries immediately, without requiring the sync GC batch
> > > infrastructure that pipapo is using.
> >
> > Is there evidence that the on-demand GC is a problem?
>
> After your last fix, not really, other than we have to be careful not
> to zap elements that are in any of the pending transactions in this
> batch.

Right.  I don't see a problem with zapping those (newly added ones) from
the point-of-no-return phase onwards, since we don't have to worry about
rollback at that point.

> > It only searches the relevant subtree; it should rarely, if ever,
> > encounter any expired element.
>
> Not a problem now, but this path that we follow blocks a requested
> feature.
>
> I currently do not know how to support set element timeout updates
> with this on-demand GC on the rbtree. This feature will require a new
> update state in the transaction to update the timer for an element.
> We will have to be careful because on-demand GC might zap an expired
> element that got just refreshed in this transaction (unless we
> reintroduce some sort of 'busy' bit for updates again, which is
> something I would prefer not to do). With 2ee52ae94baa ("netfilter:
> nft_set_rbtree: skip sync GC for new elements in this transaction")
> I could fix this by looking at the generation mask to infer that the
> element is 'busy' in some pending transaction, but I do not see a way
> to do that once an element update command is in place.

Not convinced.  In particular, I don't understand what this has to do
with async vs. sync gc.

To me this depends on how we want to handle elements that have expired,
or that expire during the transaction.

Example:
Element E1, times out in 1 hour
Element E2, times out in 1 second
Element E3, timed out (1 second ago, 3 minutes ago, doesn't matter).

Userspace batch to kernel:
Update Element E1 to time out in 2 hours.
Update Element E2 to time out in 1 hour.
Update Element E3 to time out in 1 hour.

What is the expected outcome of this request?

Ignore E3 being reaped already and refresh the timeout (resurrection?)
Ignore E3 being reaped already and ignore the request?
Fail the transaction as E3 timed out already ("does not exist")?

Now, what about E2?  If the transaction is large, it could become like
E3 *during the transaction*, unless we introduce some freezing
mechanism.  What's the expected outcome?

What's the expected outcome if there is some other, unrelated failure?
I assume we have to roll back all the timeout updates, right?

If so, why not temporarily make the timeouts effective right away and
then roll back on abort?  (A rough sketch of what I mean follows below,
after the quoted text.)

What if userspace asks to *shrink* timeouts?  Same scenario:

Update Element E1 to time out in 1 second
Update Element E2 to time out in 1 second
Update Element E3 to time out in 1 second

We could of course say 'not supported', which would avoid the rollback
headache, because in this case a long transaction's gc would definitely
start to pick some of those elements up for reaping...

> Maybe take your patch and then follow up in nf-next with this approach
> once set element timer updates are introduced? Otherwise, rbtree will
> block the introduction of this new feature for other sets too.
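To make the "apply right away, roll back on abort" idea above a bit
more concrete, here is a rough sketch.  All names (demo_elem,
demo_trans_elem, demo_update_timeout, demo_abort_timeout) are made up
for illustration; this is not the nf_tables transaction API.

#include <stdint.h>

struct demo_elem {
	uint64_t expires;	/* absolute expiry, e.g. in jiffies */
};

struct demo_trans_elem {
	struct demo_elem *elem;	/* element touched by this transaction */
	uint64_t old_expires;	/* stashed for a possible rollback */
};

/* prepare path: make the new timeout visible right away */
static void demo_update_timeout(struct demo_trans_elem *t,
				uint64_t new_expires)
{
	t->old_expires = t->elem->expires;
	t->elem->expires = new_expires;
}

/* abort path: undo the update */
static void demo_abort_timeout(struct demo_trans_elem *t)
{
	t->elem->expires = t->old_expires;
}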
See above; I would first like to understand the expected semantics of
the timeout update facility.

I've pushed a (not very well tested) version of the gc overhaul to
passive lookups based on expiry candidates; this removes the need for
gc sequence counters.

It's vs. nf.git but really should be re-targeted to nf-next, I'll try
to do this next week:

https://git.kernel.org/pub/scm/linux/kernel/git/fwestphal/nf.git/log/?h=nft_set_gc_query_08
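For illustration only, the rough idea behind "expiry candidates" (not
the actual contents of the branch above; all names are made up): the
reader treats an expired element as a plain non-match and leaves the
unlink/free to GC, so readers never have to coordinate with GC through
a sequence counter.

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct demo_elem {
	uint32_t key;
	time_t expires;			/* 0 == no timeout */
	struct demo_elem *next;
};

static bool demo_elem_expired(const struct demo_elem *e, time_t now)
{
	return e->expires != 0 && now >= e->expires;
}

/* Reader side: an expired element is simply "not there"; it stays
 * linked until GC unlinks and frees it under the write lock. */
static const struct demo_elem *demo_lookup(const struct demo_elem *head,
					   uint32_t key, time_t now)
{
	const struct demo_elem *e;

	for (e = head; e; e = e->next) {
		if (e->key != key)
			continue;
		if (demo_elem_expired(e, now))
			return NULL;	/* expiry candidate: ignore it */
		return e;
	}
	return NULL;
}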