On Sat, Jan 25, 2014 at 05:14:51PM +0000, Patrick McHardy wrote: > On Sat, Jan 25, 2014 at 04:35:33PM +0000, Patrick McHardy wrote: > > On Sat, Jan 25, 2014 at 01:55:52PM +0000, Patrick McHardy wrote: > > > On Sat, Jan 25, 2014 at 02:03:51PM +0100, Pablo Neira Ayuso wrote: > > > > We still have a bug somewhere else. When creating 10000 rules like: > > > > tcp dport { 22, 23 }, I can see more than 10000 sets. > > > > > > > > # ./nft-set-get ip | wc -l > > > > 10016 > > > > > > > > It seems set 511 is not being used. See below: > > > > > > > > # ./nft-rule-get > > > > ip filter output 513 512 > > > > [ payload load 1b @ network header + 9 => reg 1 ] > > > > [ cmp eq reg 1 0x00000006 ] > > > > [ payload load 2b @ transport header + 2 => reg 1 ] > > > > [ lookup reg 1 set set510 ] > > > > [ counter pkts 0 bytes 0 ] > > > > > > > > ip filter output 514 513 > > > > [ payload load 1b @ network header + 9 => reg 1 ] > > > > [ cmp eq reg 1 0x00000006 ] > > > > [ payload load 2b @ transport header + 2 => reg 1 ] > > > > [ lookup reg 1 set set512 ] > > > > [ counter pkts 0 bytes 0 ] > > > > > > > > It seems to happen every 512 sets are added. Still investigating, so > > > > this needs a second follow up patch to resolve what Arturo is reporting. > > > > > > Yeah, we seem to have a couple of bugs in nf_tables_set_alloc_name(). > > > I'll fix them up and will then have a look at this patch. > > > > I can't reproduce the gaps in the name space, but we have an obvious > > overflow since we're using BITS_PER_LONG * PAGE_SIZE instead of BITS_PER_BYTE. > > > > This shouldn't have affected your test case though since the overflow only > > happens for more than 32768 sets. > > > > As a start, please try this patch. It fixes the overflow, might also > fix your problem. > > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c > index 9ce3053..e8c7437 100644 > --- a/net/netfilter/nf_tables_api.c > +++ b/net/netfilter/nf_tables_api.c > @@ -1989,13 +1992,13 @@ static int nf_tables_set_alloc_name(struct nft_ctx *ctx, struct nft_set *set, > > if (!sscanf(i->name, name, &tmp)) > continue; > - if (tmp < 0 || tmp > BITS_PER_LONG * PAGE_SIZE) > + if (tmp < 0 || tmp >= BITS_PER_BYTE * PAGE_SIZE) > continue; > > set_bit(tmp, inuse); > } > > - n = find_first_zero_bit(inuse, BITS_PER_LONG * PAGE_SIZE); > + n = find_first_zero_bit(inuse, BITS_PER_BYTE * PAGE_SIZE); > free_page((unsigned long)inuse); > } > Tested this patch, it works fine here, I hit -EMFILE with 32768 sets with no crashes. The problem I was reporting was different though, I found a bug in the batching code of libmnl. The mnl_nlmsg_batch_next function was not accounting the last message not fitting in the batch. With my patch + libmnl patch I can perform: nft -f pablo-lots-test; nft flush table filter; nft delete chain filter output; nft delete table filter without seeing unused anonymous sets left attached to the table and -EBUSY problems in that table. > Another thing is that our name allocation algorithm really sucks. It > was copied from dev_alloc_name(), but network device allocation doesn't > happen on the same scale as we might have. I'm considering switching to > something taking O(1). Basically, the name allocation is only useful for > anonymous sets anyway since in all other cases you need to manually populate > them. So if we switch to a prefix string that can't clash with user defined > names, we can simply use an incrementing 64 bit counter. So my > proposal would be to just use names starting with \0. Alternatively use a > handle instead of a name for anonymous sets. > > The second upside is that its not possible anymore for the user to run > into unexpected EEXIST when using setN or mapN as name. > > Thoughts? I like the u64 handle for anonymous sets, it's similar to what we do with other objects, we get O(1) handle allocation. I think we can allow both u64 and set%d, map%d. The kernel can check if the handle is available first, if not check if the name looks like set%d, map%d (so the the maximum number of sets limitation only applies to that case). Userspace only needs to send both set%d and the u64 handle. Would you be OK with that?
diff --git a/src/nlmsg.c b/src/nlmsg.c index fdb7af8..0a414a7 100644 --- a/src/nlmsg.c +++ b/src/nlmsg.c @@ -484,14 +484,15 @@ EXPORT_SYMBOL(mnl_nlmsg_batch_stop); bool mnl_nlmsg_batch_next(struct mnl_nlmsg_batch *b) { struct nlmsghdr *nlh = b->cur; + bool ret = true; if (b->buflen + nlh->nlmsg_len > b->limit) { b->overflow = true; - return false; + ret = false; } b->cur = b->buf + b->buflen + nlh->nlmsg_len; b->buflen += nlh->nlmsg_len; - return true; + return ret; } EXPORT_SYMBOL(mnl_nlmsg_batch_next);