Re: [PATCH 01/12] IP set core support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Patrick,

On Wed, 19 Jan 2011, Patrick McHardy wrote:

> On 15.01.2011 00:23, Jozsef Kadlecsik wrote:
> > The patch adds the IP set core support to the kernel.
> 
> Following are some comments based on a first review. I'll make another
> pass over this patch to have a closer look at the netlink protocol.

First of all, thanks for the review! A very terse "description" of the 
protocol can be found here: 
https://git.netfilter.org/cgi-bin/gitweb.cgi?p=ipset.git;a=blob_plain;f=lib/PROTOCOL;hb=7cd45445d55b14b7aca69a81b7815a98afc51784
 
> > The IP set core implements a netlink (nfnetlink) based protocol by which
> > one can create, destroy, flush, rename, swap, list, save, restore sets,
> > and add, delete, test elements from userspace. For simplicity (and backward
> > compatibilty and for not to force ip(6)tables to be linked with a netlink
> > library) reasons a small getsockopt-based protocol is also kept in order
> > to communicate with the ip(6)tables match and target.
> > 
> > The netlink protocol passes all u16, etc values in network order, but it
> > honors the NLA_F_NET_BYTEORDER flag. The protocol enforces the proper
> > use of the NLA_F_NESTED flag for its container type of attributes.
> > Backward/forward compatibility is supported with proper protocol (and set
> > type) version number.
> 
> I don't really see the benefit of accepting both byteorders, why not
> simply define the byteorder to big endian (with or without the
> NLA_F_NET_BYTEORDER flag)? 

It's a remnant of an early version: at the very beginning the set types 
used their "natural" byte-ordering, so both form could appear. I'll
keep it as big endiand only with the NLA_F_NET_BYTEORDER flag.

> The current code will trigger sparse
> endianess warnings when doing the byteorder conversion since the value
> is always temporarily stored in a uXX.
>
> Also the remaining code doesn't seem to make use of the endian specific
> types, so this will trigger a lot of sparse warnings as well.

True, I'll change it and use __be32, etc.
  
> > diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> > new file mode 100644
> > index 0000000..7003c78
> > --- /dev/null
> > +++ b/include/linux/netfilter/ipset/ip_set.h
> > @@ -0,0 +1,522 @@
> > +#ifndef _IP_SET_H
> > +#define _IP_SET_H
> > +
> > +...
> > +#ifdef __KERNEL__
> > +#include <linux/ip.h>
> > +#include <linux/ipv6.h>
> > +#include <linux/netlink.h>
> > +#include <linux/netfilter.h>
> > +#include <linux/vmalloc.h>
> > +#include <net/netlink.h>
> 
> I think it would fit better with our current header organization to move
> this part to include/net/netfilter/...

I pondered a lot on the proper place for the header and .c files. For the 
userspace utility "ipset" is a good name, it's parallel with "ip" which 
also handles INET and INET6. So the "ip_" prefix is just appropriate for 
the kernel parts as well. However, wouldn't look it strange the mixed 
files with "ip_" and "nf_" prefixes? Also, it's a compact subsystem, 
similarly to ipvs. So I thought an ipset/ subdir were the proper place for 
the stuff.
 
> > +/* Sets are identified by an index in kernel space. Tweak with ip_set_id_t
> > + * and IPSET_INVALID_ID if you want to increase the max number of sets.
> > + */
> > +typedef u16 ip_set_id_t;
> 
> This typedef does not really fit any of the "acceptable" cases of
> CodingStyle, but I see why you're using it.

Hm, it could be eliminated: with the exception of the list:set type, the 
typedef is used in the core and the xt_set.c match/target files only. 
65535 sets should be enough for anyone (;-), but if it ever needs to be 
changed, it's not a big deal.
 
> > +
> > +...
> > +/* Allocate members */
> > +static inline void *
> > +ip_set_alloc(size_t size, gfp_t gfp_mask)
> > +{
> > +	void *members = NULL;
> > +
> > +	if (size < KMALLOC_MAX_SIZE)
> > +		members = kzalloc(size, gfp_mask | __GFP_NOWARN);
> > +
> > +	if (members) {
> > +		pr_debug("%p: allocated with kmalloc", members);
> > +		return members;
> > +	}
> > +
> > +	members = __vmalloc(size, gfp_mask | __GFP_ZERO, PAGE_KERNEL);
> > +	if (!members)
> > +		return NULL;
> > +	pr_debug("%p: allocated with vmalloc", members);
> > +
> > +	return members;
> 
> This looks too large to be inlined and is probably also not particulary
> performance critical. I'd suggest to move the allocation and free
> function to a .c file.

OK.

> > ...
> > +static const struct nla_policy ipaddr_policy[IPSET_ATTR_IPADDR_MAX + 1] = {
> > +	[IPSET_ATTR_IPADDR_IPV4]	= { .type = NLA_U32 },
> > +	[IPSET_ATTR_IPADDR_IPV6]	= { .type = NLA_BINARY,
> > +					    .len = sizeof(struct in6_addr) },
> > +};
> > +
> > +static inline int
> > +ip_set_get_ipaddr4(struct nlattr *attr[], int type, u32 *ipaddr)
> > +{
> > +	struct nlattr *tb[IPSET_ATTR_IPADDR_MAX+1] = {};
> 
> The initialization is not necessary, nla_parse already clears
> the array.

Oh, right.
 
> > +
> > +	if (!attr[type])
> > +		return -IPSET_ERR_PROTOCOL;
> > +
> > +	if (nla_parse(tb, IPSET_ATTR_IPADDR_MAX,
> > +		      nla_data(attr[type]), nla_len(attr[type]),
> > +		      ipaddr_policy))
> 
> This is exactly what nla_parse_nested() does.
> 
> > +		return -IPSET_ERR_PROTOCOL;
> > +	if (!tb[IPSET_ATTR_IPADDR_IPV4])
> > +		return -IPSET_ERR_IPADDR_IPV4;
> > +
> > +	*ipaddr = ip_set_get_n32(tb[IPSET_ATTR_IPADDR_IPV4]);
> > +	return 0;
> > +}
> 
> This also looks to large for inlining, also you're including the
> nla_policy array in every file using this.

Right again, I'll move it into the core, and the function below as well.
 
> > +
> > +static inline int
> > +ip_set_get_ipaddr6(struct nlattr *attr[], int type, union nf_inet_addr *ipaddr)
> > +{
> 
> Same comments as to ip_set_get_ipaddr4 apply.
> 
> > ...
> > +
> > +#define NLA_PUT_NET32(skb, type, value)	\
> > +	NLA_PUT_BE32(skb, type | NLA_F_NET_BYTEORDER, value)
> > +
> > +#define NLA_PUT_NET16(skb, type, value)	\
> > +	NLA_PUT_BE16(skb, type | NLA_F_NET_BYTEORDER, value)
> 
> These could maybe go into include/net/netlink.h.

OK, I'll create a patch for this.
 
> > +/* Get address from skbuff */
> > +static inline u32
> > +ip4addr(const struct sk_buff *skb, bool src)
> 
> This and other related functions should be using the endian specific
> types.

Yes, true, I'll do the required changes.
 
> > ...
> > +
> > +/* Interface to iptables/ip6tables */
> > +
> > +#define SO_IP_SET		83
> 
> Just wondering if we should define this in relation to IP_BASE_CTL
> or something like that to avoid possible clashes in the future.

Or make it a module parameter?
 
> > diff --git a/include/linux/netfilter/ipset/ip_set_kernel.h b/include/linux/netfilter/ipset/ip_set_kernel.h
> > new file mode 100644
> > index 0000000..d770589
> > --- /dev/null
> > +++ b/include/linux/netfilter/ipset/ip_set_kernel.h
> > @@ -0,0 +1,15 @@
> > +#ifndef _IP_SET_KERNEL_H
> > +#define _IP_SET_KERNEL_H
> > +
> > +#ifdef __KERNEL__
> > +
> > +#ifdef CONFIG_DEBUG_KERNEL
> > +/* Complete debug messages */
> > +#define pr_fmt(fmt) "%s %s[%i]: " fmt "\n", __FILE__, __func__, __LINE__
> > +#endif
> > +
> > +#include <linux/kernel.h>
> 
> Is this really still needed? Also include/net/netfilter would be a
> better place for a purely kernel-internal header (same comment for
> all other internal headers).

No, that was useful at the debugging. I drop it.
 
> > +
> > +#endif	/* __KERNEL__ */
> > +
> > +#endif /*_IP_SET_H */
> > diff --git a/include/linux/netfilter/ipset/pfxlen.h b/include/linux/netfilter/ipset/pfxlen.h
> > new file mode 100644
> > index 0000000..fe7153c
> > --- /dev/null
> > +++ b/include/linux/netfilter/ipset/pfxlen.h
> > @@ -0,0 +1,16 @@
> > +#ifndef _PFXLEN_H
> > +#define _PFXLEN_H
> > +
> > +#include <asm/byteorder.h>
> > +#include <linux/netfilter.h> 
> > +
> > +/* Prefixlen maps, by Jan Engelhardt  */
> > +extern const union nf_inet_addr prefixlen_netmask_map[];
> > +extern const union nf_inet_addr prefixlen_hostmask_map[];
> > +
> > +#define NETMASK(n)	prefixlen_netmask_map[n].ip
> > +#define NETMASK6(n)	prefixlen_netmask_map[n].ip6
> > +#define HOSTMASK(n)	prefixlen_hostmask_map[n].ip
> > +#define HOSTMASK6(n)	prefixlen_hostmask_map[n].ip6
> 
> These names seem a bit to generic.

PREFIXLEN2NETMAKS are just overly long. PFLEN2NETMASK?

> > +
> > +#endif /*_PFXLEN_H */
> > --- /dev/null
> > +++ b/net/netfilter/ipset/ip_set_core.c
> > @@ -0,0 +1,1556 @@
> > +/* Copyright (C) 2000-2002 Joakim Axelsson <gozem@xxxxxxxx>
> > + *                         Patrick Schaaf <bof@xxxxxx>
> > + * Copyright (C) 2003-2011 Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxxxxxx>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +/* Kernel module for IP set management */
> > +
> > +#include <linux/netfilter/ipset/ip_set_kernel.h>
> > +#include <linux/init.h>
> > +#include <linux/module.h>
> > +#include <linux/moduleparam.h>
> > +#include <linux/ip.h>
> > +#include <linux/skbuff.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/netlink.h>
> > +#include <linux/rculist.h>
> > +#include <net/netlink.h>
> > +
> > +#include <linux/netfilter.h>
> > +#include <linux/netfilter/nfnetlink.h>
> > +#include <linux/netfilter/ipset/ip_set.h>
> > +
> > +static struct list_head ip_set_type_list;	/* all registered set types */
> 
> static LIST_HEAD(..) avoids the need for explicit initialization
> during runtime.

OK.
 
> > +static DEFINE_MUTEX(ip_set_type_mutex);		/* protects ip_set_type_list */
> > +
> > +static struct ip_set **ip_set_list;		/* all individual sets */
> > +static ip_set_id_t ip_set_max = CONFIG_IP_SET_MAX; /* max number of sets */
> > +
> > +#define STREQ(a, b)	(strncmp(a, b, IPSET_MAXNAMELEN) == 0)
> > +
> > +static int max_sets;
> > +
> > +module_param(max_sets, int, 0600);
> 
> unsigned int?

Why it's int at the first place??
 
> > +MODULE_PARM_DESC(max_sets, "maximal number of sets");
> > +MODULE_LICENSE("GPL");
> > +MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxxxxxx>");
> > +MODULE_DESCRIPTION("core IP set support");
> > +MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_IPSET);
> > +
> > +/*
> > + * The set types are implemented in modules and registered set types
> > + * can be found in ip_set_type_list. Adding/deleting types is
> > + * serialized by ip_set_type_mutex.
> > + */
> > +
> > +static inline void
> > +ip_set_type_lock(void)
> > +{
> > +	mutex_lock(&ip_set_type_mutex);
> > +}
> > +
> > +static inline void
> > +ip_set_type_unlock(void)
> > +{
> > +	mutex_unlock(&ip_set_type_mutex);
> > +}
> > +
> > +/* Register and deregister settype */
> > +
> > +static inline struct ip_set_type *
> > +find_set_type(const char *name, u8 family, u8 revision)
> 
> Except when the function is really small and performance critical,
> its usually better to let the compiler decide based on the options
> given (like optimize for size).

OK.
 
> > +{
> > +	struct ip_set_type *type;
> > +
> > +	list_for_each_entry_rcu(type, &ip_set_type_list, list)
> > +		if (STREQ(type->name, name)
> > +		    && (type->family == family || type->family == AF_UNSPEC)
> > +		    && type->revision == revision)
> 
> This "&&" and "||" continuation style is used throughout the entire
> ipset code. We had a bunch of patches a while ago to get rid of this
> in the entire net/ tree, so I'd prefer not to reintroduce this.

Oh, I'll fix that everywhere.
 
> > +			return type;
> > +	return NULL;
> > +}
> > +
> 
> > +static inline void
> > +load_type_module(const char *typename)
> > +{
> > +	pr_debug("try to load ip_set_%s", typename);
> > +	request_module("ip_set_%s", typename);
> > +}
> > +
> > +static int
> > +ip_set_create(struct sock *ctnl, struct sk_buff *skb,
> > +	      const struct nlmsghdr *nlh,
> > +	      const struct nlattr * const attr[])
> > +{
> > +	struct ip_set *set, *clash;
> > +	ip_set_id_t index = IPSET_INVALID_ID;
> > +	const char *name, *typename;
> > +	u8 family, revision;
> > +	u32 flags = flag_exist(nlh);
> > +	int ret = 0, len;
> > +
> > +	if (unlikely(protocol_failed(attr)
> > +		     || attr[IPSET_ATTR_SETNAME] == NULL
> > +		     || attr[IPSET_ATTR_TYPENAME] == NULL
> > +		     || attr[IPSET_ATTR_REVISION] == NULL
> > +		     || attr[IPSET_ATTR_FAMILY] == NULL
> > +		     || (attr[IPSET_ATTR_DATA] != NULL
> > +			 && !flag_nested(attr[IPSET_ATTR_DATA]))))
> > +		return -IPSET_ERR_PROTOCOL;
> > +
> > +	name = nla_data(attr[IPSET_ATTR_SETNAME]);
> > +	typename = nla_data(attr[IPSET_ATTR_TYPENAME]);
> > +	family = nla_get_u8(attr[IPSET_ATTR_FAMILY]);
> > +	revision = nla_get_u8(attr[IPSET_ATTR_REVISION]);
> > +	pr_debug("setname: %s, typename: %s, family: %s, revision: %u",
> > +		 name, typename, family_name(family), revision);
> > +
> > +	/*
> > +	 * First, and without any locks, allocate and initialize
> > +	 * a normal base set structure.
> > +	 */
> > +	set = kzalloc(sizeof(struct ip_set), GFP_KERNEL);
> > +	if (!set)
> > +		return -ENOMEM;
> > +	rwlock_init(&set->lock);
> > +	strlcpy(set->name, name, IPSET_MAXNAMELEN);
> > +	atomic_set(&set->ref, 0);
> > +	set->family = family;
> > +
> > +	/*
> > +	 * Next, check that we know the type, and take
> > +	 * a reference on the type, to make sure it stays available
> > +	 * while constructing our new set.
> > +	 *
> > +	 * After referencing the type, we try to create the type
> > +	 * specific part of the set without holding any locks.
> > +	 */
> > +	set->type = find_set_type_rcu(typename, family, revision);
> > +	if (set->type == NULL) {
> > +		/* Try loading the module */
> > +		load_type_module(typename);
> 
> We holding the nfnl_mutex here, so you can't perform a module load.
> It needs to release the mutex, perform the module load, grab the
> mutex again and redo all checks that depend on the mutex. Check
> out how ctnetlink does it (return -EAGAIN after successful load and
> have nfnetlink replay the entire request).

Ooops, good catch, I'll rewrite it that way.
 
> > +		set->type = find_set_type_rcu(typename, family, revision);
> > +		if (set->type == NULL) {
> > +			pr_warning("Can't find ip_set type %s, family %s, "
> > +				   "revision %u: set '%s' not created",
> > +				   typename, family_name(family), revision,
> > +				   name);
> > +			ret = -IPSET_ERR_FIND_TYPE;
> > +			goto out;
> > +		}
> > +	}
> > +	if (!try_module_get(set->type->me)) {
> > +		rcu_read_unlock();
> > +		ret = -EFAULT;
> > +		goto out;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	...
> > +}
> > +
> > +
> > +/* Rename a set */
> > +
> > +static const struct nla_policy
> > +ip_set_setname2_policy[IPSET_ATTR_CMD_MAX + 1] = {
> > +	[IPSET_ATTR_PROTOCOL]	= { .type = NLA_U8 },
> > +	[IPSET_ATTR_SETNAME]	= { .type = NLA_NUL_STRING,
> > +				    .len = IPSET_MAXNAMELEN - 1 },
> > +	[IPSET_ATTR_SETNAME2]	= { .type = NLA_NUL_STRING,
> > +				    .len = IPSET_MAXNAMELEN - 1 },
> > +};
> > +
> > +static int
> > +ip_set_rename(struct sock *ctnl, struct sk_buff *skb,
> > +	      const struct nlmsghdr *nlh,
> > +	      const struct nlattr * const attr[])
> > +{
> > +	struct ip_set *set;
> > +	const char *name2;
> > +	ip_set_id_t i;
> > +
> > +	if (unlikely(protocol_failed(attr)
> > +		     || attr[IPSET_ATTR_SETNAME] == NULL
> > +		     || attr[IPSET_ATTR_SETNAME2] == NULL))
> > +		return -IPSET_ERR_PROTOCOL;
> > +
> > +	set = find_set(nla_data(attr[IPSET_ATTR_SETNAME]));
> > +	if (set == NULL)
> > +		return -EEXIST;
> > +	if (atomic_read(&set->ref) != 0)
> > +		return -IPSET_ERR_REFERENCED;
> > +
> > +	name2 = nla_data(attr[IPSET_ATTR_SETNAME2]);
> > +	for (i = 0; i < ip_set_max; i++) {
> > +		if (ip_set_list[i] != NULL
> > +		    && STREQ(ip_set_list[i]->name, name2))
> 
> There's also a nla_strcmp function, which can take the netlink
> attribute as parameter directly.
> 
> > +			return -IPSET_ERR_EXIST_SETNAME2;
> > +	}
> > +	strncpy(set->name, name2, IPSET_MAXNAMELEN);
> 
> And a nla_strlcpy() function.

I'll use nla_strcmp and nla_strlcpy. 

Thanks again, I fix the issues above and re-send the patches.

Best regards,
Jozsef
-
E-mail  : kadlec@xxxxxxxxxxxxxxxxx, kadlec@xxxxxxxxxxxx
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Netfitler Users]     [LARTC]     [Bugtraq]     [Yosemite Forum]

  Powered by Linux