Hi Jan,
I have trimmed the netfilter list from the CC; I don't think this needs
users' attention, at least not yet.
Some quick impressions on your proposal:
On 24/11/10 23:29, Jan Engelhardt wrote:
By request of Pablo, I am posting the Xtables2 Netlink interface
specification for review. Additionally, further documentation and
toolchain around it is available through the temporary project page at
http://jengelh.medozas.de/projects/xtables/
which currently includes
* User Documentation Chapter 1: Architectural Differences
* Developer Documentation Part 1: Netlink interface (WIP)
This is copied below to facilitate inline replies
* Runnable Linux source tree
* Runnable userspace library (libnetfilter_xtables)
with small test-and-debug program
--8<--
Netlink interface
1 General use
1.1 Socket
Xtables2 is usable through a Netlink socket of type
NETLINK_XTABLES. No intermediate subsystem like nfnetlink is
used, because the kernel's nfnetlink parser does not make all
attributes available to (in-kernel) nfnetlink users.
#include <sys/socket.h>
#include <linux/netlink.h>
#define NETLINK_XTABLES 21
nfxt_socket = socket(AF_NETLINK, SOCK_RAW, NETLINK_XTABLES);
The NETLINK_XTABLES constant is defined in linux/netlink.h with
the value 21.
This has to go on top of nfnetlink, like the other netfilter subsystems.
1.2 Message format
All messages transmitted over the Netlink socket are to have the
base struct nlmsghdr header, followed by a version tag to allow
for the flexibility of data following it:
struct xtnetlink_genhdr {
uint32_t version;
};
The version member is always 0 in the current implementation.
Following the genhdr can be any number of standard Netlink
attributes (struct nlattr plus their payload).
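As a rough illustration of the framing just described, here is a minimal C sketch that lays out struct nlmsghdr followed by the genhdr. The names nlmsghdr_sketch and xt_msg_begin are hypothetical and only serve this example; the header struct is a local mirror of the kernel's 16-byte struct nlmsghdr so that the sketch builds without Linux headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the kernel's struct nlmsghdr (16 bytes); hypothetical name. */
struct nlmsghdr_sketch {
	uint32_t nlmsg_len;   /* total message length, header included */
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

struct xtnetlink_genhdr {
	uint32_t version;     /* always 0 in the current implementation */
};

/*
 * Write nlmsghdr + genhdr at the start of buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t xt_msg_begin(void *buf, size_t cap, uint16_t type,
			   uint16_t flags, uint32_t seq)
{
	struct nlmsghdr_sketch nlh = {0};
	struct xtnetlink_genhdr gh = { .version = 0 };
	size_t len = sizeof(nlh) + sizeof(gh);

	if (cap < len)
		return 0;
	nlh.nlmsg_len = (uint32_t)len;  /* must grow as attributes are appended */
	nlh.nlmsg_type = type;
	nlh.nlmsg_flags = flags;
	nlh.nlmsg_seq = seq;
	memcpy(buf, &nlh, sizeof(nlh));
	memcpy((char *)buf + sizeof(nlh), &gh, sizeof(gh));
	return len;
}
```

Attributes would then be appended at the returned offset, with nlmsg_len updated to match.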
Often, a logical tree structure is used to describe something,
such as for example tables of chains of rules:
filter
 \__ INPUT
 |    \__ some rule
 \__ FORWARD
 |    \__ rule2
 |    \__ rule3
 |    \__ rule4
 \__ OUTPUT
      \__ rule5
      \__ rule6
For this document, child objects are always "nested" within a
parent object, irrespective of the serialized encoding.
There are different ways to encode such a tree structure into a
serialized stream. In many Netlink protocols, child attributes
are encapsulated (a.k.a. "nested", though we will avoid this
term to avoid double use) and treated as a whole as a parent's
opaque data. We will call this format "Encapsulated Encoding".
To encode an attribute's length, struct nlattr only has a 16-bit
field, which means the attribute header plus payload is limited
to 64 KB. This is easily exceeded with the encapsulated
encoding, since a chain attribute has to hold all the rules of
that chain, for example.
The problem is aggravated by the kernel's Netlink handler only
allocating skbs a page size worth, which in the worst case means
that the usable payload for attributes is only around 3600 bytes.
In light of xt_u32's private data block being 1984 bytes already,
that means that you won't be able to fit two -m u32 invocations
nested in a single rule into a dump.
>
The Xtables2 Netlink protocol however encodes each node as a
standalone attribute, to be called Flat Encoding, that is
appended (a.k.a. "chained") to the data stream. This makes it
possible to split requests and dumps at a finer level than
encapsulation would. Above all, it guarantees extensions
data blocks of a minimum size.
>
Since Netlink messages do have a 32-bit quantity to store the
message length, rulesets of roughly up to 4 GB are possible,
which is currently regarded as sufficient. The largest (and
meaningful) rulesets seen to date in the industry weighed in at
approximately 150 MB.
You can split data into several messages and avoid this limitation.
Whereas attribute nesting automatically provided for boundaries,
this is realized using a dummy attribute in the chained approach.
Certain attributes can start such a flattened nesting, and
NFXTA_STOP terminates it.
I don't like this trailing attribute, see below.
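To make the consequence of the flat encoding concrete: a receiver has to track the nest depth itself, since nothing in the attribute header delimits a level. Below is a minimal C sketch of such a walker. The function names are hypothetical, and only NFXTA_STOP and NFXTA_CHAIN take their values from the attribute table in the proposal; the values for NFXTA_RULE, NFXTA_MATCH and NFXTA_TARGET are assumptions, since the table leaves them blank. Note that the walker necessarily depends on attribute order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	NFXTA_STOP   = 1,  /* value from the attribute table */
	NFXTA_CHAIN  = 4,  /* value from the attribute table */
	NFXTA_RULE   = 8,  /* assumed: the table leaves this value blank */
	NFXTA_MATCH  = 12, /* assumed */
	NFXTA_TARGET = 14, /* assumed */
};

/* Does this attribute type open a flattened nest level? */
static int xt_opens_level(uint16_t type)
{
	return type == NFXTA_CHAIN || type == NFXTA_RULE ||
	       type == NFXTA_MATCH || type == NFXTA_TARGET;
}

/*
 * Walk a flat attribute stream (host byte order, 4-byte aligned attributes).
 * Returns 0 if every opener is closed by an NFXTA_STOP, -1 otherwise.
 */
static int xt_check_nesting(const unsigned char *p, size_t len)
{
	int depth = 0;

	while (len >= 4) {
		uint16_t nla_len, nla_type;
		size_t step;

		memcpy(&nla_len, p, sizeof(nla_len));
		memcpy(&nla_type, p + 2, sizeof(nla_type));
		if (nla_len < 4 || nla_len > len)
			return -1;              /* truncated or bogus header */
		if (xt_opens_level(nla_type))
			++depth;
		else if (nla_type == NFXTA_STOP && --depth < 0)
			return -1;              /* STOP without an opener */
		step = ((size_t)nla_len + 3) & ~(size_t)3;
		if (step > len)
			step = len;             /* last attribute may lack padding */
		p += step;
		len -= step;
	}
	return (len == 0 && depth == 0) ? 0 : -1;
}
```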
2 Attributes
The meaning of attributes depends upon the nesting level in which
they appear. Their type however remains the same, such that a
single Netlink attribute validation policy object (struct
nla_policy) is sufficient.
A table of all known attributes:
+--------+-----------------+---------------+----------------+
| Value  | Mnemonic        | C type        | NLA type       |
+--------+-----------------+---------------+----------------+
| 1      | NFXTA_STOP      |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
| 2      | NFXTA_ERRNO     | int           | NLA_U32        |
+--------+-----------------+---------------+----------------+
| 3      | NFXTA_NAME      | char []       | NLA_NUL_STRING |
+--------+-----------------+---------------+----------------+
| 4      | NFXTA_CHAIN     |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
| 5      | NFXTA_HOOKNUM   | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+
| 6      | NFXTA_PRIORITY  | int           | NLA_U32        |
+--------+-----------------+---------------+----------------+
| 7      | NFXTA_NFPROTO   | uint8_t       | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_RULE      |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|        | NFXTA_OFFSET    | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_LENGTH    | size_t        | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_VERDICT   | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_MATCH     |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|        | NFXTA_DATA      |               | NLA_BINARY     |
+--------+-----------------+---------------+----------------+
|        | NFXTA_TARGET    |               | NLA_FLAG       |
+--------+-----------------+---------------+----------------+
|        | NFXTA_JUMP      | char []       | NLA_NUL_STRING |
+--------+-----------------+---------------+----------------+
|        | NFXTA_GOTO      | char []       | NLA_NUL_STRING |
+--------+-----------------+---------------+----------------+
|        | NFXTA_REVISION  | uint8_t       | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_SIZE      | size_t        | NLA_U32        |
+--------+-----------------+---------------+----------------+
|        | NFXTA_HOOKMASK  | unsigned int  | NLA_U32        |
+--------+-----------------+---------------+----------------+
The kernel ignores attributes with value 0 during validation, so
it was left unused.
2.1 Nest level terminator<sub:nfxta_stop>
This attribute serves to denote the end of a nesting level as
introduced by NFXTA_CHAIN, NFXTA_RULE, NFXTA_MATCH or
NFXTA_TARGET. It has no data portion.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
It's not a good idea to make assumptions about the order of the TLVs in a
Netlink message. I mean, you should not assume that NFXTA_STOP comes
after one specific attribute.
2.2 Dump error code<sub:nfxta_errno>
Once a NLM_F_MULTI dump operation has been started, for example
with the NFXTM_CHAIN_DUMP request, Netlink kernel users must
always end it successfully with NLMSG_DONE. To convey an error
during the dump, Xtables2 will emit a NFXTA_ERRNO attribute into
the stream (if it can), emit no further attributes for the
request, and cause the dump to stop.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_ERRNO        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| int errno;                                                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Isn't nlmsg_err OK for your needs?
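For comparison, this is roughly what a dump receiver would have to do under the proposal as written: scan the flat stream for an in-band NFXTA_ERRNO. It is a minimal C sketch only; the function name xt_find_errno is hypothetical, and the stream is assumed to be in host byte order with 4-byte aligned attributes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NFXTA_ERRNO = 2 };  /* value from the attribute table */

/*
 * Look for the first NFXTA_ERRNO in a flat attribute stream.
 * Returns 1 and stores the code in *err if found, 0 if absent,
 * -1 if the stream is malformed.
 */
static int xt_find_errno(const unsigned char *p, size_t len, int32_t *err)
{
	while (len >= 4) {
		uint16_t nla_len, nla_type;
		size_t step;

		memcpy(&nla_len, p, sizeof(nla_len));
		memcpy(&nla_type, p + 2, sizeof(nla_type));
		if (nla_len < 4 || nla_len > len)
			return -1;              /* truncated or bogus header */
		if (nla_type == NFXTA_ERRNO) {
			if (nla_len != 8)
				return -1;      /* the layout fixes nla_len at 8 */
			memcpy(err, p + 4, sizeof(*err));
			return 1;
		}
		step = ((size_t)nla_len + 3) & ~(size_t)3;
		if (step > len)
			step = len;             /* last attribute may lack padding */
		p += step;
		len -= step;
	}
	return len == 0 ? 0 : -1;
}
```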
2.3 Match extension<sub:nfxta_match>
Invocation of a match is represented using the NFXTA_MATCH
attribute which starts a nest level. A match attribute must
contain two attributes:
* NFXTA_NAME: the name of the match extension
* NFXTA_DATA: data private to this instance of the extension
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_MATCH        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. name of the extension, e.g. "hashlimit"                       .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. e.g. struct xt_hashlimit_info                                 .
This is fine during some transition period, but Netlink protocols must
not encapsulate structures in the payload of their TLVs.
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
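The byte layout above can be sketched in C as follows. This mirrors the proposal as written, opaque NFXTA_DATA blob included. The helper names xt_put_attr and xt_put_match are hypothetical, and the numeric values for NFXTA_MATCH and NFXTA_DATA are assumptions, since the attribute table leaves them blank.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	NFXTA_STOP  = 1,   /* value from the attribute table */
	NFXTA_NAME  = 3,   /* value from the attribute table */
	NFXTA_MATCH = 12,  /* assumed: the table leaves this value blank */
	NFXTA_DATA  = 13,  /* assumed */
};

/*
 * Append one attribute (4-byte header + payload, padded to 4 bytes) at
 * offset off. Returns the new offset, or 0 if it does not fit.
 */
static size_t xt_put_attr(unsigned char *buf, size_t cap, size_t off,
			  uint16_t type, const void *data, size_t dlen)
{
	size_t total, padded;
	uint16_t nla_len;

	if (dlen > UINT16_MAX - 4u)
		return 0;                       /* nla_len is only 16 bits wide */
	total = 4 + dlen;
	padded = (total + 3) & ~(size_t)3;
	if (off > cap || padded > cap - off)
		return 0;
	nla_len = (uint16_t)total;
	memcpy(buf + off, &nla_len, sizeof(nla_len));
	memcpy(buf + off + 2, &type, sizeof(type));
	if (dlen)
		memcpy(buf + off + 4, data, dlen);
	memset(buf + off + total, 0, padded - total);  /* zero the padding */
	return off + padded;
}

/* Emit NFXTA_MATCH, NFXTA_NAME, NFXTA_DATA, NFXTA_STOP in that order. */
static size_t xt_put_match(unsigned char *buf, size_t cap, size_t off,
			   const char *name, const void *data, size_t dlen)
{
	off = xt_put_attr(buf, cap, off, NFXTA_MATCH, NULL, 0);
	if (off)
		off = xt_put_attr(buf, cap, off, NFXTA_NAME, name, strlen(name) + 1);
	if (off)
		off = xt_put_attr(buf, cap, off, NFXTA_DATA, data, dlen);
	if (off)
		off = xt_put_attr(buf, cap, off, NFXTA_STOP, NULL, 0);
	return off;
}
```

A target would be encoded the same way, with NFXTA_TARGET as the opening attribute.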
2.4 Target extension<sub:nfxta_target>
Invocation of a target is represented using the NFXTA_TARGET
attribute which starts a nest level. A target attribute must
contain two attributes:
* NFXTA_NAME: the name of the target extension
* NFXTA_DATA: data private to this instance of the extension
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_TARGET       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. name of the extension, e.g. "TCPMSS"                          .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4 + payload         | nla_type = NFXTA_DATA         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. e.g. struct xt_tcpmss_info                                    .
Same comment as above.
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
2.5 Rule<sub:nfxta_rule>
A rule is started using the NFXTA_RULE attribute, which starts a
nest level, and is ended with an NFXTA_STOP attribute. Rules can
contain:
* Zero or more match extensions (NFXTA_MATCH..NFXTA_STOP).
* Zero or more target extensions (NFXTA_TARGET..NFXTA_STOP).
* Zero or one NFXTA_VERDICT attribute that specifies the rule's
  verdict as data, which can either be NF_ACCEPT or NF_DROP.
  (Non-normative notes: The supplied verdict is executed if no
  target has reached a verdict on its own. Omission of the
  verdict attribute counts as XT_CONTINUE.)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_RULE         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. matches, targets, verdict                                     .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
2.6 Chain<sub:nfxta_chain>
A chain is started using the NFXTA_CHAIN attribute, which starts
a nest level, and is ended with an NFXTA_STOP attribute. Chains
can contain:
* Zero or one of this group of three (= specify all three, or
  none at all), specifying that this chain is a base chain
  hooking in at some point:
  - One NFXTA_HOOKNUM attribute for giving a hook number. This is
    (unfortunately) dependent on the chosen nfproto, so it is
    either NF_INET_*, NF_BR_* or NF_ARP_*.
  - One NFXTA_PRIORITY attribute.
  - One NFXTA_NFPROTO attribute that is NFPROTO_*.
* Zero or more rules (NFXTA_RULE..NFXTA_STOP).
Example of a fully populated chain:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_CHAIN        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_HOOKNUM      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| hook number (0..7)                                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_PRIORITY     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| priority (-2147483648..2147483647)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 8                   | nla_type = NFXTA_NFPROTO      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nfproto value (2=ipv4, 3=arp, 7=bridge, 10=ipv6, 12=decnet)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. rules                                                         .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = 4                   | nla_type = NFXTA_STOP         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
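The "all three or none" rule for the base-chain group is easy to get wrong on the receiving side, so here is a minimal C sketch of the check. The struct and function names are hypothetical; a real parser would set the flags while walking the attributes inside one NFXTA_CHAIN.

```c
#include <stdbool.h>

/* Which of the three base-chain attributes were seen inside one NFXTA_CHAIN. */
struct xt_chain_hook_seen {
	bool hooknum;   /* NFXTA_HOOKNUM present */
	bool priority;  /* NFXTA_PRIORITY present */
	bool nfproto;   /* NFXTA_NFPROTO present */
};

/* The group is valid only if all three attributes are present or all absent. */
static bool xt_hook_group_valid(const struct xt_chain_hook_seen *s)
{
	int n = s->hooknum + s->priority + s->nfproto;

	return n == 0 || n == 3;
}

/* A chain is a base chain exactly when the complete group is present. */
static bool xt_is_base_chain(const struct xt_chain_hook_seen *s)
{
	return s->hooknum && s->priority && s->nfproto;
}
```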
3 Message types
3.1 NFXTM_IDENTIFY: Identification
First and foremost a debug command. And to get something
(table/chain-independent) that users can glare at (they love
doing that).
Request:
* nlmsg_type = NFXTM_IDENTIFY;
Response:
* An NFXTA_NAME attribute contains the name and version
  of the implementation/patchset.
* Zero or more attributes of type NFXTA_MATCH, terminated by
  NFXTA_STOP, giving meta information about the loaded match
  extensions. Per available match, a group of three attributes
  follows:
  - One NFXTA_NAME attribute for the name of the extension
  - One NFXTA_REVISION attribute to denote the version of the
    extension's parameter protocol
  - One NFXTA_SIZE attribute for the size of its per-instance
    data block
We can avoid this if structures are split into several TLVs. You can
add new attributes and obsolete old ones.
* Zero or more attributes of type NFXTA_TARGET, terminated by
  NFXTA_STOP, giving meta information about the loaded and
  available target extensions:
  - same attributes as with NFXTA_MATCH above
3.2 NFXTM_CHAIN_NEW: Create new chain
Request:
* nlmsg_type = NFXTM_CHAIN_NEW;
* NFXTA_NAME attribute carrying the name of the new chain.
* Zero or one of this group of three:
  - NFXTA_HOOKNUM
  - NFXTA_PRIORITY
  - NFXTA_NFPROTO
Response:
* Standard ACK.
Remarks:
Right now, a chain can only be promoted to a base chain during
creation (as far as the userspace view goes; when the kernel
exactly installs the nf_hook_ops is not of concern to userspace),
and it can only be demoted by deleting it. Should a
NFXTM_CHAIN_PROMOTE be split off the NFXTM_CHAIN_NEW
functionality?
3.3 NFXTM_CHAIN_DEL: Delete a chain
Request:
* nlmsg_type = NFXTM_CHAIN_DEL;
* NFXTA_NAME attribute carrying the name of the chain to delete
Response:
* Standard ACK.
3.4 NFXTM_CHAIN_MOVE: Rename a chain
Request:
* nlmsg_type = NFXTM_CHAIN_MOVE;
* Two NFXTA_NAME attributes (order is important):
  - First one specifies the current name of the chain
  - Second one specifies the new name of the chain
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_len = at least 24                                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_type = NFXTM_CHAIN_MOVE | nlmsg_flags = NLM_F_REQUEST   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_seq = whatever                                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nlmsg_pid = whatever                                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = at least 4          | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. old name                                                      .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| nla_len = at least 4          | nla_type = NFXTA_NAME         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
. new name                                                      .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3.5 NFXTM_CHAIN_DUMP: Chain dump
Request:
* nlmsg_type = NFXTM_CHAIN_DUMP;
* NFXTA_NAME attribute specifying the name of the chain
  to dump
Response:
* Zero or one of this group of three:
  - NFXTA_HOOKNUM, NFXTA_PRIORITY, NFXTA_NFPROTO.
* Zero or more NFXTA_RULE attributes as per section 2.5.
Errors:
* If an error occurs during dump, an NFXTA_ERRNO attribute is
  emitted into the stream and the dump will immediately terminate
  with a standard NLMSG_DONE message. No NFXTA_STOP attributes
  will be emitted if the dump stopped in the middle of a nesting
  level.
3.6 NFXTM_TABLE_DUMP: Table dump
Returns an atomic snapshot of the table.
Request:
* nlmsg_type = NFXTM_TABLE_DUMP;
Response:
* Zero or more NFXTA_CHAIN attributes as described in
  section 2.6.
3.7 NFXTM_CHAIN_SPLICE: Add/delete rules
The NFXTM_CHAIN_SPLICE request does a bulk deletion of zero or
more consecutive rules, followed by a bulk insertion of zero or
more consecutive rules, all done in an atomic fashion. It
operates similarly to Perl's splice function on arrays. The request
message needs to have at least the first three attributes.
Request:
* NFXTA_NAME: Name of the chain to modify.
* NFXTA_OFFSET: Index of entry where operation should
  start.
* NFXTA_LENGTH: Number of entries starting from
  offset that should be removed. May be zero or more.
* Zero or more NFXTA_RULE as per section 2.5.
Response:
* Standard ACK.
* Desired: detailed error code and origin of error (result of
  running ->check in extensions)
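The offset/length/insert semantics can be sketched in C on a plain array, with ints standing in for rules. This is only an illustration of the splice operation itself: the function name xt_splice is hypothetical, and the atomicity that the request promises is not modelled.

```c
#include <stddef.h>
#include <string.h>

/*
 * Remove `length` entries starting at `offset`, then insert `n_ins`
 * entries at the same position. `count` is the current number of entries
 * and `cap` the capacity of arr. Returns the new entry count, or
 * (size_t)-1 if the arguments are out of range or the result would not fit.
 */
static size_t xt_splice(int *arr, size_t count, size_t cap,
			size_t offset, size_t length,
			const int *ins, size_t n_ins)
{
	size_t tail;

	if (offset > count || length > count - offset)
		return (size_t)-1;
	if (n_ins > cap || count - length > cap - n_ins)
		return (size_t)-1;
	tail = count - offset - length;         /* entries after the removed span */
	memmove(arr + offset + n_ins, arr + offset + length, tail * sizeof(*arr));
	if (n_ins)
		memcpy(arr + offset, ins, n_ins * sizeof(*arr));
	return count - length + n_ins;
}
```

With offset equal to the current count and length 0 this appends; with no rules to insert it is a pure bulk delete.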
3.8 NFXTM_TABLE_REPLACE
Atomic exchange of an entire table.
Request:
* nlmsg_type = NFXTM_TABLE_REPLACE;
* Zero or more NFXTA_CHAIN attributes as per section 2.6.
Response:
* Standard ACK.
* Desired: detailed error code and origin of error (result of
  running ->check in extensions)
That's all for now. Quite exhaustive, thanks.