Re: Xtables2 A7 spec draft

James Nurmi <jdnurmi@xxxxxx> · Mon, 7 Feb 2011 12:50:40 -0800

(inline)

Comments are made as the maintainer of GoNetlink, a Netlink library
for a 'not-C' language; disregard as desired.

On Wed, Feb 2, 2011 at 2:04 PM, Jan Engelhardt <jengelh@xxxxxxxxxx> wrote:
>
>
> I am posting the Xtables2 Netlink interface specification, draft 7
> for comments.
>
> Additionally, further documentation and toolchain around
> it is available through the project page at
>
>        http://jengelh.medozas.de/projects/xtables/
>
>  * User Documentation Chapter 1: Architectural Differences
>  * Developer Documentation Part 1: Netlink interface (WIP)
>   This is copied below to facilitate inline replies
> --8<--
>
> Netlink interface
>
> 1 Concepts
>
> This section is non-normative and should instead show the flow of
> thought and give reasons as to why the specification was
> conceived the way it is, and where the component problems are.
>
> 1.1 Nesting representation
>
> The common element in Xtables is the ruleset, represented as a
> tree structure with ordering constraints at some levels:
>
> ruleset (unordered tables)
>  \__ table (unordered chains)
>  |    \__ chain (ordered rules)
>  |    |    \__ rule (ordered actions)
>  |    |    |    \__ match (unordered data)
>  |    |    |    |    \__ config-data
>  |    |    |    |    |    \__ bin params
>  |    |    |    |    \__ state-data
>  |    |    |    |         \__ nlattrs
>  |    |    |    \__ match...
>  |    |    |    \__ target (unordered data)
>  |    |    |    |    \__ config-data
>  |    |    |    \__ target...
>  |    |    |    \__ verdict...
>  |    |    \__ rule...
>  |    \__ chain...
>  \__ table...
>
> A more concrete example, here is a small ruleset, encoded into
> XML (just one of many possible representations):
>
> <table>
>  <chain name="INPUT">
>    <rule idx="1">
>      <match acidx="1" name="hashlimit" rev="1" csize="120">
>        <config-data>...</config-data>
>        <state-data>...</state-data>
>      </match>
>      <target acidx="2" name="TOS" rev="1">
>        ...
>      </target>
>      <verdict acidx="3" name="ACCEPT" />
>    </rule>
>  </chain>
> </table>
>
> There are different ways to encode such a tree structure into a
> serialized stream. In many Netlink protocols, children attributes
> are encapsulated (a. k. a. “nested”, though we will avoid this
> term to avoid double-use) and treated as a whole as a parent's
> opaque data. It cannot be told apart from normal data. (Like
> writing “<chain> &lt;rule&gt; ... &lt;/rule&gt; </chain>” in
> XML.) We will call this format “Encapsulated Encoding”.
>
> To encode an attribute's length, struct nlattr only has a 16-bit
> field, which means the attribute header plus payload is limited
> to 64 KB. This is easily exceeded with the encapsulated
> encoding, since a chain is a collection of rules, for example.
> The problem is aggravated by the kernel's Netlink handler only
> allocating sk_buffs a page size worth, which leaves little room for
> extension data. In the worst case, the usable payload for
> attributes is around 3600 bytes only. In light of xt_u32's
> private data block being 1984 bytes already, that means that you
> won't be able to fit two -m u32 invocations nested in a single
> rule into a dump.
>
> Certain voices in the community call for obsoleting such
> data blobs and replacing them with Netlink attributes; there are no
> objections to doing so. However, the problem of size-limited
> sk_buffs applies to opaque data of any kind, and Netlink
> attributes fall within that.

I'm all for opaque data blobs where the user is not expected to
understand the data underneath (like FILE handles), but only so far as
they can be safely serialized to other processes for collection of
additional data (no pointers, and only TLV-styled abstractions).
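
To make the TLV point concrete, here is a minimal Go sketch of how a
single attribute is laid out, including the 16-bit length ceiling the
draft mentions. The helper names (appendAttr, nlaAlign) are made up for
illustration and are not GoNetlink's API; little-endian byte order is
an assumption, since Netlink uses host byte order.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	nlaHdrLen = 4      // struct nlattr: 16-bit nla_len + 16-bit nla_type
	nlaMaxLen = 0xFFFF // largest value the nla_len field can hold
)

// nlaAlign rounds n up to the 4-byte attribute alignment.
func nlaAlign(n int) int { return (n + 3) &^ 3 }

// appendAttr appends one TLV attribute to buf: header, payload, padding.
// Assumption: little-endian host.
func appendAttr(buf []byte, typ uint16, payload []byte) ([]byte, error) {
	total := nlaHdrLen + len(payload)
	if total > nlaMaxLen {
		return buf, errors.New("attribute does not fit the 16-bit nla_len field")
	}
	var hdr [nlaHdrLen]byte
	binary.LittleEndian.PutUint16(hdr[0:2], uint16(total))
	binary.LittleEndian.PutUint16(hdr[2:4], typ)
	buf = append(buf, hdr[:]...)
	buf = append(buf, payload...)
	// Pad to the next 4-byte boundary; padding is not counted in nla_len.
	for i := total; i < nlaAlign(total); i++ {
		buf = append(buf, 0)
	}
	return buf, nil
}

func main() {
	// Attribute type 1 is a placeholder, not a value from the draft.
	b, err := appendAttr(nil, 1, []byte("INPUT"))
	fmt.Println(len(b), err)

	// A 64 KB payload cannot be expressed in a single attribute.
	_, err = appendAttr(nil, 1, make([]byte, 65536))
	fmt.Println(err)
}
```

Nothing in such a blob needs a pointer, so it can be handed to another
process as-is.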

>
> The Xtables2 Netlink protocol encodes each node of information as
> a standalone attribute, to be called Flat Encoding, that is
> appended (a. k. a. “chained”) to the data stream. By avoiding
> encapsulated attributes, it is possible to split messages at much
> finer levels, and provides for attributes that happen to use
> opaque data with a maximally-sized buffer.
>
> 1.2 Nest markers<sub:Nest-markers>
>
> Since Netlink messages do have a 32-bit quantity to store the
> message length, rulesets of roughly up to 4 GB are possible,
> which is currently regarded as sufficient. The largest (while
> still being meaningful) rulesets seen to date in the industry
> weighed in at approximately 150 MB.

While managing tables/rules/etc atomically should be priority #1, I'm
not certain if optimizing the protocol for this makes a lot of sense
for either the user or kernel contexts.

>
> Whereas encapsulated attribute encoding automatically provided
> for boundaries, this is realized using dummy attributes in the
> chained approach. The start of a nesting level can be implicitly
> represented by the presence of the attribute that would have
> otherwise been used for encapsulated nesting. For declaring an
> end of a nest level, an extra attribute is needed:
>
> • “chain { rule; rule; ... }” <=> CHAIN RULE RULE ...
>  STOP
>
> 1.3 Attribute limitations in nfnetlink
>
> Netlink, being just a base protocol, does not specify what comes
> after the nlmsghdr, or how it is ordered. This is left up to the
> subprotocols based on Netlink. nfnetlink has two effective
> shortcomings (due to its parser) that shall be held in mind:
>
> • Attribute ordering is ignored and lost

(GoNetlink doesn't adhere to this belief; I didn't realize there was
any standardization of this approach outside of the libnfnetlink
implementation, and so assumed I'd be screwed if I followed it.)

>
> • No support for more than one attribute with the same type
>  within a message

ditto
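
For what it's worth, keeping order and duplicates costs a parser very
little. The following is a rough Go sketch, not GoNetlink's actual
code: it walks the attribute stream into a slice instead of a
type-indexed table. The Attr type, parseAttrs, and the attribute type
numbers are invented for the example, and little-endian byte order is
assumed.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Attr is one decoded attribute; a slice of these keeps wire order.
type Attr struct {
	Type uint16
	Data []byte
}

// parseAttrs walks a chained attribute stream front to back.
// Duplicated types are kept, unlike with a parser that indexes by type.
func parseAttrs(b []byte) ([]Attr, error) {
	var out []Attr
	for len(b) > 0 {
		if len(b) < 4 {
			return nil, errors.New("truncated attribute header")
		}
		l := int(binary.LittleEndian.Uint16(b[0:2]))
		t := binary.LittleEndian.Uint16(b[2:4])
		if l < 4 || l > len(b) {
			return nil, errors.New("bad attribute length")
		}
		out = append(out, Attr{Type: t, Data: b[4:l]})
		// Skip the payload plus its alignment padding.
		adv := (l + 3) &^ 3
		if adv > len(b) {
			adv = len(b)
		}
		b = b[adv:]
	}
	return out, nil
}

func main() {
	// Two attributes of type 2 followed by one of type 1.
	msg := []byte{
		5, 0, 2, 0, 'a', 0, 0, 0,
		5, 0, 2, 0, 'b', 0, 0, 0,
		4, 0, 1, 0,
	}
	attrs, err := parseAttrs(msg)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for i, a := range attrs {
		fmt.Printf("%d: type=%d data=%q\n", i, a.Type, a.Data)
	}
}
```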

> 1.4 Summary of transform<sub:Summary-of-transform>
>
> Essentially there is a 1:1 transform on the XML-like tree shown
> above, to:
>
> NFXTM_CHAIN_ENTRY<name=INPUT,usertid=1>
>  NFXTM_RULE_ENTRY<idx=1,usertid=1>
>    NFXTM_MATCH_ENTRY<acidx=1,name=hashlimit,rev=1,usertid=2>
>      NFXTM_CONFIG_DATA
>        NFXTM_ARB_DATA<whatever>
>        NFXTM_ARB_DATA<more arbitrary data>
>      NFXTM_STOP
>      NFXTM_STATE_DATA
>        NFXTM_ATTR_DATA<nlattrs>
>        NFXTM_ATTR_DATA<more nlattrs>
>      NFXTM_STOP
>    NFXTM_STOP
>    NFXTM_TARGET_ENTRY<acidx=2,name=TOS,rev=0,usertid=3>
>      ...
>    NFXTM_STOP
>    NFXTM_VERDICT_ENTRY<acidx=3,name=ACCEPT,usertid=3>
>    NFXTM_STOP
>  NFXTM_STOP
> NFXTM_STOP
>
> 1.5 Extra sequence numbers<sub:Extra-sequence-numbers>
>
> Netlink also does not specify any message ordering, though it
> does provide an nlmsg_seq field with which message order can at
> least be determined. The problem is that nothing specifies what
> nlmsg_seq should be in reply messages. It is assumed that the
> sequence number is linked, i. e. that a reply's number should be
> the same as the request's number, to do message matching (vague
> hint by netlink(7) manpage).

RFC 3549 (2.3.2.1) seems to support you in that the usage of sequence
numbers is undefined. My experience has been to expect the response to
match the request and dispatch accordingly. Since that's the 'norm',
and netlink shouldn't ever fail, I'd actually rather see the protocol
use an NLM_F_MULTI/NLM_F_ATOMIC pair, with an internal
sequence/timestamp for clients that really need an atomic state.
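
The "match and dispatch" pattern I mean looks roughly like this in Go.
It is a simplified sketch rather than GoNetlink's real code: Reply and
Dispatcher are hypothetical types, and the whole thing relies on the
(unspecified) assumption that the kernel echoes the request's
nlmsg_seq in every reply.

```go
package main

import "fmt"

// Reply is a hypothetical, already-decoded Netlink reply.
type Reply struct {
	Seq     uint32 // nlmsg_seq copied from the reply's nlmsghdr
	Payload string
}

// Dispatcher maps outstanding sequence numbers to reply handlers.
// It is not safe for concurrent use; a real client would add locking.
type Dispatcher struct {
	nextSeq uint32
	pending map[uint32]func(Reply)
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{pending: make(map[uint32]func(Reply))}
}

// Register reserves a fresh sequence number and remembers its handler.
func (d *Dispatcher) Register(h func(Reply)) uint32 {
	d.nextSeq++
	d.pending[d.nextSeq] = h
	return d.nextSeq
}

// Deliver routes a reply to the handler registered for its number.
// It reports false when no outstanding request matches.
func (d *Dispatcher) Deliver(r Reply) bool {
	h, ok := d.pending[r.Seq]
	if !ok {
		return false
	}
	h(r)
	return true
}

// Done forgets a request, e.g. once the end of a multipart reply is seen.
func (d *Dispatcher) Done(seq uint32) { delete(d.pending, seq) }

func main() {
	d := NewDispatcher()
	seq := d.Register(func(r Reply) { fmt.Println("got:", r.Payload) })

	// Every part of a multipart reply carries the same number.
	d.Deliver(Reply{Seq: seq, Payload: "part 1"})
	d.Deliver(Reply{Seq: seq, Payload: "part 2"})
	d.Done(seq)

	fmt.Println(d.Deliver(Reply{Seq: seq, Payload: "late"}))
}
```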

>
> Even if that were decidedly so, that brings along a problem. In
> NLM_F_MULTI-style dumps, all messages would have the same
> nlmsg_seq. To counter this, multi messages will have an
> NFXT-specific sequence counter (NFXTA_SEQNO) in addition,
> especially since ordering is so much more crucial in Xtables than
> it is in other parts of networking.

That, to me, is fine -- netlink is an encapsulation from my view, and
MULTI is the right way to do long messages.
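
On the client side, using such a counter could be as small as the Go
sketch below. Part and orderParts are hypothetical names, and the idea
that NFXTA_SEQNO increases by exactly one per message is my assumption;
the draft text quoted here does not spell that out.

```go
package main

import (
	"fmt"
	"sort"
)

// Part is a hypothetical decoded piece of an NLM_F_MULTI dump.
type Part struct {
	Seqno uint32 // value of the message's NFXTA_SEQNO attribute
	Body  string
}

// orderParts returns a copy of parts sorted by Seqno and reports whether
// the numbers are consecutive (assumed: the counter steps by one).
func orderParts(parts []Part) ([]Part, bool) {
	sorted := append([]Part(nil), parts...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Seqno < sorted[j].Seqno
	})
	for i := 1; i < len(sorted); i++ {
		if sorted[i].Seqno != sorted[i-1].Seqno+1 {
			return sorted, false
		}
	}
	return sorted, true
}

func main() {
	// All three parts would share one nlmsg_seq; only Seqno tells them apart.
	parts := []Part{{2, "rule 3"}, {0, "rule 1"}, {1, "rule 2"}}
	sorted, complete := orderParts(parts)
	for _, p := range sorted {
		fmt.Println(p.Seqno, p.Body)
	}
	fmt.Println("complete:", complete)
}
```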

>
> 1.6 Improved granularity error reporting
>...

As a non-C implementation, I'd prefer constant error (class) with
flags (bitfield), but expect to rewrite a lot of constants anyhow.

> 1.7 Multi-type responses
> ...

Most RTNetlink protocols (which I imagine will have a similar user
base) make assumptions about the response type based on the query
type. In Go, for example, there are no generics, so re-decomposing a
response becomes expensive.

Personally, I would prefer that responses be limited solely to the
query I provided or an error, not something with multiple (possibly
confounding?) types.

> ...
> 3 Attributes
>
> The meaning of attributes depends upon the message and logical
> nesting level in which they appear. Their type however remains
> the same, such that a single Netlink attribute validation policy
> object (struct nla_policy) can be used for all message types.
>
> A table of all known attributes:
>
>
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> | Value  | Mnemonic          |    C type     | NLA type        | Notes                                |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |   1    | NFXTA_SEQNO       | unsigned int  | NLA_U32         | Section [sub:Extra-sequence-numbers] |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |  tba   | NFXTA_ERRNO       |     int       | NLA_U32         | Generic system errno (Exxx)          |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |  ...   | NFXTA_XTERRNO     |     int       | NLA_U32         | NFXT errno (NFXTE_*)                 |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_ERRSTR      |   char []     | NLA_NUL_STRING  | Arbitrary                            |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_USERTID     | unsigned int  | NLA_U32         | Arbitrary, retained verbatim         |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_CHAIN_NAME  |   char []     | NLA_NUL_STRING  |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_RULE_IDX    | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_ACTION_IDX  | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_NAME        |   char []     | NLA_NUL_STRING  |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_REVISION    |   uint8_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_HOOKNUM     | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_PRIORITY    |     int       | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_NFPROTO     |   uint8_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_OFFSET      | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_LENGTH      |    size_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_HOOKMASK    | unsigned int  | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_SIZE        |    size_t     | NLA_U32         |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+
> |        | NFXTA_NEW_NAME    |   char []     | NLA_NUL_STRING  |                                      |
> +--------+-------------------+---------------+-----------------+--------------------------------------+

W/r/t the NUL_STRINGs: is there a good reason to use NUL-terminated
strings for NAME etc., given the length is known? Wouldn't it make more
sense to simply require a byte string and apply the NUL internally? I
see this frequently in Netlink, and imagine it's a kernel consistency
thing?
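
For context, this is the extra step a non-C client ends up writing for
every string attribute. It is a small Go sketch with made-up helper
names (encodeName, decodeName). As far as I understand the kernel's
policy check, an NLA_NUL_STRING payload without a NUL byte is rejected,
so the terminator cannot simply be left off.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strings"
)

// encodeName turns a Go string into an NLA_NUL_STRING payload.
func encodeName(s string) ([]byte, error) {
	if strings.IndexByte(s, 0) >= 0 {
		return nil, errors.New("name contains an embedded NUL")
	}
	return append([]byte(s), 0), nil
}

// decodeName reads an NLA_NUL_STRING payload up to its first NUL.
func decodeName(p []byte) (string, error) {
	i := bytes.IndexByte(p, 0)
	if i < 0 {
		return "", errors.New("payload has no NUL terminator")
	}
	return string(p[:i]), nil
}

func main() {
	p, err := encodeName("INPUT")
	fmt.Println(len(p), err)

	s, err := decodeName(p)
	fmt.Println(s, err)
}
```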

>
>
> The kernel ignores attributes with value 0 during validation, so
> it was left unused.
>
> 4 Error types<sec:Error-types>
>
>
> +--------+---------------------+-------------------------------------------+
> | Value  | Mnemonic            | Description                               |
> +--------+---------------------+-------------------------------------------+
> +--------+---------------------+-------------------------------------------+
> |   0    | NFXTE_SUCCESS       | No error                                  |
> +--------+---------------------+-------------------------------------------+
> |   1    | NFXTE_CHAIN_EXIST   | Chain already exists                      |
> +--------+---------------------+-------------------------------------------+
> |   2    | NFXTE_CHAIN_NOENT   | Chain does not exist                      |
> +--------+---------------------+-------------------------------------------+
> |   3    | NFXTE_RULESET_LOOP  | Ruleset contains a loop                   |
> +--------+---------------------+-------------------------------------------+
> |   4    | NFXTE_EXT_HOOKMASK  | Rule invoked from incompatible hook       |
> +--------+---------------------+-------------------------------------------+
> |        | NFXTE_PROMO_STATUS  | Promotion/demotion state already achieved |
> +--------+---------------------+-------------------------------------------+
>
>
> 5 Message types
> ...

My biggest concern here, as already pointed out, is the use of STOP
and deep nesting in messages. Every time a STOP occurs in an
internal message, it's semantically equivalent to the completion of an
NLM_F_MULTI, no?
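
To show what a receiver has to do with the STOP markers, here is a Go
sketch that rebuilds the tree from the flat stream in section 1.4 using
a stack. The Node type, buildTree, and the set of level-opening names
are my reading of that example, not something the draft defines.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is one level of the rebuilt ruleset tree.
type Node struct {
	Name     string
	Children []*Node
}

// opens lists the entries assumed to start a nesting level.
var opens = map[string]bool{
	"NFXTM_CHAIN_ENTRY": true, "NFXTM_RULE_ENTRY": true,
	"NFXTM_MATCH_ENTRY": true, "NFXTM_TARGET_ENTRY": true,
	"NFXTM_VERDICT_ENTRY": true, "NFXTM_CONFIG_DATA": true,
	"NFXTM_STATE_DATA": true,
}

// buildTree turns a flat stream with STOP markers back into a tree.
func buildTree(stream []string) (*Node, error) {
	root := &Node{Name: "ruleset"}
	stack := []*Node{root}
	for _, name := range stream {
		top := stack[len(stack)-1]
		switch {
		case name == "NFXTM_STOP":
			if len(stack) == 1 {
				return nil, errors.New("STOP without an open level")
			}
			stack = stack[:len(stack)-1] // close the innermost level
		case opens[name]:
			n := &Node{Name: name}
			top.Children = append(top.Children, n)
			stack = append(stack, n) // descend into the new level
		default:
			top.Children = append(top.Children, &Node{Name: name}) // leaf
		}
	}
	if len(stack) != 1 {
		return nil, errors.New("stream ended with unclosed levels")
	}
	return root, nil
}

func main() {
	tree, err := buildTree([]string{
		"NFXTM_CHAIN_ENTRY", "NFXTM_RULE_ENTRY",
		"NFXTM_MATCH_ENTRY", "NFXTM_CONFIG_DATA", "NFXTM_ARB_DATA",
		"NFXTM_STOP", "NFXTM_STOP",
		"NFXTM_VERDICT_ENTRY", "NFXTM_STOP",
		"NFXTM_STOP", "NFXTM_STOP",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	rule := tree.Children[0].Children[0]
	fmt.Println(rule.Name, "has", len(rule.Children), "actions")
}
```

A single lost or misplaced STOP shifts every later entry to the wrong
level, which is part of why the deep nesting worries me.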

I see the advantage of a trivial protocol, but wouldn't it be much
simpler to have a 'bigger' protocol (table/chain/rule) with an
optional ATOMIC guarantee?

I don't see anywhere else guaranteeing tables/matches/rules will be
managed (as a set) with atomicity [I'm probably wrong], so doing it in
the protocol feels awkward.

There are a LOT of definitions of atomicity, ordering, etc. within this
area, making me feel like doing that 'up one level' and in smaller
pieces might make for a more manageable interface.

Still, this all looks like phenomenal progress, and I look forward to
seeing it move on.

James