Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote: > From: Kiran Kella <kiran.kella@xxxxxxxxxxxx> > > Changes done in the kernel to ensure 3-tuple uniqueness of the conntrack > entries for the fullcone nat functionality. > > * Hashlist is maintained for the 3-tuple unique keys (Protocol/Source > IP/Port) for all the conntrack entries. > > * When NAT table rules are created with the fullcone option, the > SNAT/POSTROUTING stage ensures the ports from the pool are picked up in > such a way that the 3-tuple is uniquely assigned. > > * In the DNAT/POSTROUTING stage, the fullcone behavior is ensured by checking > and reusing the 3-tuple for the Source IP/Port in the original direction. > > * When the pool is exhausted of the 3-tuple assignments, the packets are > dropped, else, they will be going out of the router they being 5-tuple > unique (which is not intended). > > * Passing fullcone option using iptables is part of another PR (in > sonic-buildimage repo). These are way too many changes for a single patch. Please consider splitting this ino multiple chunks, e.g. at least separate functional fullcone from the boilerplate changes. > The kernel changes mentioned above are done to counter the challenges > explained in the section *3.4.2.1 Handling NAT model mismatch between > the ASIC and the Kernel* in the NAT HLD [1]. And please add the relevant explanations from this > [1]: https://github.com/kirankella/SONiC/blob/nat_doc_changes/doc/nat/nat_design_spec.md ... to the commit message. > This is taken from switch network operating system (NOS) SONiC’s Linux > repository, where the support was added in September 2019 [1], and > forwarded ported to Linux 4.19 by Akhilesh in June 2020 [2]. > I am sending it upstream as a request for comments, before effort > is put into forward porting it to Linux master. I don't see any huge problems from a technical pov. But I don't see why this functionality is needed from a pure SW point of view. AFAICS SONiC uses a propritary (or at least, custom) offload mechanism to place nat entries into HW. Netfilter already has a forwarding offload mechanism, described in Documentation/networking/nf_flowtable.rst , so I'm not sure it makes sense to accept this without patches to support the needed offload support as well. AFAIU passing fullcone makes no sense unless using offload HW that doesn't support the current nat port allocation scheme. And current kernel doesn't support any such HW. Nevertheless, some comments below. > +/* Is this 3-tuple already taken? (not by us)*/ > +int > +nf_nat_used_3_tuple(const struct nf_conntrack_tuple *tuple, > + const struct nf_conn *ignored_conntrack, > + enum nf_nat_manip_type maniptype); > + > */ > - void (*unique_tuple)(const struct nf_nat_l3proto *l3proto, > + int (*unique_tuple)(const struct nf_nat_l3proto *l3proto, > struct nf_conntrack_tuple *tuple, > const struct nf_nat_range2 *range, > enum nf_nat_manip_type maniptype, The above change should be done in a separate patch, so its in a isolated change. This will ease review of the fullcone part. > /* generate unique tuple ... */ > -static void > +static int > gre_unique_tuple(const struct nf_nat_l3proto *l3proto, > struct nf_conntrack_tuple *tuple, > const struct nf_nat_range2 *range, > @@ -52,7 +52,7 @@ gre_unique_tuple(const struct nf_nat_l3proto *l3proto, > /* If there is no master conntrack we are not PPTP, > do not change tuples */ > if (!ct->master) > - return; > + return 0; > > if (maniptype == NF_NAT_MANIP_SRC) > keyptr = &tuple->src.u.gre.key; > @@ -73,11 +73,11 @@ gre_unique_tuple(const struct nf_nat_l3proto *l3proto, > for (i = 0; ; ++key) { > *keyptr = htons(min + key % range_size); > if (++i == range_size || !nf_nat_used_tuple(tuple, ct)) > - return; > + return 1; > } I suggest you use 'bool' type for this rather than int, unless you plan to use errno codes here in some future change. > @@ -155,6 +156,31 @@ hash_by_src(const struct net *n, const struct nf_conntrack_tuple *tuple) > return reciprocal_scale(hash, nf_nat_htable_size); > } > > +static inline unsigned int > +hash_by_dst(const struct net *n, const struct nf_conntrack_tuple *tuple) please avoid inline keyword in .c files for new submissions. > +{ > + unsigned int hash; > + > + get_random_once(&nf_nat_hash_rnd, sizeof(nf_nat_hash_rnd)); get_random_once can't be called from multiple places for the same random value. [ I did not check if thats the case since patch doesn't apply to current nf ]. > +static inline int > +same_reply_dst(const struct nf_conn *ct, > + const struct nf_conntrack_tuple *tuple) > +{ > + const struct nf_conntrack_tuple *t; > + > + t = &ct->tuplehash[IP_CT_DIR_REPLY].tuple; > + return (t->dst.protonum == tuple->dst.protonum && > + nf_inet_addr_cmp(&t->dst.u3, &tuple->dst.u3) && > + t->dst.u.all == tuple->dst.u.all); > +} Please run patches through scripts/checkpatch.pl before submission, we try to avoid '()' in return (...); > +/* Only called for DST manip */ > +static int > +find_appropriate_dst(struct net *net, > + const struct nf_conntrack_zone *zone, > + const struct nf_nat_l3proto *l3proto, > + const struct nf_nat_l4proto *l4proto, > + const struct nf_conntrack_tuple *tuple, > + struct nf_conntrack_tuple *result) > +{ > + struct nf_conntrack_tuple reply; > + unsigned int h; > + const struct nf_conn *ct; Silly, but some maintainers prefer reverse-xmas-tree, i.e. struct nf_conntrack_tuple reply; const struct nf_conn *ct; unsigned int h; > @@ -327,8 +421,11 @@ get_unique_tuple(struct nf_conntrack_tuple *tuple, > const struct nf_conntrack_zone *zone; > const struct nf_nat_l3proto *l3proto; > const struct nf_nat_l4proto *l4proto; > + struct nf_nat_range2 nat_range; > struct net *net = nf_ct_net(ct); > > + memcpy(&nat_range, range, sizeof(struct nf_nat_range2)); nat_range = *range ? > @@ -1055,9 +1188,14 @@ static int __init nf_nat_init(void) > if (!nf_nat_bysource) > return -ENOMEM; > > + nf_nat_by_manip_src = nf_ct_alloc_hashtable(&nf_nat_htable_size, 0); > + if (!nf_nat_by_manip_src) > + return -ENOMEM; This lacks error unwind for nf_nat_bysource.