Re: [PATCH 1/3 nfnetlink_acct] numerous changes and improvements to the kernel code

Michael Zintakis <michael.zintakis@xxxxxxxxxxxxxx> · Sun, 14 Apr 2013 10:50:26 +0100

Hello Pablo,

Apologies once again for this late post...

Pablo Neira Ayuso wrote:
> Hi Michael,
> 
> On Thu, Apr 04, 2013 at 09:37:03PM +0100, Michael Zintakis wrote:
>> Hello Pablo,
>>
>> Pablo Neira Ayuso wrote:
>>>>> BTW, atomic is not required for those two fields, this is protected by
>>>>> the nfnl_lock.
>>>> I disagree with you entirely.
>>> At least you have to agree with me in that atomic is not required for
>>> those variables :)
>> I am not sure I do, sorry Pablo!
>>
>> If we demote the two variables from "atomic_t" and "atomic64_t" to
>> uint32_t and uint64_t respectively, then if thread A is in the
>> middle of changing 'bthr' for example, and thread B tries to access
>> it, there is nothing stopping both threads gaining access to that
>> variable since it isn't afforded the protection of the atomic
>> operations (put it simple there isn't atomic lock to prevent such
>> access), which will cause a mess.
> 
> Those variables are already protected by nfnl_lock in
> net/netfilter/nfnetlink.c.
Ah, I think I understand this now - since these two variables are modified only via the nfnetlink implementation (and not the kernel itself), then nfnl_lock() function comes into play since nfnl_nfacct_new is called by nfnetlinkc::nfnetlink_rcv_msg (where nfnl_lock is "active"), preventing another request (made possibly by another thread also via nfnetlink) to update these two variables in nfnetlink_acct.c. Is my assumption correct or do I have this wrong?

If it is correct, you are right that in such a case we don't really need the fmt and bthr variables to be atomic as no race condition can occur. Thank you for this clarification!

>> There is a similar case with the "refcnt" variable - it is atomic
>> and therefore protected, so I don't see if I change these 2
>> variables (bthr and fmt) to "normal" unsigned integers how (or what)
>> will protect the access to these variables and prevent what I just
>> described from happening...
>>
>>>> The two fields above aren't "meaningless" - they have a purpose and
>>>> are an integral part of the nfacct object. None of the fields in the
>>>> nfacct object are really used in the operation of the kernel itself
>>>> - the kernel merely updates these. To put it bluntly, the whole
>>>> nfacct struct is used as storage and the kernel don't use any of
>>>> these fields in its core operation. OK, I agree that packet and byte
>>>> counters are going to be updated much more frequently by the kernel
>>>> than the fmt and bthr variables, but the fact is that these,
>>>> strictly speaking, do not dependent for the kernel to operate. What
>>>> I've done above is not a unique approach - in the kernel code there
>>>> are a lot of examples like this, it is not something new and we are
>>>> not re-inventing the wheel here or breaking a new ground.
>>>>
>>>> Another reason for having fmt and bthr as part of the kernel nfacct
>>>> struct is that this data, alongside the bytes and packet counters,
>>>> is shielded and protected (in ring0), its integrity guaranteed and
>>>> can only be changed via the kernel itself and nothing else. This is
>>>> very important for us.
>>> Sorry, I don't think putting thing in the kernel makes things more
>>> secure.
>> These properties are not directly accessible or arbitrarily
>> modifiable (whether intentional or not) via userspace so in that
>> sense the information contained within is "shielded" or "protected"
>> by the kernel, which is what I meant with my previous posting
>> (sorry, my English is not 100%).
> 
> You can modify kernel information if you manage to load a module and
> have some little knowlege on where to find things in the memory space.
> Truly, I think you can achive this by protecting the configuration
> file from changes in your tree and some extra checksumming to follow
> the track of unexpected changes.
I understand that Pablo, but this is quite a bit far-fetched and, in my view, less practical (both in terms of performance as well as memory footprint) where compared to just using these 2 properties in the kernel where they are protected without the need to write extra code.

With placing these properties into the kernel they will also occupy less memory and the performance would be better, compared to if all this is done in user space (not to mention possible synchronization issues as the code in user space needs to "track" nfacct objects in the kernel and do that flawlessly).

As I already mention previously, these two properties form a logical part of the nfacct object - they are not some separate and completely independently detached "entities".

>>>> In the early stages of the nfacct changes, we considered a very
>>>> similar approach to what you have proposed above, but rejected it,
>>>> not least because the integrity and security of the data
>>>> encapsulated in the nfacct struct cannot be guaranteed otherwise. We
>>>> will rely heavily on this and don't need it to be changed
>>>> arbitrarily (either by "mistake" or intentionally), so we decided
>>>> that this is the best way going forward.
>>>>
>>>> In stage 2 (current plans are this to be completed by the end of q2
>>>> of this year) we will extend this and create a specifically designed
>>>> daemon (nfacctd), which will "talk" directly to the kernel and
>>>> retrieve accounting data (much like what nfacct executable currently
>>>> does, but on a bigger scale) on-demand and in real time and then
>>>> send it to a central point where such data will be collected from
>>>> all other machines for statistics and analysis. That daemon can't be
>>>> spending extra cpu cycles parsing files that may or may not exist or
>>>> check whether the data in them is or isn't reliable (whether it is
>>>> "changed" on purpose or not) - that really isn't good enough.
>>> I think that daemon you want to implement is ulogd2. There's already a
>>> plugin available for nfacct. Thus, you can skip parsing the
>>> configuration file over and over again. To dump existing stats, you
>>> can add a unix socket interface for this new extension you want to do.
>> OK, let me illustrate my point with a little pictogram. We have two
>> separate possible scenarios:
>>
>> Scenario 1:
>> Request Station (RS) <=> Internal (trusted) network <=> Home Station (HS) - where nfacctd daemon and friends is located
>>
>> Scenario 2:
>> Request Station <=> Public (untrusted) network <=> Home Station
>>
>> In the above scenarios, the "Request Station" can be two different
>> types: a) a central server gathering statistics for all registered
>> machines on the network; or b) a separate work station running
>> nfacct.
>>
>> In case a) above, what usually happens is that the RS connects to
>> nfacctd on all registered machines using tcp connection and tells
>> them "I want traffic stats from you every X seconds" (without this
>> initial command the nfacctd daemon is idle and simply listens on the
>> pre-defined tcp port).
>>
>> Once this is done, the nfacctd daemon start sending those stats to
>> RS and gets an acknowledgment. If that doesn't happen, say, 3 times
>> in a row - because the RS may be inaccessible or for other reasons,
>> then the daemon stores these stats in memory if permissible up to a
>> point for later transmission, while regularly trying to connect to
>> the RS.
>>
>> Periodically, particularly during "peak" times when we expect
>> traffic to be high, the central server on RS can send a command to
>> the nfacctd daemons to throttle up the interval at which these
>> statistics are sent. Similarly, during "slow-traffic" times, the
>> opposite may happen and this interval can be extended.
>>
>> In addition, the RS needs to be able to request the current traffic
>> stats for a particular nfacct object (similar to "nfacct get"), as
>> well as be able to change the threshold and format properties of an
>> individual nfacct object on the HS.
>>
>> Another possible type of "Request Station" we need to implement
>> (this would require extending the nfacct executable capabilities
>> once again) is a RS requesting full or partial traffic statistics by
>> querying nfacctd on a particular Home Station using the nfacct
>> executable. In other words, be able to do this:
>>
>> nfacct list [host <HS_HOST>] [port <HS_PORT>]
>> nfacct get <object-name> [host <HS_HOST>] [port <HS_PORT>]
>> nfacct add <object-name> replace [format FMT_SPEC] [threshold THRESHOLD] [host <HS_HOST>] [port <HS_PORT>]
>>
>> In other words, nfacct from the RS should be able to connect to
>> nfacctd daemon on HS_HOST:HS_PORT and request traffic statistics on
>> all objects, a single object, as well as be able to change the
>> format and threshold values of a particular nfacct object.
> 
> That sounds to me a lot like IPFIX. There's some preliminary support
> for it in ulogd2, it should be relatively easy to finish it.
> 
>> The connection itself will usually be on a pre-defined tcp port
>> (above the 30000 range). For the final stage of the project (this is
>> still in its infancy though) we will need to implement Scenario 2
>> above: if the medium used by the RS to connect to the HS is
>> untrusted network, then we will deploy tls encryption (that will, of
>> course, require certificate chain verification on both sides).
>>
>> I haven't looked at the ulog2 and the nfacct module implementation
>> there in great detail, but from what I can gather, it just issues
>> and records stats on all nfacct objects at pre-selected interval
>> configurable at start-up (via the "pollinterval" option) and all
>> this is done locally, which is obviously insufficient for us.
> 
> I mean that ulogd2 provides the framework for Netfilter's user-space
> logging. It's pretty extensible. You will surely need to make some
> changes for what you need, but most of the infrastructure is already
> there. No need for another separated daemon.
My concern with this is that even though you say it is extensible, there will be quite a lot of redesigning and rewriting of code - the ulogd2 primary function is to check packets and pipe this through various "channels" defined in advance at start up, which is quite different from the functionality required by the nfacct daemon. To use a common expression - I think using ulogd2 as nfacct daemon will be like trying to put square pegs in round holes - not impossible, but difficult and cumbersome to implement somewhat.

Having written all of this, I will leave this alternative active and see if something can be done, even if we don't actually deploy this, for the benefit of the community.

>>>>>>  	if (matching) {
>>>>>>  		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
>>>>>> -			/* reset counters if you request a replacement. */
>>>>>> +			/* reset counters if you request a replacement */
>>>>>> +			if (!tb[NFACCT_PKTS]) {
>>>>>> +				/*
>>>>>> +				 * Prevent resetting the packets counter if
>>>>>> +				 * either fmt or bthr are specified.
>>>>>> +				 *
>>>>>> +				 * This is done for backward compatibility,
>>>>>> +				 * otherwise resetting these counters should
>>>>>> +				 * only be allowed when tb[NFACCT_PKTS] is
>>>>>> +				 * explicitly specified and == 0.
>>>>>> +				 *
>>>>>> +				 */
>>>>>> +				if (!tb[NFACCT_FMT] &&
>>>>>> +				    !tb[NFACCT_BTHR]) {
>>>>>>  			atomic64_set(&matching->pkts, 0);
>>>>>> +				}
>>>>>> +			} else {
>>>>>> +				atomic64_set(&matching->pkts,
>>>>>> +				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
>>>>> The replacement operation is not so easy. Note that you may hit
>>>>> inconsistencies if while replacing the packet counter, the kernel
>>>>> updates the byte counter, and then you replace the byte counter. You
>>>>> would be leaking bytes and packets.
>>>>>
>>>> If I understand you correctly Pablo, you are worried that the packet
>>>> and byte counters can be, in a way "misaligned". But if we are to
>>>> update these values we don't really care if the old values get
>>>> updated or not prior to that change, do we?
>>>>
>>>> In other words, the packet value is updated and the kernel changes
>>>> the byte value in the meantime. If we are later (in the same call)
>>>> replacing the bytes value why do we care that there is a momentary
>>>> inconsistency - we are going to replace it with a new value anyway?
>>> IMO we should care.
>> Pablo, let me re-cap so that I make sure we are discussing the same
>> thing...
>>
>> Since nfnetlink_acct.c allows only packet or only byte counters to
>> be updated, this may cause inconsistency (a valid concern!). So,
>> what I proposed in my previous posting is roughly the following:
>>
>> nfnetlink_acct.c::nfnl_acct_new:
>> 	if (matching) {
>> /*...*/
>> 		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
>> 			atomic64_set(&nfacct->pkts,
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS])));
>> 			atomic64_set(&nfacct->bytes,
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES])));
>> 		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
>> 			return -EINVAL;	// or "return -EBUSY;"
>> /*...*/
>>
>> If I understand you correctly, you are concerned that between the
>> time after "atomic64_set(&nfacct->pkts..." is executed and
>> "atomic64_set(&nfacct->bytes" not yet started/finished, even though
>> these two statements follow one after another, you think there will
>> be a momentary inconsistency between the byte and packet counter
>> values, is that right?
>>
>> If that is so, then the existing kernel code is having the same
>> problem. For example:
>>
>> a)
>> nfnetlink_acct.c::nfnl_acct_new:
>> 	if (matching) {
>> 		if (nlh->nlmsg_flags & NLM_F_REPLACE) {
>> 			/* reset counters if you request a replacement.	*/
>> 			atomic64_set(&matching->pkts, 0);
>> 			atomic64_set(&matching->bytes, 0);
> 
> We should check there if that acct object is in used.
Why? In my opinion, this is not necessary. The replace "mode" can be needed in both cases: where refcnt > 0 (nfacct object is used) and also where refcnt == 0 (nfacct object is not used). I don't see why do we need to make that distinction - we should be able to replace nfacct properties regardless of whether that object is used (in other words, regardless of the value of refcnt).

> I think there is
> no user-space code using the NLM_F_REPLACE flag IIRC.
I don't know whether that is the case or not as I don't have access to all machines on which this kernel code is deployed. We certainly are using this (as part of the "nfacct add/get" as well as "nfacct restore").

The above, I think, is a mute point since the code is there and is part of the current kernel and therefore we should consider the possibility that this functionality has been used (otherwise why implement it?).

The point I was trying to make is that what I proposed with the patch is not different from what currently exists in the kernel code (above) in view of your concerns.

> 
>> 			return 0;
>> 		}
>> 		return -EBUSY;
>> 	}
>>
>> b)
>> nfnetlink_acct.c::nfnl_acct_fill_info:
>> 	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
>> 		pkts = atomic64_xchg(&acct->pkts, 0);
>> 		bytes =	atomic64_xchg(&acct->bytes, 0);
>> 	} else {
>> 		pkts = atomic64_read(&acct->pkts);
>> 		bytes =	atomic64_read(&acct->bytes);
>> 	}
> 
> In this case, we don't leak packet/bytes counters.
Isn't the first part of this "If" statement the equivalent of what is currently done in the existing nfnetlink_acct.c::nfnl_acct_new (clearing both counters on replace "mode")? To me that is the same thing with the same problems I described above. The only difference is that we save the "old" values in a pair of variables.

> 
>> c)
>> nfnetlink_acct.c::nfnl_acct_update:
>> 	atomic64_inc(&nfacct->pkts);
>> 	atomic64_add(skb->len, &nfacct->bytes);
>>
>> In all of the above cases (a-c) the update statements are identical
>> with what I proposed above, so if this is undesirable, then the
>> existing code in the kernel in the above 3 cases I listed should
>> also be changed, correct?
> 
> I don't see why c) is wrong. My concern is mostly about leaking data.
I am assuming that by that you mean byte and packet counters not updated synchronously when we "lose" either packet or byte counts, correct?

>>> Let me check if I can come with some lockless way
>>> to replace counters, so we can remove that -EBUSY limitation.
>> This discussion we are having pointed me to some interesting idea.
>>
>> Packet and byte counters are always updated in "unison" - one cannot
>> (or should not) be updated without the other. The current "atomic"
>> operations are performed separately (and independently!) for packet
>> and then byte counters. Each atomic operation (atomic64_set,
>> atomic64_inc etc) does a spinlock, again, separately for packet and
>> byte counters. If we are always updating these variables in "unison"
>> together, then we don't really need to do 2 locks - we need just
>> one.
>>
>> In other words, we can "demote" pkts and bytes variables from
>> "atomic64_t" to uint64_t and use a manual lock which is specific for
>> each nfacct object to update these values. In other words, something
>> like this:
>>
>> struct nf_acct {
>> 	uint64_t		pkts;
>> 	uint64_t		bytes;
>> 	atomic64_t		bthr;
>> 	atomic_t		fmt;
>> 	struct list_head	head;
>> 	atomic_t		refcnt;
>> 	char			name[NFACCT_NAME_MAX];
>> 	/* Lock	protecting packet & byte counters */
>> 	rwlock_t		pb_lock;
>> 	struct rcu_head		rcu_head;
>> };
>> /*...*/
>>
>> nfnetlink_acct.c::nfnl_acct_new:
>> 	if (matching) {
>> /*...*/
>> 		if (tb[NFACCT_PKTS] && tb[NFACCT_BYTES]) {
>> 			write_lock_bh(&nfacct->pb_lock);
>> 			nfacct->pkts =
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_PKTS]));
>> 			nfacct->bytes =
>> 				be64_to_cpu(nla_get_be64(tb[NFACCT_BYTES]));
>> 			write_unlock_bh(&nfacct->pb_lock);
>> 		} else if (tb[NFACCT_PKTS] || tb[NFACCT_BYTES])	{
>> 			return -EINVAL;	// or "return -EBUSY;"
>> /*...*/
>>
>> 	nfacct = kzalloc(sizeof(struct nf_acct), GFP_KERNEL);
>> 	if (nfacct == NULL)
>> 		return -ENOMEM;
>>
>> 	strncpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
>> 	rwlock_init(&nfacct->pb_lock);
>> /*...*/
>>
>> nfnetlink_acct.c::nfnl_acct_fill_info:
>> 	if (type == NFNL_MSG_ACCT_GET_CTRZERO) {
>> 		write_lock_bh(&acct->pb_lock);
>> 		pkts = acct->pkts;
>> 		bytes =	acct->bytes;
>> 		acct->pkts = 0;
>> 		acct->bytes = 0;
>> 		write_unlock_bh(&acct->pb_lock);
>> 	} else {
>> 		read_lock_bh(&acct->pb_lock);
>> 		pkts = acct->pkts;
>> 		bytes =	acct->bytes;
>> 		read_unlock_bh(&acct->pb_lock);
>> 	}
>> /*...*/
>>
>> nfnetlink_acct.c::nfnl_acct_update:
>> 	write_lock_bh(&nfacct->pb_lock);
>> 	nfacct->pkts += 1;
>> 	nfacct->bytes += skb->len;
>> 	write_unlock_bh(&nfacct->pb_lock);
>>
>> That way, we use only one lock (instead of two as was the case with
>> "atomic64_set" and friends) and there should not be any concerns any
>> longer as to whether there will be any inconsistency between packet
>> and byte counter values because these are applied in "unison" during
>> a write lock, which is later released after the operation is
>> complete. What do you think, would that be satisfactory?
> 
> Eric Dumazet proposed during the latest workshop that we could add
> some flag to allow different flavours of acct object:
> 
> * per-cpu based, they consume more memory, but they are lockless.
By "lockless" you mean using atomic-types? They aren't "lockless" since the atomic operations do uses a lock. As I pointed out before, for updating packet and bytes counters we use two such locks (where I think only one is needed since we should update the byte and packet counters together, not separate).

> * compact, lock based, they consume little memory, but bottleneck is
>   locking.
I'll see what would be the difference in performance between atomic locking (or "lockless" as you mentioned above) and using a "proper" lock (what I proposed with my previous posting) - it would be interesting to check this out.

> So you select what you need according to your requirements.  I plan to
> send some patches to support this.
OK, at least for the kernel part, I will then update the patches to v2 in a few moments and make sure that:
* packet and byte counters are updated together in nfnl_acct_new and issue "EINVAL" error if that is not the case;
* fmt and bthr are "normal" variables and drop the "atomic" part;

Thanks Pablo.

MZ

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html