Re: null-pointer deref in ulogd2

Bernhard Schmidt <berni@xxxxxxxxxxxxx> · Tue, 23 Jun 2009 18:54:46 +0200

Hi Pablo,

now it seems to work okay. In the database about 90% of the flows have
flow_end_sec NULL.
Please raise "netlink_socket_buffer_size" and
"netlink_socket_buffer_maxsize". If you use the default buffer, it's
likely to overrun and thus lose events.

We had increased that in the meantime, to

netlink_socket_buffer_size=10854400
netlink_socket_buffer_maxsize=20971520

That pretty much stopped the warning messages in /var/log/ulogd.log.
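
As a side note on why the configured size and the effective size can differ, here is a minimal sketch. It is not ulogd2 code: it uses a plain UDP socket as a stand-in for the netlink socket, and the helper name effective_rcvbuf is made up for illustration. On Linux, a plain SO_RCVBUF request is capped at net.core.rmem_max and the kernel reports roughly double the granted value, so it is worth checking what was actually applied.

```python
import socket


def effective_rcvbuf(requested: int) -> int:
    """Request a receive buffer of `requested` bytes and return the size
    the kernel reports afterwards (hypothetical helper, not part of ulogd2)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # Ask for the buffer size; the kernel may silently cap the request.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        # Read back what was actually applied.
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # The two values from the configuration above.
    for requested in (10854400, 20971520):
        print(requested, "->", effective_rcvbuf(requested))
```

If the reported value is far below the requested one, the system-wide limit is probably the bottleneck rather than the ulogd setting.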

We had also figured that the hash was the problem, so we tried 
hash_enable=0 and used the INSERT_OR_REPLACE_CT function. However, that 
was also pretty unsuccessful. Right now we have 750k flows in ulog2_ct 
where ct_event < 4 (so, as far as I understand it, the DESTROY event has 
not yet been received), which is a bit too much for a box that only has 
40k-50k connections at the same time according to conntrack -L. There are 
1.67M flows in total, which I suspect is a bit low as well. When I did 100 
HTTP connections through the box I could only find ~20 flows in the 
database, none of them in DESTROYed state.

What is happening here?
I think that you're using the default "hash_max_entries", which is too
small. I suggest you raise this value. I'm going to push a patch that
adds information on tuning these parameters to the example config file.

I've now set

hash_buckets=81920
hash_max_entries=327680

and went back to hash_enable=1.
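
For a rough sanity check of these two values, the arithmetic can be sketched as below. The peak connection count is the upper end of the 40k-50k figure from conntrack -L mentioned earlier; how ulogd2 uses buckets internally is not shown here, so the ratio is only an indication.

```python
# Values from the configuration above.
hash_buckets = 81920
hash_max_entries = 327680

# Upper end of the 40k-50k concurrent connections reported by conntrack -L.
peak_connections = 50_000

# Ratio of maximum entries to buckets.
entries_per_bucket = hash_max_entries / hash_buckets

# How many times the peak connection count fits into the table.
headroom = hash_max_entries / peak_connections

print(f"max entries per bucket: {entries_per_bucket:.1f}")
print(f"headroom over peak connections: {headroom:.2f}x")
```

So the table should have plenty of room for the live connections on this box.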

However, it still doesn't look too great. About five minutes after 100 
TCP connects the number of flows in the ulog2_ct table for this IP 
address has stabilized at 116, consisting of
- 9 flows with both flow_start_sec and flow_end_sec
- 83 flows with only flow_start_sec
- 24 flows with only flow_end_sec

SELECT COUNT(DISTINCT orig_l4_sport) tells me that 92 real connections 
are listed in the table somehow, so 8 connections are totally lost and 
24 connections are listed twice.
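
The bookkeeping behind those numbers, as a small sketch (variable names are made up; the counts are the ones listed above):

```python
# Row counts in ulog2_ct for the test IP address.
both_timestamps = 9      # flow_start_sec and flow_end_sec set
start_only = 83          # only flow_start_sec set
end_only = 24            # only flow_end_sec set

connections_made = 100   # TCP connects issued during the test
distinct_sports = 92     # result of SELECT COUNT(DISTINCT orig_l4_sport)

total_rows = both_timestamps + start_only + end_only

# Connections that never made it into the table at all.
lost = connections_made - distinct_sports

# Rows beyond one per distinct source port, i.e. connections listed twice.
duplicate_rows = total_rows - distinct_sports

print(f"rows={total_rows} lost={lost} duplicates={duplicate_rows}")
```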

[ half an hour later ]

ARGH! I found my problem. Apparently Postgres was too slow on INSERT. 
Although the CPU load looked fine (and even IOWait wasn't out of the 
ordinary, 20% on one CPU), it seems to have blocked. After sacrificing 
consistency for speed by setting fsync=no in Postgres, the IOWait went 
down to 0.5% and I now have 100 flows, all of them with start and end!
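
To illustrate why a slow INSERT path can lose events even when the load looks harmless, here is a toy model. It is an assumption-laden sketch, not a description of how ulogd2 or the kernel actually buffer events: a fixed number of events arrives per tick, the database drains a fixed number per tick, and anything that does not fit into a bounded buffer is dropped. The function name and all numbers are made up.

```python
def simulate(arrivals_per_tick: int, drained_per_tick: int,
             capacity: int, ticks: int) -> int:
    """Return how many events are dropped when a bounded buffer is fed
    faster than it is drained (toy model, hypothetical parameters)."""
    queued = 0
    dropped = 0
    for _ in range(ticks):
        # New events arrive first.
        queued += arrivals_per_tick
        # Anything beyond the buffer capacity is lost.
        if queued > capacity:
            dropped += queued - capacity
            queued = capacity
        # The slow consumer (the INSERT path) then drains what it can.
        queued -= min(queued, drained_per_tick)
    return dropped


if __name__ == "__main__":
    print("slow sink:", simulate(100, 80, 1000, 100), "events dropped")
    print("fast sink:", simulate(100, 120, 1000, 100), "events dropped")
```

In this model a bigger buffer only delays the first drop; as long as the sink is slower than the arrival rate, events are eventually lost, which would match what I saw before speeding up Postgres.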

BTW, could you give this patch a quick test? Yours seems to leak
memory, since NFCT_CB_STOLEN means the ct object is not released (no
problem, I guess that you're not familiar with libnetfilter_conntrack).

Thanks. Yes, I'm not even that familiar with C :-)

Your patch compiles and runs fine. Can't tell much about memory leaks, 
but the system has not exploded yet.

Bernhard