Wow, a blast from the past :-) I'm the original author of the hash code, and since I don't use ipv6 (yet) I must have missed it in my testing. On Fri, 2007-06-22 at 14:36 -0700, Masayuki Nakagawa wrote: > I found an issue regarding networking in the real-time patch (patch-2.6.21.5-rt17). > The issue happens only with the kernel, which the real-time patch was applied. > However, the latest stable main kernel (2.6.21.5) doesn't have the same issue. > Therefore, please don't transfer this report to netdev. > > The detail of issue is below. > I ran my test program, which is a very simple IPv6 client-server program. > These programs establish a TCP/IPv6 connection between two hosts, and then sleep, > like following diagram. > And then, the problem appears with "cat /proc/net/tcp6". > > serverA serverB > | SYN | > +--------------->+ > | SYN/ACK | > +<---------------+ > | ACK | > +--------------->+ > | | > sleep... sleep... > | | > > When I "cat /proc/net/tcp6" on serverA while establishing connection between serverA and B, > the established connections are not shown. > If you need my test program, please let me know. I can provide it to you. > > However, in case of the main-line kernel, the established connections will be > shown appropriately with "cat /proc/net/tcp6". It's different because the > real-time patch has implemented a new socket lookup mechanism for a high-latency. > So, real-time patch has a different mechanism from main-line kernel. > > The real-time patch, which implemented a new socket lookup mechanism is using > bitmap(ebitmask). When establishing TCP connection, it sets a flag bit into > the bitmap like followings. > > [ebitmask in struct inet_hashinfo] > Before connecting > 0000000000000000000000000000000000000000000000000000000000000000 > After connecting > 0000001000000000000000000000000000000000000000000000000000000000 > ^ > > And when reading "/proc/net/tcp and tcp6", the kernel searches the currently active > TCP connections with reference to the bitmap. > > However, the kernel can't search the active TCP/IPv6 connection in established state. > It is because the kernel doesn't set a flag bit when establishing TCP/IPv6 connection. > In case of TCP/IPv4, __inet_hash() sets the flag bit properly with __inet_hash_setbit(). > But, in case of TCP/IPv6, the setting the flag bit is missing in __inet6_hash(). > > [include/net/inet_hashtables.h] > static inline void __inet_hash(struct inet_hashinfo *hashinfo, > struct sock *sk, const int listen_possible) > { > struct hlist_head *list; > rwlock_t *lock; > unsigned long *bitmask = NULL; > unsigned int index = 0; > > BUG_TRAP(sk_unhashed(sk)); > if (listen_possible && sk->sk_state == TCP_LISTEN) { > list = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)]; > lock = &hashinfo->lhash_lock; > inet_listen_wlock(hashinfo); > } else { > struct inet_ehash_bucket *head; > sk->sk_hash = inet_sk_ehashfn(sk); > index = inet_ehash_index(hashinfo, sk->sk_hash); > head = inet_ehash_bucket(hashinfo, sk->sk_hash); > list = &head->chain; > lock = &head->lock; > bitmask = hashinfo->ebitmask; > write_lock(lock); > } > __sk_add_node(sk, list); > __inet_hash_setbit(bitmask, index); > sock_prot_inc_use(sk->sk_prot); > write_unlock(lock); > if (listen_possible && sk->sk_state == TCP_LISTEN) > wake_up(&hashinfo->lhash_wait); > } > > [net/ipv6/inet6_hashtables.c] > void __inet6_hash(struct inet_hashinfo *hashinfo, > struct sock *sk) > { > struct hlist_head *list; > rwlock_t *lock; > > printk("__inet6_hash hit\n"); > > BUG_TRAP(sk_unhashed(sk)); > > if (sk->sk_state == TCP_LISTEN) { > list = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)]; > lock = &hashinfo->lhash_lock; > inet_listen_wlock(hashinfo); > } else { > unsigned int hash; > sk->sk_hash = hash = inet6_sk_ehashfn(sk); > hash &= (hashinfo->ehash_size - 1); > list = &hashinfo->ehash[hash].chain; > lock = &hashinfo->ehash[hash].lock; > write_lock(lock); > } > > __sk_add_node(sk, list); > sock_prot_inc_use(sk->sk_prot); > write_unlock(lock); > } > > So, I suggest a following change. > The change is to set the flag bit appropriately in __inet6_hash(). > > Signed-off-by: Masayuki Nakagawa <nakagawa.msy@xxxxxxxxxxxxxx> > Very nice investigation and solution. > > Index: linus-kernel.git/net/ipv6/inet6_hashtables.c > =================================================================== > --- linus-kernel.git.orig/net/ipv6/inet6_hashtables.c > +++ linus-kernel.git/net/ipv6/inet6_hashtables.c > @@ -27,6 +27,8 @@ void __inet6_hash(struct inet_hashinfo * > { > struct hlist_head *list; > rwlock_t *lock; > + unsigned long *bitmask = NULL; > + unsigned int index = 0; > > BUG_TRAP(sk_unhashed(sk)); > > @@ -35,15 +37,16 @@ void __inet6_hash(struct inet_hashinfo * > lock = &hashinfo->lhash_lock; > inet_listen_wlock(hashinfo); > } else { > - unsigned int hash; > - sk->sk_hash = hash = inet6_sk_ehashfn(sk); > - hash &= (hashinfo->ehash_size - 1); > - list = &hashinfo->ehash[hash].chain; > - lock = &hashinfo->ehash[hash].lock; > + sk->sk_hash = inet6_sk_ehashfn(sk); > + index = inet_ehash_index(hashinfo, sk->sk_hash); > + list = &hashinfo->ehash[index].chain; > + lock = &hashinfo->ehash[index].lock; > + bitmask = hashinfo->ebitmask; > write_lock(lock); > } > > __sk_add_node(sk, list); > + __inet_hash_setbit(bitmask, index); > sock_prot_inc_use(sk->sk_prot); > write_unlock(lock); > } I spent a few minutes looking at your changes and I don't see anything wrong with it. Acked-by: Steven Rostedt <rostedt@xxxxxxxxxxx> Looking at the original patch, here's the reason for the bitmask: --- linux-2.6.21.orig/net/ipv4/tcp_ipv4.c 2007-06-17 17:19:02.000000000 +0200 +++ linux-2.6.21/net/ipv4/tcp_ipv4.c 2007-06-17 17:20:27.000000000 +0200 @@ -2033,7 +2033,12 @@ static void *established_get_first(struc struct tcp_iter_state* st = seq->private; void *rc = NULL; - for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) { The above is a linear search through out a very large array, where most of the items are NULL. I believe it was Lee that noticed this creating a large latency. This was back in 2.6.14. I'll check to see if this still is a source of latency with the latest kernels. -- Steve + for (st->bucket = find_first_bit(tcp_hashinfo.ebitmask, + tcp_hashinfo.ehash_size); + st->bucket < tcp_hashinfo.ehash_size; + st->bucket = find_next_bit(tcp_hashinfo.ebitmask, + tcp_hashinfo.ehash_size, + st->bucket+1)) { - To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html