> Hello,
>
> We are developing an advanced networking services loadable module and
> are having problems porting it to work on 2.4.x kernels. The driver is
> supposed to provide services such as fault tolerance, load balancing and
> link aggregation over a team of network adapters. It works OK on 2.2.x
> kernels but hangs on 2.4.x kernels.
>
> In order to debug it, we stripped it down to a mere "intermediate" or
> "filter" driver that binds to a base driver and passes everything
> through in both directions (Rx, Tx, IOCTL, stats, etc.). After going
> through the basics of modifying the driver to compile on 2.4.x kernels
> and fighting some nasty deadlocks caused by the new design of the
> networking layer, we managed to get it to run. The driver will receive
> and transmit a few hundred thousand packets (while a periodic timer
> expires 10 times a second and IOCTLs run continuously), and then it
> causes an oops about not being able to handle a page fault.
>
> The function looks something like:
>
> int iansHardStartXmit(struct sk_buff *skb, struct net_device *dev) {
>     int res;
>     struct net_device *base;
>
>     spin_lock(&lock);
>     base = get_base_driver_by_name(name);
>
>     if(base != NULL) {
>         res = base->hard_start_xmit(skb, base);
>     }
>
>     spin_unlock(&lock);
>     return res;
> }
>
> We used kdb to track down the problem and found the following stack
> trace:
>
> EBP        EIP        function(args)
> 0xc4cd1c54 0xd081e3e7 [e100]__kallsyms+0xb (0xc4b595a0, 0xc840f200)
>                       e100 __kallsyms 0xd081e3dc 0xd081e3dc 0xd0820dsc
>            0xd08244ba [ians]iansHardStartXmit+0xa6 (0xc4b595a0, 0xc4d9bc00)
>                       ians .text 0xd0824060 0xd0824414 0xd082452c
>            0xc01f9d1f qdisc_restart+0xcf (0xc4d9bc00)
>                       kernel .text 0xc0100000 0xc01f9c50 0xc01f9f14
>            *
>            *
>            *
>
> This goes on and shows that this is an ICMP echo reply packet going down
> through the IP stack to the filter driver (apparently 0xc4b595a0 is the
> skb, 0xc4d9bc00 is the *dev of the filter driver and 0xc840f200 is the
> *dev of the base driver). The filter driver is supposed to call the
> dev->hard_start_xmit of the base driver, but strangely it lands
> somewhere in the data segment of the base driver (__kallsyms is a part
> of the symbol table of the module according to insmod -m).
>
> Suspecting that the dev->hard_start_xmit pointer got trashed somehow, we
> added a check to make sure the same pointer is always called, and indeed
> this was the case. Looking at the assembly code with kdb, we could see
> that the call to the base driver is done by a 'call *%eax' instruction.
> kdb reports that eax=0xffffffff after the page fault (origeax).
>
> How is it possible that the pointer to the function keeps its value, but
> the jump to that function lands somewhere else?
> The entire function is protected by a spinlock, so there is no worry
> about other threads corrupting my data.
>
> We are using:
>     RedHat 6.2
>     gcc v2.91.66
>     modutils v2.3.11-1
>     kernel linux-2.4.0-test9
>     kdb v1.5-2.4.0-test9-pre9
>     Compaq ap500 dual p-III Xeon
>
>
> Thanks,
> Shmulik Hen
>
> Software Engineer
> Linux Advanced Networking Services
> Network Communications Group, Israel (NCGj)
> Intel Corporation Ltd.
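
For readers following along, the pointer check mentioned in the quoted message ("make sure the same pointer is always called") might look roughly like the user-space sketch below. It is not the poster's code. The struct definitions are simplified stand-ins for the kernel types, and bind_base(), checked_xmit(), dummy_xmit() and other_xmit() are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumption: not the real layouts). */
struct sk_buff { int len; };
struct net_device;
typedef int (*xmit_fn)(struct sk_buff *skb, struct net_device *dev);
struct net_device { xmit_fn hard_start_xmit; };

/* Pointer value recorded once, when the filter binds to the base driver. */
static xmit_fn expected_xmit;

/* Hypothetical helper: remember the base driver's transmit routine. */
static void bind_base(struct net_device *base)
{
    expected_xmit = base->hard_start_xmit;
}

/*
 * Hypothetical helper: call the base driver only if its transmit pointer
 * still matches the recorded one. Returns -1 without calling otherwise.
 */
static int checked_xmit(struct sk_buff *skb, struct net_device *base)
{
    xmit_fn fn;

    if (base == NULL)
        return -1;

    /* Read the pointer once so the value checked is the value called. */
    fn = base->hard_start_xmit;
    if (fn == NULL || fn != expected_xmit)
        return -1;

    return fn(skb, base);
}

/* Dummy transmit routines standing in for a base driver. */
static int dummy_xmit(struct sk_buff *skb, struct net_device *dev)
{
    (void)dev;
    return skb->len;
}

static int other_xmit(struct sk_buff *skb, struct net_device *dev)
{
    (void)skb;
    (void)dev;
    return 0;
}
```

A check of this kind only shows that the value stored in the structure has not changed. It says nothing about whether the code or memory at that address is still valid, which is consistent with the symptom described in the message: the check passes, yet the call still faults.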