On Thursday, 09 June 2011 at 01:02 +0800, Brad Campbell wrote:
> On 08/06/11 11:59, Eric Dumazet wrote:
>
> > Well, a bisection definitely should help, but needs a lot of time in
> > your case.
>
> Yes. compile, test, crash, walk out to the other building to press
> reset, lather, rinse, repeat.
>
> I need a reset button on the end of a 50M wire, or a hardware watchdog!
>
> Actually it's not so bad. If I turn off slub debugging the kernel panics
> and reboots itself.
>
> This.. :
> [    2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
> [    2.913066] netconsole: device eth0 not up yet, forcing it
> [    3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
> [    3.660118] Switching to clocksource tsc
> [   63.200273] r8169 0000:03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
> [   63.223513] r8169 0000:03:00.0: eth0: link down
> [   63.223556] r8169 0000:03:00.0: eth0: link down
>
> ..is slowing down reboots considerably. 3.0-rc does _not_ like some
> timing hardware in my machine. Having said that, at least it does not
> randomly panic on SCSI like 2.6.39 does.
>
> Ok, I've ruled out TCPMSS. Found out where it was being set and neutered
> it. I've replicated it with only the single DNAT rule.
>
> >
> > Could you try following patch, because this is the 'usual suspect' I had
> > yesterday :
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 46cbd28..9f548f9 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> >  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
> >  	}
> >  
> > +#if 0
> >  	if (fastpath &&
> >  	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
> >  		memmove(skb->head + size, skb_shinfo(skb),
> > @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> >  		off = nhead;
> >  		goto adjust_others;
> >  	}
> > -
> > +#endif
> >  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> >  	if (!data)
> >  		goto nodata;
> >
> >
> >
>
> Nope.. that's not it. <sigh> That might have changed the characteristic
> of the fault slightly, but unfortunately I got caught with a couple of
> fsck's, so I only got to test it 3 times tonight.
>
> It's unfortunate that this is a production system, so I can only take it
> down between about 9pm and 1am. That would normally be pretty
> productive, except that an fsck of a 14TB ext4 can take 30 minutes if it
> panics at the wrong time.
>
> I'm out of time tonight, but I'll have a crack at some bisection
> tomorrow night. Now I just have to go back far enough that it works, and
> be near enough not to have to futz around with /proc /sys or drivers.
>
> I really, really, really appreciate you guys helping me with this. It
> has been driving me absolutely bonkers. If I'm ever in the same town as
> any of you, dinner and drinks are on me.

Hmm, I wonder if kmemcheck could help you, but it's slow as hell, so not
appropriate for production :(