HTB kernel panic crash!

Trevor Cordes <lartc@xxxxxxxxxxxxx> · Tue, 17 Aug 2004 13:47:31 -0500 (CDT)

(list admin, please cancel the same post from my other email address --
forgot to change it on first submission)

I need to set up QoS on a Linux router/firewall I maintain.  I spent 10
hours reading everything I could find on QoS/HTB/iproute2 and came up with
what I thought made sense for my situation.  So I deployed it and BOOM!
KERNEL PANIC!  Not what I was expecting... now the debugging begins.

I reproduced the panic twice on two different (yet almost identically
configured) machines.  I can reproduce the panic on demand by doing a
specific set of actions.

First, my setup:

I have 2 machines at different locations connected via the internet.  Both
machines run the stock Fedora Core 1 kernel 2.4.22-1.2179.nptl.  I run
FreeS/WAN (stock FC binary RPMs) between the 2 machines for an IPsec VPN.
I run VoIP, VNC and all other inter-office traffic through the VPN.  The
internet connection is ADSL with 400 kbit/s up and 1500 or so down.  VoIP
is routed but not MASQ'd.  VNC is MASQ'd (neither the originating nor
destination machines are the Linux boxes themselves).

Second, my goals:

Give a fixed minimum bandwidth and high priority to VoIP through VPN.
Same, but less so, for VNC through VPN.  Give the VPN high enough
allocation for VoIP and VNC to get through ok.  Less important little
tweaks for rarely-used outside (non IPSEC) VNC and ssh access.

My situation seems different from the examples I've seen because *I
believe* I need to have 2 completely separate qdiscs, 1 for ppp0 (the
DSL) and 1 for ipsec0 (the FreeS/WAN VPN).  Yet ipsec0 eventually goes over
ppp0 so they are intertwined.  I have a funny feeling this is where the
crash is coming from.

See my setup script near the bottom of this email (excuse the wrapping).

Everything seemed to go great until I tried VNC'ing in from one office to
the other.  The VNC screen would pop up, do a first draw, then completely
freeze.  From that point on the remote linux router is frozen -- kernel
panic.  Strange that the bug would only trigger AFTER sending the
100-200kB of the initial VNC screen.

Looking at my config, I will note a couple of questions I had while
writing it that weren't answered in the docs I found:

1. The "tc filter add ... protocol ip" thing confused me.  What exactly is
the "protocol ip" for?  I originally thought that it should read "protocol
50" for the ipsec stuff, but that didn't seem to catch the packets, so I
switched it back to "ip".  Oddly, while testing with it set to 50 (and
having no packets match the rule) there were no crashes.

2. In the case of VNC and ssh *over VPN*, a packet will match two of the
iptables mangle rules.  I *assume* the last executing MARK will overwrite
the previous MARK.  If for some reason the marks are ANDed or something,
perhaps that is causing the crash (filtering 1 packet into 2 buckets?).

3. As I mentioned above, the fact that one qdisc will feed a separate
qdisc, because ipsec0 eventually goes out over ppp0, may be a problem?  I
wish I had seen some examples of this type of setup.

4. I chose HTB instead of CBQ as it seemed simpler (always a good thing)
and more suited to my exact needs.  Not sure if the bug is in HTB itself
or the general QoS stuff.
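
One way to narrow down questions 1-3 before reproducing the crash is to
look at the packet counters on the mangle rules and on the HTB classes.
For what it's worth, tc's "protocol" keyword selects the link-layer protocol
of the packet ("ip" meaning IPv4), not the IP protocol number, which would
explain why "protocol 50" matched nothing.  The following is only a minimal
sketch, assuming the interface names from the setup script; the
qos_stats_cmds helper is a made-up name for illustration.  It prints the
commands rather than running them, so they can be reviewed first on a
production box.

```shell
# Hypothetical helper: print (do not run) the inspection commands for one device.
qos_stats_cmds() {
    dev=$1
    # Per-rule packet/byte counters show which MARK rules a flow actually hits.
    echo "iptables -t mangle -L -n -v"
    # Per-class counters show which HTB class the marked packets end up in.
    echo "tc -s class show dev $dev"
    # Lists the fw filters attached to the device.
    echo "tc filter show dev $dev"
}

# Device names taken from the setup script.
for dev in ipsec0 ppp0; do
    qos_stats_cmds "$dev"
done
```

To actually run the printed commands, pipe the output to sh as root.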

my setup script:

  iext=ppp0     # DSL link
  isec=ipsec0   # FreeS/WAN VPN interface
  ivoi=eth3
  qosbw=380     # usable upstream in kbit

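  # VNC (either direction) -> mark 11
  iptables -t mangle -A PREROUTING -p tcp --sport 5900 -j MARK --set-mark 11
  iptables -t mangle -A PREROUTING -p tcp --dport 5900 -j MARK --set-mark 11
  # Everything arriving on $ivoi -> mark 10
  iptables -t mangle -A PREROUTING  -i $ivoi -j MARK --set-mark 10
  # IPsec ESP (protocol 50) and AH (protocol 51) sent by this box -> mark 10
  iptables -t mangle -A OUTPUT -p 50 -j MARK --set-mark 10
  iptables -t mangle -A OUTPUT -p 51 -j MARK --set-mark 10
  # ssh replies from this box going out the DSL link -> mark 12
  iptables -t mangle -A OUTPUT -o $iext -p tcp --sport ssh -j MARK --set-mark 12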

  tc qdisc  del dev $isec root >/dev/null 2>&1
  tc qdisc  add dev $isec root handle 1:0 htb default 13
  tc class  add dev $isec parent 1:0 classid 1:1  htb rate "$qosbw"kbit ceil "$qosbw"kbit
  tc class  add dev $isec parent 1:1 classid 1:10 htb rate 160kbit ceil "$qosbw"kbit
  tc class  add dev $isec parent 1:1 classid 1:11 htb rate 210kbit ceil "$qosbw"kbit
  tc class  add dev $isec parent 1:1 classid 1:13 htb rate 010kbit ceil "$qosbw"kbit
  tc qdisc  add dev $isec parent 1:10 handle 110:0 sfq perturb 10
  tc qdisc  add dev $isec parent 1:11 handle 111:0 sfq perturb 10
  tc qdisc  add dev $isec parent 1:13 handle 113:0 sfq perturb 10
  tc filter add dev $isec parent 1:0 protocol ip handle 10 fw flowid 1:10
  tc filter add dev $isec parent 1:0 protocol ip handle 11 fw flowid 1:11

  tc qdisc  del dev $iext root >/dev/null 2>&1
  tc qdisc  add dev $iext root handle 1:0 htb default 13
  tc class  add dev $iext parent 1:0 classid 1:1  htb rate "$qosbw"kbit ceil "$qosbw"kbit
  tc class  add dev $iext parent 1:1 classid 1:10 htb rate 300kbit ceil "$qosbw"kbit
  tc class  add dev $iext parent 1:1 classid 1:11 htb rate 050kbit ceil "$qosbw"kbit
  tc class  add dev $iext parent 1:1 classid 1:12 htb rate 020kbit ceil "$qosbw"kbit
  tc class  add dev $iext parent 1:1 classid 1:13 htb rate 010kbit ceil "$qosbw"kbit
  tc qdisc  add dev $iext parent 1:10 handle 110:0 sfq perturb 10
  tc qdisc  add dev $iext parent 1:11 handle 111:0 sfq perturb 10
  tc qdisc  add dev $iext parent 1:12 handle 112:0 sfq perturb 10
  tc qdisc  add dev $iext parent 1:13 handle 113:0 sfq perturb 10
  tc filter add dev $iext parent 1:0 protocol ip handle 10 fw flowid 1:10
  tc filter add dev $iext parent 1:0 protocol ip handle 11 fw flowid 1:11
  tc filter add dev $iext parent 1:0 protocol ip handle 12 fw flowid 1:12
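
As a sanity check on the numbers in the script: the usual HTB guidance is
that the rates of the child classes should add up to no more than the
parent's rate.  The sketch below does that sum in plain shell arithmetic,
with the rates copied from the script.  sum_rates is a helper name invented
here, and the leading zeros (010, 050, 020) are dropped on purpose because
shell arithmetic would read them as octal.

```shell
qosbw=380   # parent rate/ceil in kbit, as in the setup script

# Hypothetical helper: add up a list of rates given in kbit.
sum_rates() {
    total=0
    for r in "$@"; do
        total=$((total + r))
    done
    echo "$total"
}

isec_total=$(sum_rates 160 210 10)      # classes 1:10, 1:11, 1:13 on ipsec0
iext_total=$(sum_rates 300 50 20 10)    # classes 1:10 to 1:13 on ppp0

for t in "$isec_total" "$iext_total"; do
    if [ "$t" -le "$qosbw" ]; then
        echo "child rates total ${t}kbit: within parent rate ${qosbw}kbit"
    else
        echo "child rates total ${t}kbit: exceeds parent rate ${qosbw}kbit"
    fi
done
```

Both devices come out at exactly 380kbit, so the class rates themselves are
not oversubscribed.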

Here is what the kernel panic dumped on-screen.  I couldn't find any way
to scroll up and didn't have sysrq enabled and didn't have the ability to
enable and reproduce (system was being used live in production during
business hours!).  I could potentially go back off-hours and reproduce
with sysrq and get more info (hopefully).  There may be slight typos as
this was manually copied to paper and then back into this email!

... anything above not visible...
eax 0   ebx 8   ecx 1  edx d741c001

esi c0384000  edi 0  ebp d741c009  esp c0385ef8

ds 68  es 68  ss 60

process (pid 0  stackpage c0385000)

stack
ddfc9000  ddfc9244  c0385f40  0  1  c0385f8c  0  c0125e93
c038400  0  1  0  c0385f8c  20000001  c0126278  0
c010ed90  c0385f8c  c033ddb0  20000001  c010b1a5  0  0  c0385f8e

call trace
c0125e93 update_process_times   [k] 0x33 0xc0385f14
c0126278 do_timer               [k] 0x28 0xc0385f30
c010ed90 timer_interrupt        [k] 0x80 0xc0385f38
c010b1a5 handle_IRQ_event       [k] 0x45 0xc0385f48
c010b324 do_IRQ                 [k] 0x64 0xc0385f68
c010db28 call_do_IRQ            [k] 0x05 0xc0385f88
c0110068 restore_i387           [k] 0x28 0xc0385fa8
c0106fb3 default_idle           [k] 0x23 0xc0385fb4
c0115a7c apm_cpu_idle           [k] 0xac 0xc0385fc0
c01159d0 apm_cpu_idle           [k] 0x00 0xc0385fc4
c0107032 cpu_idle               [k] 0x32 0xc0385fd4
c0105000 stext                  [k] 0x00 0xc0385fe0

code
01 b8 d0 01 00 00 01 88 d4 01 00 00 b8 1f 85 eb 51 89 96 c4

<0> kernel panic: Aiee, killing int handler!
In interrupt handler not syncing
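
For the off-hours reproduction mentioned above, the magic SysRq key can be
switched on at runtime through /proc/sys/kernel/sysrq on kernels built with
CONFIG_MAGIC_SYSRQ.  This is a minimal sketch only: it assumes the stock
FC1 kernel was built with that option (not verified here), and the
enable_sysrq helper with its optional path argument is a made-up name for
illustration.

```shell
# Hypothetical helper: enable magic SysRq by writing 1 to the control file.
# The optional argument overrides the path (handy for a dry run on a scratch file).
enable_sysrq() {
    ctl=${1:-/proc/sys/kernel/sysrq}
    if [ ! -w "$ctl" ]; then
        echo "cannot write $ctl (not root, or SysRq not compiled in?)" >&2
        return 1
    fi
    echo 1 > "$ctl"
}

# Usage, as root, before reproducing the panic:
#   enable_sysrq
```

The setting made this way does not survive a reboot, so it has to be
repeated after each crash.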

Lastly, thanks!

_______________________________________________
LARTC mailing list / LARTC@xxxxxxxxxxxxxxx
http://mailman.ds9a.nl/mailman/listinfo/lartc HOWTO: http://lartc.org/
