Mike Allport wrote:
> I checked the patches you referenced versus the patches I created from
> the official linux patches. What you present is quite different from
> what I've got in my patches. My patches have FAR too many changes
> when compared to your patches (I think I 'over patched').
>
> Do you recommend patching our 2.6.14 kernel with the patches you
> reference below? I can easily do this and re-run our tests and
> verify oops resolution.
>
> Thanks,
> Mike
>

I just pointed you at certain subsets of changes. If you have these
changes in your code stream, then the problem is something different.
If your code stream doesn't appear to have these changes, you might
want to integrate them and see if your problem is fixed.

These were just the race conditions.

The first commit hash I listed (1bc4ee4088c9a502db0e9c87f675e61e57fa1734)
is actually wrong and doesn't apply to you. Instead you might want to
look at this commit:

ea2bc483ff5caada7c4aa0d5fbf87d3a6590273d
[SCTP]: Fix assertion (!atomic_read(&sk->sk_rmem_alloc)) failed message

-vlad

>
> On Fri, Nov 20, 2009 at 6:47 AM, Vlad Yasevich
> <vladislav.yasevich@xxxxxx> wrote:
>>
>> Mike Allport wrote:
>>> This email outlines some kernel panics (oops) we've been getting and
>>> would like to resolve.
>>>
>>> My question is, is there a patch or set of patches to the 2.6.14
>>> kernel known to resolve the oops shown in 1) and 2)?
>>>
>>> Unfortunately, I don't have the luxury of picking up a more up-rev'ed kernel.
>>>
>> Is your server using a 1-to-1 socket (SOCK_STREAM) or doing peeloffs?
>>
>> The traces you've shown aren't familiar, but they look a bit like
>> the association race traces.
>>
>> You want to make sure that you have the code from the following patches:
>>
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=
>> 1bc4ee4088c9a502db0e9c87f675e61e57fa1734
>> ae53b5bd77719fed58086c5be60ce4f22bffe1c6
>> 027f6e1ad32de32f9fe1c61d0f744e329e8acfd9
>> cfdeef3282705a4b872d3559c4e7d2561251363c
>> 61c9fed41638249f8b6ca5345064eb1beb50179f
>>
>> Put the above hash values after the 'h=' on the URL line and you will see the diff.
>>
>> -vlad
>>
>>> Background of our problem:
>>> ----------------------------------------
>>> At the 2.6.14-7 kernel, we were getting the oops shown in 1) and 2)
>>> frequently when running SCTP traffic to our server.
>>>
>>> In an effort to resolve the oops shown in 1) and 2), we ported over only
>>> the SCTP parts of the official Linux patches up to the 2.6.23 release
>>> found at ftp.kernel.org. The patching seems to have resolved the oops
>>> in 1) and 2), but introduced another set of oops which don't happen
>>> 'often' and are shown in 3) and 4) below.
>>>
>>>
>>>
>>> 1) This is the oops gotten at the 2.6.14-7 kernel that is not patched
>>> in the networking or SCTP areas. I have found some Google hits on the
>>> string at the bottom of this oops "KERNEL:
>>> assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at
>>> net/ipv4/af_inet.c (146)", but no resolutions offered.
>>>
>>>
>>> atcafs-n0s11:~# Oops: 0000 [#1]
>>> @SMP
>>> @LTT NESTING LEVEL : 0
>>> @Modules linked in: sctp ip_queue iptable_filter ip_tables bonding
>>> loop ohci_hcd i2c_i801 i2c_core ehci_hcd ipmi_watchdog ipmi_si
>>> ipmi_devintf ipmi_msghandler softdog video thermal processor fan
>>> button battery ac
>>> @CPU: 1
>>> @EIP: 0060:[<f89f2e6f>] Not tainted VLI
>>> @EFLAGS: 00010282 (2.6.14.7-selinux1-WR1.4aq_cgl)
>>> @EIP is at sctp_getsockopt_sctp_status+0x100/0x1de [sctp]
>>> @eax: 00000000 ebx: 000000b0 ecx: 00000000 edx: 00000000
>>> @esi: d5e02000 edi: d64f1640 ebp: d93f2e78 esp: d93f2dac
>>> @ds: 007b es: 007b ss: 0068
>>> @Process upis (pid: 4388, threadinfo=d93f2000 task=d8d708d0)
>>> @Stack: 00000000 00000000 87163ef0 d64f1640 00000000 00000001 0000ffff 00000000
>>> @ 00200020 00000000 36fc0002 00000000 00000000 00000000 00000000 00000000
>>> @ 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>>> @Call Trace:
>>> @ [<c0103fad>] show_stack+0x7a/0x90
>>> @ [<c010412b>] show_registers+0x14f/0x1c7
>>> @ [<c010516b>] die+0x11a/0x195
>>> @ [<c042b54b>] do_page_fault+0xa77/0x360d
>>> @ [<c0103cdb>] error_code+0x4f/0x54
>>> @ [<f89f43f5>] sctp_getsockopt+0x1ef/0x2a5 [sctp]
>>> @ [<c03aabb7>] sock_common_getsockopt+0x22/0x2c
>>> @ [<c03a7f6b>] sys_getsockopt+0x49/0x82
>>> @ [<c03a8e22>] sys_socketcall+0xa5a/0xa9b
>>> @ [<c04239c4>] no_syscall_entry_trace+0xb/0xf
>>> @Code: ff ff ff 0f b7 86 9e 00 00 00 66 89 85 54 ff ff ff 0f b7 86 9c
>>> 00 00 00 66 89 85 56 ff ff ff 8b 86 8c 13 00 00 89 85 58 ff ff
>>> ff <8b> 42 30 31 d2
>>> 85 c0 74 03 8b 50 7c 8b b5 38 ff ff ff 89 95 5c
>>> @ idr_remove called for id=700 which is not allocated.
>>> @ [<c0103fda>] dump_stack+0x17/0x19
>>> @ [<c029a158>] idr_remove_warning+0x1b/0x1d
>>> @ [<c029a241>] sub_remove+0xe7/0xe9
>>> @ [<c029a266>] idr_remove+0x23/0x87
>>> @ [<f89e8be1>] sctp_association_destroy+0x64/0xa3 [sctp]
>>> @ [<f89e9101>] sctp_association_put+0x19/0x1b [sctp]
>>> @ [<f89e9377>] sctp_assoc_bh_rcv+0xd1/0x105 [sctp]
>>> @ [<f89ed9ce>] sctp_inq_push+0x18/0x1a [sctp]
>>> @ [<f89f6660>] sctp_backlog_rcv+0x11/0x15 [sctp]
>>> @ [<c03aa40b>] __release_sock+0x47/0x6a
>>> @ [<c03aaac8>] release_sock+0x55/0x90
>>> @ [<f89f170d>] sctp_close+0xa6/0x111 [sctp]
>>> @ [<c03efd50>] inet_release+0x37/0x5b
>>> @ [<c03a4ac7>] sock_release+0x4c/0x9f
>>> @ [<c03a6a54>] sock_close+0x21/0x3d
>>> @ [<c017a4cf>] __fput+0x147/0x172
>>> @ [<c017a386>] fput+0x19/0x1b
>>> @ [<c01731fd>] filp_close+0x3c/0x75
>>> @ [<c0173589>] sys_close+0x353/0x7a9
>>> @ [<c04239c4>] no_syscall_entry_trace+0xb/0xf
>>> @KERNEL: assertion (!atomic_read(&sk->sk_wmem_alloc)) failed at
>>> net/ipv4/af_inet.c (146)
>>>
>>>
>>>
>>> 2) Below is another oops flavor we've seen logged to the serial port
>>> while running the 2.6.14 kernel (not patched in the network or sctp
>>> areas)
>>>
>>> atcafs-n0s6:~# Oops: 0000 [#1]
>>> SMP
>>> LTT NESTING LEVEL : 0
>>> Modules linked in: sctp ip_queue iptable_filter ip_tables bonding loop
>>> ohci_hcd i2c_i801 i2c_core ehci_hcd ipmi_watchdog ipmi_si ipmi_devintf
>>> ipmi_msghandler softdog video thermal processor fan button battery ac
>>> CPU: 1
>>> EIP: 0060:[<f89f2e6f>] Not tainted VLI
>>> EFLAGS: 00010282 (2.6.14.7-selinux1-WR1.4aq_cgl)
>>> EIP is at sctp_getsockopt_sctp_status+0x100/0x1de [sctp]
>>> eax: 00000000 ebx: 000000b0 ecx: 00000000 edx: 00000000
>>> esi: d8a7c000 edi: d9094940 ebp: d8cbae78 esp: d8cbadac
>>> ds: 007b es: 007b ss: 0068
>>> Process upis (pid: 4177, threadinfo=d8cba000 task=d8d53830)
>>> Stack: 00000000 00000000 87180ef0 d9094940 00000000 00000001 0000ffff 00000000
>>> 00200020 00000000 72ed0002 00000000 00000000 00000000 00000000 00000000
>>> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>>> Call Trace:
>>> [<c0103fad>] show_stack+0x7a/0x90
>>> [<c010412b>] show_registers+0x14f/0x1c7
>>> [<c010516b>] die+0x11a/0x195
>>> [<c042b54b>] do_page_fault+0xa77/0x360d
>>> [<c0103cdb>] error_code+0x4f/0x54
>>> [<f89f43f5>] sctp_getsockopt+0x1ef/0x2a5 [sctp]
>>> [<c03aabb7>] sock_common_getsockopt+0x22/0x2c
>>> [<c03a7f6b>] sys_getsockopt+0x49/0x82
>>> [<c03a8e22>] sys_socketcall+0xa5a/0xa9b
>>> [<c04239c4>] no_syscall_entry_trace+0xb/0xf
>>> Code: ff ff ff 0f b7 86 9e 00 00 00 66 89 85 54 ff ff ff 0f b7 86 9c
>>> 00 00 00 66 89 85 56 ff ff ff 8b 86 8c 13 00 00 89 85 58 ff ff
>>> ff <8b> 42 30 31 d2 85 c0 74
>>> 03 8b 50 7c 8b b5 38 ff ff ff 89 95 5c
>>>
>>> atcafs-n0s6:~#
>>>
>>>
>>>
>>>
>>> 3) After porting the SCTP parts of the ftp.kernel.org official patches up
>>> to the 2.6.23 release to our kernel, we now get these oops...
>>>
>>>
>>> The following oops did not lock up the computer but did stop the
>>> computer from accepting SCTP associations (every association attempt
>>> from a client was answered with an ABORT).
>>>
>>>
>>> atcafs-n0s5:~# Oops: 0000 [#1]
>>> SMP
>>> LTT NESTING LEVEL : 0
>>> Modules linked in: sctp ip_queue iptable_filter ip_tables bonding loop ohci_hcdc
>>> CPU: 1
>>> EIP: 0060:[<f89f8c8c>] Not tainted VLI
>>> EFLAGS: 00010246 (2.6.14.7-selinux1-WR1.4aq_cgl)
>>> EIP is at sctp_getsockopt_sctp_status+0xe8/0x1f7 [sctp]
>>> eax: 00000000 ebx: 00000000 ecx: 00000000 edx: 00000000
>>> esi: d81d0000 edi: d70cb700 ebp: d896fe78 esp: d896fdac
>>> ds: 007b es: 007b ss: 0068
>>> Process upis (pid: 22955, threadinfo=d896f000 task=d8cab5b0)
>>> Stack: 00000000 870f3ed0 000000b0 d70cb700 00000000 00000001 0000ffff 00000000
>>> 00200020 00000000 60450002 00000000 00000000 00000000 00000000 00000000
>>> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>>> Call Trace:
>>> [<c0103fad>] show_stack+0x7a/0x90
>>> [<c010412b>] show_registers+0x14f/0x1c7
>>> [<c010516b>] die+0x11a/0x195
>>> [<c042b54b>] do_page_fault+0xa77/0x360d
>>> [<c0103cdb>] error_code+0x4f/0x54
>>> [<f89fa5af>] sctp_getsockopt+0x1ef/0x322 [sctp]
>>> [<c03aabb7>] sock_common_getsockopt+0x22/0x2c
>>> [<c03a7f6b>] sys_getsockopt+0x49/0x82
>>> [<c03a8e22>] sys_socketcall+0xa5a/0xa9b
>>> [<c04239c4>] no_syscall_entry_trace+0xb/0xf
>>> Code: ff ff ff 0f b7 86 9a 00 00 00 66 89 85 54 ff ff ff 0f b7 86 98 00 00 00 6
>>>
>>>
>>>
>>> 4) Then, after the above oops has happened, issuing 'cat
>>> /proc/net/sctp/assocs' on this same computer locks it up after it
>>> dumps the following oops to the serial port.
>>>
>>>
>>>
>>> atcafs-n0s5:~# cat /proc/net/sctp/assocs
>>>
>>>
>>> ...then the lock up...
>>>
>>>
>>> ASSOC SOCK <1>Unable to handle kernel NULL pointer dereference STY SST STc
>>> printing eip:
>>> T ASSOC-ID TX_QUf89ec250
>>> *pde = 00000000
>>> EUE RX_QUEUE UIDOops: 0000 [#2]
>>> SMP
>>> LTT NESTING LEVEL : 0
>>> Modules linked in: sctp ip_queue iptable_filter ip_tables bonding loop ohci_hcdc
>>> CPU: 3
>>> INODE LPORT RPOEIP: 0060:[<f89ec250>] Not tainted VLI
>>> EFLAGS: 00010206 (2.6.14.7-selinux1-WR1.4aq_cgl)
>>> RT LADDRS <-> RAEIP is at sctp_v4_cmp_addr+0x3/0x2f [sctp]
>>> eax: d7824c10 ebx: d7824c00 ecx: d7824c10 edx: 0000005c
>>> DDRS
>>> d8c36000 desi: f8a0a680 edi: d7824c10 ebp: d810bdb0 esp: d810bd88
>>> ds: 007b es: 007b ss: 0068
>>> Process cat (pid: 15602, threadinfo=d810b000 task=d6d0b450)
>>> Stack: d810bdb0 f89fd7ea d78f61ae d6e9ea00 d81d0064 0000005c d6e9ea00 d70cb700
>>> 00a7f18d d81d0000 d810be0c f89fdc58 d6e9ea00 f8a002e7 d81d0000 d70cb700
>>> 00000002 00000001 00000001 0000f425 00000000 00000000 00000000 00000000
>>> Call Trace:
>>> [<c0103fad>] show_stack+0x7a/0x908d83700 2 10
>>> [<c010412b>] show_registers+0x14f/0x1c7
>>> [<c010516b>] die+0x11a/0x195
>>> [<c042b54b>] do_page_fault+0xa77/0x360d
>>> [<c0103cdb>] error_code+0x4f/0x54
>>> [<f89fdc58>] 1 6499 1523 sctp_assocs_seq_show+0xf5/0x146 [sctp]
>>> [<c019a8a2>] seq_read+0x1f8/0x28e 0 229
>>> [<c0175187>] vfs_read+0xc4/0x169 0 96474 140
>>> [<c0175812>] sys_read+0x371/0x132c
>>> [<c04239c4>] no_syscall_entry_trace+0xb/0xf
>>> Code: 00 08 8b 40 04 89 e5 5d 89 42 04 b8 08 00 00 00 c3 55 66 c7 00 02 00 66 8
>>> 01 43211 *10.6.<0>Kernel panic - not syncing: Fatal exception in interrupt
>>> 48.5 <-> *62.11.
>>>
>>>
>>> Thanks,
>>> Mike Allport
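
A minimal sketch of the step Vlad describes in the thread, appending each
hash after the 'h=' on the URL line to get the diff. The helper name
commitdiff_urls is hypothetical, and the list swaps in ea2bc483... for the
first hash, which the follow-up reply says was wrong. Whether this old
gitweb URL still resolves is not confirmed here.

```python
# Base gitweb URL exactly as quoted in the thread; it ends at "h=".
BASE_URL = (
    "http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;"
    "a=commitdiff;h="
)

# Commit hashes taken from the thread. The first entry is the corrected
# hash from the follow-up reply; it replaces 1bc4ee40..., which was
# reported as wrong and not applicable.
COMMITS = [
    "ea2bc483ff5caada7c4aa0d5fbf87d3a6590273d",
    "ae53b5bd77719fed58086c5be60ce4f22bffe1c6",
    "027f6e1ad32de32f9fe1c61d0f744e329e8acfd9",
    "cfdeef3282705a4b872d3559c4e7d2561251363c",
    "61c9fed41638249f8b6ca5345064eb1beb50179f",
]


def commitdiff_urls(hashes, base=BASE_URL):
    """Return one commitdiff URL per hash by appending it after 'h='."""
    return [base + h for h in hashes]


if __name__ == "__main__":
    # Print the URLs so each diff can be opened and compared by hand
    # against the local 2.6.14 SCTP sources.
    for url in commitdiff_urls(COMMITS):
        print(url)
```

Each printed URL shows one change; comparing it against the local tree is
still a manual step, since the thread does not say how the 2.6.14 sources
are tracked.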