Hi,

On Mon, Jul 17, 2017 at 05:27:16PM +0000, BOITEUX, Frederic wrote:
> Hello,
>
> I have a problem concerning sctp i would like to submit you : on a Debian 8.0 server with 3.16.0 Linux kernel, using SCTP , we observe a soft lockup in sctp_assoc_update_retran_path :
>
> [ 724.633312] BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> [ 724.633345] Modules linked in: hmac nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc vmw_vsock_vmci_transport vsock vmwgfx ttm drm_kms_helper drm vmw_balloon coretemp ppdev evdev i2c_piix4 serio_raw pcspkr crc32_pclmul i2c_core aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd vmw_vmci battery parport_pc parport shpchp processor thermal_sys ac button sctp libcrc32c crc32c_generic loop kkcore(O) autofs4 ext4 crc16 mbcache jbd2 dm_mod sr_mod cdrom sg ata_generic sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel psmouse ata_piix libata vmw_pvscsi scsi_mod vmxnet3
> [ 724.633376] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.16.0-4-amd64 #1 Debian 3.16.39-1+deb8u2
> [ 724.633377] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
> [ 724.633379] task: ffffffff8181a460 ti: ffffffff81800000 task.ti: ffffffff81800000
> [ 724.633380] RIP: 0010:[<ffffffffa01d1c96>] [<ffffffffa01d1c96>] sctp_assoc_update_retran_path+0x56/0xc0 [sctp] <<====
> [ 724.633388] RSP: 0018:ffff88023fc03c68 EFLAGS: 00000293
> [ 724.633389] RAX: ffff8800ba2ae400 RBX: 0000000000000000 RCX: 00000094d3160b1c
> [ 724.633390] RDX: 0000000000000001 RSI: ffff8800ba2ae400 RDI: ffff8800ba2ae400
> [ 724.633391] RBP: ffff8800bb14c128 R08: ffffffff81610640 R09: 0000000000000001
> [ 724.633391] R10: 0000000000000003 R11: 0000000000000010 R12: ffff88023fc03bd8

Can you please include the rest here? Especially the call trace. It
will be helpful to know how sctp_assoc_update_retran_path got called.

>
> It's similar to Redhat bug (https://access.redhat.com/solutions/2039183) but our kernel already have the fix for this problem. We hadn't the latest Debian kernel version, but carefully looking at its changelog, we don't see potential fix available.
>
> As in the Redhat bug report, we also use SCTP with multiple multi-homed endpoints, and are facing this bug during transient global network failure.
>
> In the sctp_assoc_update_retran_path(), we noted this loop :
>
>         /* Iterate from retran_path's successor back to retran_path. */
>         for (trans = list_next_entry(trans, transports); 1;
>              trans = list_next_entry(trans, transports)) {
>                 /* Manually skip the head element. */
>                 if (&trans->transports == &asoc->peer.transport_addr_list)
>                         continue;
>                 if (trans->state == SCTP_UNCONFIRMED)
>                         continue;
>                 trans_next = sctp_trans_elect_best(trans, trans_next);
>                 /* Active is good enough for immediate return. */
>                 if (trans_next->state == SCTP_ACTIVE)
>                         break;
>                 /* We've reached the end, time to update path. */
>                 if (trans == asoc->peer.retran_path)
>                         break;
>         }
>
> We wonder if the lockup could occur if an association have multiple distant peers, all in UNCONFIRMED state ? Because in this case, the 'continue' statement prevent to reach the last test which break the loop, no ?

It would seem so, yes, but maybe some other check could have avoided
such a situation. Nevertheless, that for() can probably be rewritten
using list_for_each_entry_continue(), which would be safer: it would
move the last if() into the for condition and avoid infinite loops.

> We can't at now reproduce the problem in a deterministic way, limiting debug, but we would appreciate a lot your expert point of view about this problem.

The call trace may give us more hints on that.

> With regards,
>     Frédéric Boiteux.

--
To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
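
To make the suspected failure mode concrete, here is a small user-space model of the quoted loop. It is only a sketch, not kernel code: struct node, link_ring(), pick_better(), original_loop_steps() and elect_bounded() are names made up for this illustration, the list is a plain circular singly linked ring rather than a struct list_head, and pick_better() is merely a stand-in for sctp_trans_elect_best(), whose real selection rules are not reproduced here. The first walker keeps the control flow of the quoted loop but adds a step budget so the non-terminating case can be observed; the second checks the "back at retran_path" condition on every iteration, which is one possible way to bound the walk.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the transport states (assumption: only the
 * relative order UNCONFIRMED < INACTIVE < ACTIVE matters here). */
enum state { UNCONFIRMED, INACTIVE, ACTIVE };

/* One element of a circular list. The element with is_head set plays
 * the role of the list head (asoc->peer.transport_addr_list). */
struct node {
    struct node *next;
    enum state state;
    int is_head;
};

/* Link nodes[0..n-1] into a ring; nodes[0] becomes the head sentinel. */
static void link_ring(struct node *nodes, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        nodes[i].next = &nodes[(i + 1) % n];
        nodes[i].is_head = (i == 0);
    }
}

/* Hypothetical stand-in for sctp_trans_elect_best(): keep whichever
 * candidate has the higher state; best may be NULL on the first call. */
static struct node *pick_better(struct node *cand, struct node *best)
{
    if (best == NULL || cand->state > best->state)
        return cand;
    return best;
}

/* Same control flow as the quoted loop, plus a step budget.
 * Returns the number of elements visited, or -1 if the budget ran out. */
static int original_loop_steps(struct node *retran_path, int budget)
{
    struct node *trans = retran_path, *best = NULL;
    int steps = 0;

    for (trans = trans->next; 1; trans = trans->next) {
        if (++steps > budget)
            return -1;
        if (trans->is_head)
            continue;
        if (trans->state == UNCONFIRMED)
            continue;            /* skips the exit test below */
        best = pick_better(trans, best);
        if (best->state == ACTIVE)
            break;
        if (trans == retran_path)
            break;
    }
    return steps;
}

/* Bounded variant: the "back at retran_path" test is the loop
 * condition, so skipped elements cannot bypass it.
 * Returns NULL when no confirmed element exists. */
static struct node *elect_bounded(struct node *retran_path)
{
    struct node *trans = retran_path, *best = NULL;

    do {
        trans = trans->next;
        if (!trans->is_head && trans->state != UNCONFIRMED) {
            best = pick_better(trans, best);
            if (best->state == ACTIVE)
                break;
        }
    } while (trans != retran_path);
    return best;
}
```

In this model, a ring whose elements are all UNCONFIRMED makes original_loop_steps() exhaust any budget, while elect_bounded() walks the ring once and returns NULL. Whether the kernel can actually reach that state is a separate question, and the call trace is still needed to answer it.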