Hello,

I am experiencing a rather big problem with deadlocks on a 9-node GFS1 cluster, with vanilla 2.6.27 and both rhcs 2.03.09 and the latest git stable2. Fencing is done via fabric; the node keeps throwing these errors after it got fenced.

This is a rather busy webserver cluster, usually with some dozens to hundreds of apache processes running concurrently, and 4 gfs1 shares with lots of small writes on the "template cache" volume from all 9 nodes.

Lockups look like this:

[44955.425003] BUG: soft lockup - CPU#2 stuck for 61s! [apache:12639]
[44955.425007] Modules linked in: gfs ac battery ipv6 iptable_filter xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables lock_dlm gfs2 dlm configfs snd_pcm snd_timer snd soundcore snd_page_alloc rtc_cmos rtc_core i2c_nforce2 k8temp shpchp rtc_lib pcspkr pci_hotplug i2c_core button evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom amd74xx sd_mod ide_pci_generic ide_core floppy qla2xxx scsi_transport_fc 3w_9xxx e1000e scsi_tgt ata_generic sata_nv forcedeth libata ehci_hcd scsi_mod dock ohci_hcd thermal processor fan thermal_sys
[44955.425007] irq event stamp: 0
[44955.425007] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[44955.425007] hardirqs last disabled at (0): [<ffffffff8023d7df>] copy_process+0x543/0x12b4
[44955.425007] softirqs last enabled at (0): [<ffffffff8023d7df>] copy_process+0x543/0x12b4
[44955.425007] softirqs last disabled at (0): [<0000000000000000>] 0x0
[44955.425007] CPU 2:
[44955.425007] Modules linked in: gfs ac battery ipv6 iptable_filter xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables lock_dlm gfs2 dlm configfs snd_pcm snd_timer snd soundcore snd_page_alloc rtc_cmos rtc_core i2c_nforce2 k8temp shpchp rtc_lib pcspkr pci_hotplug i2c_core button evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod ide_cd_mod cdrom amd74xx sd_mod ide_pci_generic ide_core floppy qla2xxx scsi_transport_fc 3w_9xxx e1000e scsi_tgt ata_generic sata_nv forcedeth libata ehci_hcd scsi_mod dock ohci_hcd thermal processor fan thermal_sys
[44955.425007] Pid: 12639, comm: apache Not tainted 2.6.27-2-amd64 #1
[44955.425007] RIP: 0010:[<ffffffff8021759b>] [<ffffffff8021759b>] native_read_tsc+0x6/0x18
[44955.425007] RSP: 0018:ffff880214af9d80 EFLAGS: 00000202
[44955.425007] RAX: 0000000000000000 RBX: 00000000498fb129 RCX: ffffffff8085d300
[44955.425007] RDX: 000062bb00000000 RSI: 0000000001062560 RDI: 0000000000000001
[44955.425007] RBP: 0000000000000002 R08: 0000000000000002 R09: 0000000000000000
[44955.425007] R10: 0000000000000000 R11: ffffffff8033dd3e R12: ffff88041f0b0000
[44955.425007] R13: ffff8802abb76000 R14: ffff880214af8000 R15: ffffffff8085a890
[44955.425007] FS: 00007f3e8ea7d6d0(0000) GS:ffff88041f0c9940(0000) knlGS:0000000000000000
[44955.425007] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[44955.425007] CR2: 00007f3e8e9fc000 CR3: 0000000214adf000 CR4: 00000000000006e0
[44955.425007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[44955.425007] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[44955.425007]
[44955.425007] Call Trace:
[44955.425007] [<ffffffff8033dd53>] ? delay_tsc+0x15/0x45
[44955.425007] [<ffffffff80341333>] ? _raw_spin_lock+0x98/0x100
[44955.425007] [<ffffffff8045b3ce>] ? _spin_lock+0x4e/0x5a
[44955.425007] [<ffffffff802c47dd>] ? igrab+0x10/0x36
[44955.425007] [<ffffffff802c47dd>] ? igrab+0x10/0x36
[44955.425007] [<ffffffffa0394971>] ? gfs_getattr+0x83/0xb7 [gfs]
[44955.425007] [<ffffffff802b5846>] ? vfs_getattr+0x1a/0x5e
[44955.425007] [<ffffffff802b59f6>] ? vfs_stat_fd+0x2f/0x43
[44955.425007] [<ffffffff802b5a66>] ? sys_newstat+0x19/0x31
[44955.425007] [<ffffffff8020ff7a>] ? system_call_fastpath+0x16/0x1b

Best regards
Frederik Schüler

--
ENOSIG
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster