Hello, I'm looking for some pointers and advice. Is anybody around here using the Cluster Suite and GFS on Ubuntu? If so, which versions? Ubuntu packages or compiled from source? I'm using the packages from Ubuntu Hardy (not the latest release, but a Long Term Support edition of Ubuntu); the RHCS-related packages all have version "2.20080227". I'm having some "BUG: soft lockup - CPU#1 stuck for Xs" problems with GFS (version 1). Any strong recommendations to use the latest sources from generation 2 instead of the distro packages?

( --- if you feel like reading some logs, here is more --- )

It's a two-node cluster, using Brocade SAN switch fencing (no network power switch or IPMI available for power fencing); a rough sketch of the cluster.conf layout is in the P.S. at the end of this mail. So far I don't have any services (rgmanager) configured, just the nodes. Unfortunately, it's already in production use, so I'm very limited in what I can play with and test right now.

The lockup usually happens when I try to unmount GFS (by normally halting the system) on a node that has been fenced (the Brocade fence). But on Monday it happened when I tried to mount GFS after a clean boot, with the FC switch ports enabled and, AFAICT, a perfectly running cluster manager / fence domain up to that point.

Related logging (stripped down; I can send the full log if anybody wants to see it):

==============================

**** ==== On the machine where I did the mount, from that point on...

Apr 6 18:41:31 <node2> kernel: [ 9101.559599] Trying to join cluster "lock_dlm", "srvwebGFS:www"
Apr 6 18:41:31 <node2> kernel: [ 9101.559857] dlm: Using TCP for communications
Apr 6 18:41:31 <node2> kernel: [ 9101.564627] dlm: connecting to 1
Apr 6 18:41:31 <node2> kernel: [ 9101.618713] dlm: got connection from 1
Apr 6 18:41:31 <node2> kernel: [ 9101.618914] Joined cluster. Now mounting FS...
Apr 6 18:41:51 <node2> snmpd[5965]: Connection from UDP: [172.16.8.10]:53683
Apr 6 18:41:59 <node2> kernel: [ 9129.492405] GFS: fsid=srvwebGFS:www.1: jid=1: Trying to acquire journal lock...
Apr 6 18:41:59 <node2> kernel: [ 9129.492414] GFS: fsid=srvwebGFS:www.1: jid=1: Looking at journal...
Apr 6 18:41:59 <node2> kernel: [ 9129.554362] GFS: fsid=srvwebGFS:www.1: jid=1: Done
Apr 6 18:41:59 <node2> kernel: [ 9129.557633] GFS: fsid=srvwebGFS:www.1: Scanning for log elements...
Apr 6 18:42:15 <node2> kernel: [ 9140.874965] BUG: soft lockup - CPU#1 stuck for 11s! [mount.gfs:7188]
Apr 6 18:42:15 <node2> kernel: [ 9140.874997] CPU 1: (... long module listing ...)
Apr 6 18:42:15 <node2> kernel: [ 9140.875078] Pid: 7188, comm: mount.gfs Not tainted 2.6.24-23-server #1
Apr 6 18:42:15 <node2> kernel: [ 9140.875082] RIP: 0010:[gfs:gfs_unlinked_get+0x5c/0x190] [gfs:gfs_unlinked_get+0x5c/0x190] :gfs:gfs_unlinked_get+0x5c/0x190
(... registers stuff ...)
Apr 6 18:42:15 <node2> kernel: [ 9140.875147] Call Trace:
Apr 6 18:42:15 <node2> kernel: [ 9140.875171] [gfs:gfs_unlinked_get+0x35/0x190] :gfs:gfs_unlinked_get+0x35/0x190
Apr 6 18:42:15 <node2> kernel: [ 9140.875216] [gfs:gfs_unlinked_merge+0x47/0x1e0] :gfs:gfs_unlinked_merge+0x47/0x1e0
Apr 6 18:42:15 <node2> kernel: [ 9140.875252] [gfs:unlinked_scan_elements+0xc9/0x1b0] :gfs:unlinked_scan_elements+0xc9/0x1b0
Apr 6 18:42:15 <node2> kernel: [ 9140.875312] [gfs:foreach_descriptor+0x155/0x340] :gfs:foreach_descriptor+0x155/0x340
Apr 6 18:42:15 <node2> kernel: [ 9140.875455] [gfs:gfs_recover_dump+0x121/0x1d0] :gfs:gfs_recover_dump+0x121/0x1d0
Apr 6 18:42:15 <node2> kernel: [ 9140.875543] [gfs:gfs_make_fs_rw+0x114/0x150] :gfs:gfs_make_fs_rw+0x114/0x150
Apr 6 18:42:15 <node2> kernel: [ 9140.875657] [gfs:init_journal+0x3fe/0x4b0] :gfs:init_journal+0x3fe/0x4b0
Apr 6 18:42:15 <node2> kernel: [ 9140.875727] [gfs:fill_super+0x554/0x6a0] :gfs:fill_super+0x554/0x6a0
Apr 6 18:42:15 <node2> kernel: [ 9140.875767] [set_bdev_super+0x0/0x10] set_bdev_super+0x0/0x10
Apr 6 18:42:15 <node2> kernel: [ 9140.875791] [fuse:get_sb_bdev+0x14b/0x180] get_sb_bdev+0x14b/0x180
Apr 6 18:42:15 <node2> kernel: [ 9140.875817] [gfs:fill_super+0x0/0x6a0] :gfs:fill_super+0x0/0x6a0
Apr 6 18:42:15 <node2> kernel: [ 9140.875853] [vfs_kern_mount+0xcc/0x160] vfs_kern_mount+0xcc/0x160
Apr 6 18:42:15 <node2> kernel: [ 9140.875877] [do_kern_mount+0x53/0x110] do_kern_mount+0x53/0x110
Apr 6 18:42:15 <node2> kernel: [ 9140.875902] [do_mount+0x546/0x810] do_mount+0x546/0x810
Apr 6 18:42:15 <node2> kernel: [ 9140.875934] [n_tty_receive_buf+0x335/0xf30] n_tty_receive_buf+0x335/0xf30
Apr 6 18:42:15 <node2> kernel: [ 9140.875953] [n_tty_receive_buf+0x335/0xf30] n_tty_receive_buf+0x335/0xf30
Apr 6 18:42:15 <node2> kernel: [ 9140.875993] [zone_statistics+0x7d/0x80] zone_statistics+0x7d/0x80
Apr 6 18:42:15 <node2> kernel: [ 9140.876043] [tty_default_put_char+0x1d/0x30] tty_default_put_char+0x1d/0x30
Apr 6 18:42:15 <node2> kernel: [ 9140.876128] [configfs:__get_free_pages+0x1b/0x40] __get_free_pages+0x1b/0x40
Apr 6 18:42:15 <node2> kernel: [ 9140.876152] [sys_mount+0x9b/0x100] sys_mount+0x9b/0x100
Apr 6 18:42:15 <node2> kernel: [ 9140.876183] [system_call+0x7e/0x83] system_call+0x7e/0x83

(... more and more of CPUs 0-3 soft locking and stuck for x seconds, and then...)

Apr 6 18:43:39 <node2> kernel: [ 9229.653782] dlm: closing connection to node 1
Apr 6 18:43:39 <node2> openais[7119]: cman killed by node 1 because we rejoined the cluster without a full restart
Apr 6 18:43:39 <node2> groupd[7128]: cman_get_nodes error -1 104
Apr 6 18:43:39 <node2> gfs_controld[7162]: cluster is down, exiting
Apr 6 18:43:39 <node2> dlm_controld[7160]: cluster is down, exiting
Apr 6 18:43:39 <node2> kernel: [ 9229.897402] dlm: closing connection to node 2
Apr 6 18:43:41 <node2> fenced[7135]: Unable to connect to to cman: Connection refused
Apr 6 18:43:41 <node2> fenced[7135]: fence "<node1>" success
Apr 6 18:44:05 <node2> ccsd[7115]: Unable to connect to cluster infrastructure after 60 seconds.
Apr 6 18:44:14 <node2> kernel: [ 9264.560626] rport-4:0-0: blocked FC remote port time out: saving binding
Apr 6 18:44:14 <node2> kernel: [ 9264.560677] rport-4:0-1: blocked FC remote port time out: saving binding
Apr 6 18:44:14 <node2> kernel: [ 9264.574363] sd 4:0:1:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Apr 6 18:44:14 <node2> kernel: [ 9264.574371] end_request: I/O error, dev sdd, sector 157284528
Apr 6 18:44:14 <node2> kernel: [ 9264.574386] GFS: fsid=srvwebGFS:www.1: fatal: I/O error
Apr 6 18:44:14 <node2> kernel: [ 9264.574388] GFS: fsid=srvwebGFS:www.1: block = 19660562
Apr 6 18:44:14 <node2> kernel: [ 9264.574390] GFS: fsid=srvwebGFS:www.1: function = gfs_dreread
Apr 6 18:44:14 <node2> kernel: [ 9264.574391] GFS: fsid=srvwebGFS:www.1: file = /build/buildd/linux-ubuntu-modules-2.6.24-2.6.24/debian/build/build-server/fs/gfs/dio.c, line = 576
Apr 6 18:44:14 <node2> kernel: [ 9264.574394] GFS: fsid=srvwebGFS:www.1: time = 1239057854
Apr 6 18:44:14 <node2> kernel: [ 9264.574399] GFS: fsid=srvwebGFS:www.1: about to withdraw from the cluster
Apr 6 18:44:14 <node2> kernel: [ 9264.574401] GFS: fsid=srvwebGFS:www.1: telling LM to withdraw
Apr 6 18:44:35 <node2> ccsd[7115]: Unable to connect to cluster infrastructure after 90 seconds.
(...)
(power manually cut off)

**** ==== From the machine that had the GFS already mounted:

Apr 6 18:42:31 <node1> kernel: [246196.908621] dlm: closing connection to node 2
Apr 6 18:42:31 <node1> fenced[5611]: <node2> not a cluster member after 0 sec post_fail_delay
Apr 6 18:42:31 <node1> fenced[5611]: fencing node "<node2>"
Apr 6 18:42:33 <node1> fenced[5611]: fence "<node2>" success
Apr 6 18:42:33 <node1> kernel: [246198.585885] GFS: fsid=srvwebGFS:www.0: jid=1: Trying to acquire journal lock...
Apr 6 18:43:19 <node1> kernel: [246244.692100] GFS: fsid=srvwebGFS:www.0: jid=1: Looking at journal...
Apr 6 18:43:19 <node1> kernel: [246244.741659] GFS: fsid=srvwebGFS:www.0: jid=1: Done
Apr 6 18:43:40 <node1> gfs_controld[5615]: cpg_mcast_joined retry 100 MSG_PLOCK
Apr 6 18:43:40 <node1> gfs_controld[5615]: cpg_mcast_joined retry 200 MSG_PLOCK
Apr 6 18:43:40 <node1> gfs_controld[5615]: cpg_mcast_joined retry 300 MSG_PLOCK
Apr 6 18:43:40 <node1> gfs_controld[5615]: cpg_mcast_joined retry 400 MSG_PLOCK
(... several like these ...)
Apr 6 18:43:41 <node1> gfs_controld[5615]: cpg_mcast_joined retry 1300 MSG_PLOCK
Apr 6 18:43:41 <node1> kernel: [246266.801054] qla2xxx 0000:02:01.0: LOOP DOWN detected (2).
Apr 6 18:43:41 <node1> gfs_controld[5615]: cpg_mcast_joined retry 1400 MSG_PLOCK
(... again until ...)
Apr 6 18:43:54 <node1> gfs_controld[5615]: cpg_mcast_joined retry 14200 MSG_PLOCK
Apr 6 18:44:16 <node1> kernel: [246301.772188] rport-4:0-0: blocked FC remote port time out: saving binding
Apr 6 18:44:16 <node1> kernel: [246301.772236] rport-4:0-1: blocked FC remote port time out: saving binding
Apr 6 18:44:16 <node1> kernel: [246301.790639] sd 4:0:1:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Apr 6 18:44:16 <node1> kernel: [246301.790647] end_request: I/O error, dev sdd, sector 11453568
Apr 6 18:44:16 <node1> kernel: [246301.790667] GFS: fsid=srvwebGFS:www.0: fatal: I/O error
Apr 6 18:44:16 <node1> kernel: [246301.790672] GFS: fsid=srvwebGFS:www.0: block = 1431692
Apr 6 18:44:16 <node1> kernel: [246301.790676] GFS: fsid=srvwebGFS:www.0: function = gfs_dreread
Apr 6 18:44:16 <node1> kernel: [246301.790680] GFS: fsid=srvwebGFS:www.0: file = /build/buildd/linux-ubuntu-modules-2.6.24-2.6.24/debian/build/build-server/fs/gfs/dio.c, line = 576
Apr 6 18:44:16 <node1> kernel: [246301.790688] GFS: fsid=srvwebGFS:www.0: time = 1239057856
Apr 6 18:44:16 <node1> kernel: [246301.790694] sd 4:0:1:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Apr 6 18:44:16 <node1> kernel: [246301.790700] end_request: I/O error, dev sdd, sector 141890640
Apr 6 18:44:16 <node1> kernel: [246301.790705] GFS: fsid=srvwebGFS:www.0: about to withdraw from the cluster
Apr 6 18:44:16 <node1> kernel: [246301.790711] GFS: fsid=srvwebGFS:www.0: telling LM to withdraw
Apr 6 18:44:16 <node1> kernel: [246301.790718] sd 4:0:1:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Apr 6 18:44:16 <node1> kernel: [246301.790731] end_request: I/O error, dev sdd, sector 3611616
Apr 6 18:44:16 <node1> kernel: [246301.790751] sd 4:0:1:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Apr 6 18:44:16 <node1> kernel: [246301.790765] end_request: I/O error, dev sdd, sector 64424648
Apr 6 18:44:16 <node1> kernel: [246301.790774] Buffer I/O error on device sdd1, logical block 8053077
(...)
Apr 6 18:45:22 <node1> kernel: [246367.692933] dlm: www: group leave failed -512 0
Apr 6 18:45:22 <node1> kernel: [246368.160042] GFS: fsid=srvwebGFS:www.0: withdrawn
(...)
(power manually cut off)

==============================

So, both machines fenced, GFS down, people screaming :)

Up in the first log there is this line:

Apr 6 18:43:39 <node2> openais[7119]: cman killed by node 1 because we rejoined the cluster without a full restart

I wonder whether it is a consequence (it happened after the soft lockup) or the cause. It was a clean boot, and none of the cluster daemons came up before I manually started each one of them. *Noticing only now: I unfenced the switch port when the machine was already up (but cman was still down).* Could that be the culprit? (The rough order of events is sketched in the P.P.S. below.)

Thanks for any help or advice.

Diego Morales
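
P.S. In case the layout matters: a minimal sketch of how our cluster.conf is structured, assuming the standard fence_brocade agent. Node names, the switch address, ports and credentials below are placeholders, not the real values.

<?xml version="1.0"?>
<cluster name="srvwebGFS" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <method name="single">
          <!-- fencing = disabling this node's port on the FC switch -->
          <device name="fcswitch" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence>
        <method name="single">
          <device name="fcswitch" port="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- Brocade SAN switch used for fabric fencing; no power fencing available -->
    <fencedevice name="fcswitch" agent="fence_brocade" ipaddr="switch-address" login="admin" passwd="..."/>
  </fencedevices>
</cluster>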
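
P.P.S. To make the "could that be the culprit?" question concrete, this is roughly the order of events on <node2> on Monday. The commands are approximations of what was done by hand, and the device/mount point are just examples:

  # On the Brocade switch: the port was re-enabled only after <node2> had
  # already booted (no cluster daemons running on it yet), i.e. something like:
  #     portEnable <port>

  # Then, on <node2>, the stack was brought up manually:
  ccsd                              # config daemon
  cman_tool join                    # start cman/openais, node joins the cluster
  groupd
  fenced
  dlm_controld
  gfs_controld
  fence_tool join                   # join the fence domain
  cman_tool nodes                   # both nodes listed, everything looked fine
  mount -t gfs /dev/sdd1 /srv/www   # <-- the soft lockup hits here, during journal recovery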