Hi,

On Thu, 2010-09-16 at 14:43 -0600, Jeff Howell wrote:
> I'm having an identical problem.
>
> I have 2 nodes running a Wordpress instance with a TCP load balancer in
> front of them distributing http requests between them.
>
> In the last 2 days, I've had 10+ instances where the GFS2 volume hangs
> with:
>
> Sep 16 14:05:10 wordpress3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 16 14:05:10 wordpress3 kernel: delete_workqu D 00000272  2676  3687    19    3688  3686 (L-TLB)
> Sep 16 14:05:10 wordpress3 kernel:  f7839e38 00000046 3f1c322e 00000272 00000000 f57ab400 f7839df8 0000000a
> Sep 16 14:05:10 wordpress3 kernel:  c3217aa0 3f1dcca8 00000272 00019a7a 00000001 c3217bac c3019744 f57c5ac0
> Sep 16 14:05:10 wordpress3 kernel:  f8afa21c 00000003 f26162f0 00000000 f2213df8 00000018 c3019c00 f7839e6c
> Sep 16 14:05:10 wordpress3 kernel: Call Trace:
> Sep 16 14:05:10 wordpress3 kernel:  [<f8afa21c>] gdlm_bast+0x0/0x78 [lock_dlm]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c3910e>] just_schedule+0x5/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c061d2f5>] __wait_on_bit+0x33/0x58
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c061d37c>] out_of_line_wait_on_bit+0x62/0x6a
> Sep 16 14:05:10 wordpress3 kernel:  [<c0436098>] wake_bit_function+0x0/0x3c
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39102>] gfs2_glock_wait+0x27/0x2e [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c667>] gfs2_check_blk_type+0xbc/0x18c [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c061d312>] __wait_on_bit+0x50/0x58
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c39109>] just_schedule+0x0/0x8 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c660>] gfs2_check_blk_type+0xb5/0x18c [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4c3c8>] gfs2_rindex_hold+0x2b/0x148 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c48273>] gfs2_delete_inode+0x6f/0x1a1 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c4823b>] gfs2_delete_inode+0x37/0x1a1 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c48204>] gfs2_delete_inode+0x0/0x1a1 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c048cb02>] generic_delete_inode+0xa5/0x10f
> Sep 16 14:05:10 wordpress3 kernel:  [<c048c5a6>] iput+0x64/0x66
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c3a8bb>] delete_work_func+0x49/0x53 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c04332da>] run_workqueue+0x78/0xb5
> Sep 16 14:05:10 wordpress3 kernel:  [<f8c3a872>] delete_work_func+0x0/0x53 [gfs2]
> Sep 16 14:05:10 wordpress3 kernel:  [<c0433b8e>] worker_thread+0xd9/0x10b
> Sep 16 14:05:10 wordpress3 kernel:  [<c041f81b>] default_wake_function+0x0/0xc
> Sep 16 14:05:10 wordpress3 kernel:  [<c0433ab5>] worker_thread+0x0/0x10b
> Sep 16 14:05:10 wordpress3 kernel:  [<c0435fa7>] kthread+0xc0/0xed
> Sep 16 14:05:10 wordpress3 kernel:  [<c0435ee7>] kthread+0x0/0xed
> Sep 16 14:05:10 wordpress3 kernel:  [<c0405c53>] kernel_thread_helper+0x7/0x10
>
> And then a bunch more for the httpd processes. I can pretty much
> reproduce this consistently by untarring a large tarball on the volume.
> Seems like anything IO intensive is causing this behavior.
>
> Running CentOS 5.5 with kernel 2.6.18-194.11.1.el5 #1 SMP Tue Aug 10 19:09:06 EDT 2010 i686 i686 i386 GNU/Linux
>
> I tried the hangalizer program and it always came back with:
> /bin/ls: /gfs2/: No such file or directoryhb.medianewsgroup.com "/bin/ls /gfs2/"
> /bin/ls: /gfs2/: No such file or directoryhb.medianewsgroup.com "/bin/ls /gfs2/"
> No waiting glocks found on any node.
>
> Any Ideas?
>

Can you report this via our support team or, if you don't have a support
contract, at least via bugzilla, so that we have a record of the problem
and it doesn't get missed? That doesn't look at all right to me, so I'd
like to get to the bottom of what is going on here.
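In the meantime, it would help to capture the glock state from every node
while the filesystem is actually hung, which is roughly what the hangalizer
script presumably automates. A minimal sketch, assuming the debugfs glock
dump is available on this kernel and uses the usual "G:"/"H:" line format
where a 'W' in the holder flags marks a waiting request (the path and
format may differ, so treat this as illustrative only):

  # Make sure debugfs is mounted (harmless if it already is).
  mount -t debugfs none /sys/kernel/debug 2>/dev/null

  # On each node, while the hang is in progress, print any glock that
  # has a holder stuck in the wait state ('W' in the f: flags field).
  for f in /sys/kernel/debug/gfs2/*/glocks; do
      echo "== $f =="
      awk '/^G:/ { g = $0; shown = 0 }
           /^ *H:/ && / f:[^ ]*W/ { if (!shown) { print g; shown = 1 }; print }' "$f"
  done

Comparing the dumps from all the nodes should show which glock everything
is queued behind and which node, if any, is holding it; that would be very
useful information to attach to the bugzilla report.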
> On 08/03/2010 01:38 PM, Scooter Morris wrote:
> > HI all,
> >      We continue to have gfs2 crashes and hangs on our production
> > cluster, so I'm beginning to think that we've done something really
> > wrong.  Here is our set-up:
> >
> >     * 4 node cluster, only 3 participate in gfs2 filesystems
> >     * Running several services on multiple nodes using gfs2:
> >           o IMAP (dovecot)
> >           o Web (apache with lots of python)
> >           o Samba (using ctdb)
> >     * GFS2 partitions are multipathed on an HP EVA-based SAN (no LVM)
> >       -- here is fstab from one node (the three nodes are all the same):
> >
> > LABEL=/1                      /                      ext3    defaults        1 1
> > LABEL=/boot1                  /boot                  ext3    defaults        1 2
> > tmpfs                         /dev/shm               tmpfs   defaults        0 0
> > devpts                        /dev/pts               devpts  gid=5,mode=620  0 0
> > sysfs                         /sys                   sysfs   defaults        0 0
> > proc                          /proc                  proc    defaults        0 0
> > LABEL=SW-cciss/c0d0p2         swap                   swap    defaults        0 0
> > LABEL=plato:Mail              /var/spool/mail        gfs2    noatime,_netdev
> > LABEL=plato:VarTmp            /var/tmp               gfs2    _netdev
> > LABEL=plato:UsrLocal          /usr/local             gfs2    noatime,_netdev
> > LABEL=plato:UsrLocalProjects  /usr/local/projects    gfs2    noatime,_netdev
> > LABEL=plato:Home2             /home/socr             gfs2    noatime,_netdev
> > LABEL=plato:HomeNoBackup      /home/socr/nobackup    gfs2    _netdev
> > LABEL=plato:DbBackup          /databases/backups     gfs2    noatime,_netdev
> > LABEL=plato:DbMol             /databases/mol         gfs2    noatime,_netdev
> > LABEL=plato:MolDbBlast        /databases/mol/blast   gfs2    noatime,_netdev
> > LABEL=plato:MolDbEmboss       /databases/mol/emboss  gfs2    noatime,_netdev
> >
> >     * Kernel version is: 2.6.18-194.3.1.el5 and all nodes are x86_64.
> >     * What's happening is every so often, we start seeing gfs2-related
> >       task hangs in the logs.  In the last instance (last Friday)
> >       we've got this:
> >
> > Node 0:
> >
> > [2010-07-30 13:23:25]INFO: task imap:25716 blocked for more than 120 seconds.
> > [2010-07-30 13:23:25]"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [2010-07-30 13:23:25]imap          D ffff8100010825a0     0 25716   9217  24080 25667 (NOTLB)
> > [2010-07-30 13:23:25] ffff810619b59bc8 0000000000000086 ffff810113233f10 ffffffff00000000
> > [2010-07-30 13:23:26] ffff81000f8c5cd0 000000000000000a ffff810233416040 ffff81082fd05100
> > [2010-07-30 13:23:26] 00012196d153c88e 0000000000008b81 ffff810233416228 0000000f6a949180
> > [2010-07-30 13:23:26]Call Trace:
> > [2010-07-30 13:23:26] [<ffffffff887d0be6>] :gfs2:gfs2_dirent_find+0x0/0x4e
> > [2010-07-30 13:23:26] [<ffffffff887d0c18>] :gfs2:gfs2_dirent_find+0x32/0x4e
> > [2010-07-30 13:23:26] [<ffffffff887d5ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:23:26] [<ffffffff887d5ef0>] :gfs2:just_schedule+0x9/0xe
> > [2010-07-30 13:23:26] [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
> > [2010-07-30 13:23:26] [<ffffffff887d5ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:23:26] [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:23:26] [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
> > [2010-07-30 13:23:26] [<ffffffff887d5ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
> > [2010-07-30 13:23:26] [<ffffffff887e579e>] :gfs2:gfs2_permission+0x83/0xd5
> > [2010-07-30 13:23:26] [<ffffffff887e5796>] :gfs2:gfs2_permission+0x7b/0xd5
> > [2010-07-30 13:23:26] [<ffffffff8000ce97>] do_lookup+0x65/0x1e6
> > [2010-07-30 13:23:26] [<ffffffff8000d918>] permission+0x81/0xc8
> > [2010-07-30 13:23:26] [<ffffffff8000997f>] __link_path_walk+0x173/0xf42
> > [2010-07-30 13:23:26] [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
> > [2010-07-30 13:23:26] [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
> > [2010-07-30 13:23:26] [<ffffffff8001280e>] getname+0x15b/0x1c2
> > [2010-07-30 13:23:27] [<ffffffff80023876>] __user_walk_fd+0x37/0x4c
> > [2010-07-30 13:23:27] [<ffffffff80028846>] vfs_stat_fd+0x1b/0x4a
> > [2010-07-30 13:23:27] [<ffffffff800638b3>] schedule_timeout+0x92/0xad
> > [2010-07-30 13:23:27] [<ffffffff80097dab>] process_timeout+0x0/0x5
> > [2010-07-30 13:23:27] [<ffffffff800f8435>] sys_epoll_wait+0x3b8/0x3f9
> > [2010-07-30 13:23:27] [<ffffffff800235a8>] sys_newstat+0x19/0x31
> > [2010-07-30 13:23:27] [<ffffffff8005d229>] tracesys+0x71/0xe0
> > [2010-07-30 13:23:27] [<ffffffff8005d28d>] tracesys+0xd5/0xe0
> >
> > Node 1:
> >
> > [2010-07-30 13:23:59]INFO: task pdflush:623 blocked for more than 120 seconds.
> > [2010-07-30 13:23:59]"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [2010-07-30 13:23:59]pdflush       D ffff810407069aa0     0   623    291    624   622 (L-TLB)
> > [2010-07-30 13:23:59] ffff8106073c1bd0 0000000000000046 0000000000000001 ffff8103fea899a8
> > [2010-07-30 13:23:59] ffff8106073c1c30 000000000000000a ffff8105fff7c0c0 ffff8107fff4c820
> > [2010-07-30 13:24:00] 0000ed85d9d7a027 0000000000011b50 ffff8105fff7c2a8 00000006f0a9d0d0
> > [2010-07-30 13:24:00]Call Trace:
> > [2010-07-30 13:24:00] [<ffffffff8001a927>] submit_bh+0x10a/0x111
> > [2010-07-30 13:24:00] [<ffffffff88802ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:00] [<ffffffff88802ef0>] :gfs2:just_schedule+0x9/0xe
> > [2010-07-30 13:24:00] [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
> > [2010-07-30 13:24:00] [<ffffffff88802ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:00] [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:24:00] [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
> > [2010-07-30 13:24:00] [<ffffffff88802ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
> > [2010-07-30 13:24:00] [<ffffffff88813269>] :gfs2:gfs2_write_inode+0x5f/0x152
> > [2010-07-30 13:24:00] [<ffffffff88813261>] :gfs2:gfs2_write_inode+0x57/0x152
> > [2010-07-30 13:24:00] [<ffffffff8002fbf8>] __writeback_single_inode+0x1e9/0x328
> > [2010-07-30 13:24:00] [<ffffffff80020ec9>] sync_sb_inodes+0x1b5/0x26f
> > [2010-07-30 13:24:00] [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:00] [<ffffffff8005123a>] writeback_inodes+0x82/0xd8
> > [2010-07-30 13:24:00] [<ffffffff800c97b5>] wb_kupdate+0xd4/0x14e
> > [2010-07-30 13:24:00] [<ffffffff80056879>] pdflush+0x0/0x1fb
> > [2010-07-30 13:24:00] [<ffffffff800569ca>] pdflush+0x151/0x1fb
> > [2010-07-30 13:24:00] [<ffffffff800c96e1>] wb_kupdate+0x0/0x14e
> > [2010-07-30 13:24:01] [<ffffffff80032894>] kthread+0xfe/0x132
> > [2010-07-30 13:24:01] [<ffffffff8009d734>] request_module+0x0/0x14d
> > [2010-07-30 13:24:01] [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > [2010-07-30 13:24:01] [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:01] [<ffffffff80032796>] kthread+0x0/0x132
> > [2010-07-30 13:24:01] [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> > Node 2:
> >
> > [2010-07-30 13:24:46]INFO: task delete_workqueu:7175 blocked for more than 120 seconds.
> > [2010-07-30 13:24:46]"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [2010-07-30 13:24:46]delete_workqu D ffff81082b5cf860     0  7175    329   7176  7174 (L-TLB)
> > [2010-07-30 13:24:46] ffff81081ed6dbf0 0000000000000046 0000000000000018 ffffffff887a84f3
> > [2010-07-30 13:24:46] 0000000000000286 000000000000000a ffff81082dd477e0 ffff81082b5cf860
> > [2010-07-30 13:24:46] 00012166bf7ec21d 000000000002ed0b ffff81082dd479c8 00000007887a9e5a
> > [2010-07-30 13:24:46]Call Trace:
> > [2010-07-30 13:24:46] [<ffffffff887a84f3>] :dlm:request_lock+0x93/0xa0
> > [2010-07-30 13:24:47] [<ffffffff8884f556>] :lock_dlm:gdlm_ast+0x0/0x311
> > [2010-07-30 13:24:47] [<ffffffff8884f2c1>] :lock_dlm:gdlm_bast+0x0/0x8d
> > [2010-07-30 13:24:47] [<ffffffff887d3ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:47] [<ffffffff887d3ef0>] :gfs2:just_schedule+0x9/0xe
> > [2010-07-30 13:24:47] [<ffffffff80063a16>] __wait_on_bit+0x40/0x6e
> > [2010-07-30 13:24:47] [<ffffffff887d3ee7>] :gfs2:just_schedule+0x0/0xe
> > [2010-07-30 13:24:47] [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:24:47] [<ffffffff800a0aec>] wake_bit_function+0x0/0x23
> > [2010-07-30 13:24:47] [<ffffffff887d3ee2>] :gfs2:gfs2_glock_wait+0x2b/0x30
> > [2010-07-30 13:24:47] [<ffffffff887e82cf>] :gfs2:gfs2_check_blk_type+0xd7/0x1c9
> > [2010-07-30 13:24:47] [<ffffffff887e82c7>] :gfs2:gfs2_check_blk_type+0xcf/0x1c9
> > [2010-07-30 13:24:47] [<ffffffff80063ab0>] out_of_line_wait_on_bit+0x6c/0x78
> > [2010-07-30 13:24:47] [<ffffffff887e804f>] :gfs2:gfs2_rindex_hold+0x32/0x12b
> > [2010-07-30 13:24:47] [<ffffffff887d5a29>] :gfs2:delete_work_func+0x0/0x65
> > [2010-07-30 13:24:47] [<ffffffff887d5a29>] :gfs2:delete_work_func+0x0/0x65
> > [2010-07-30 13:24:47] [<ffffffff887e3e3a>] :gfs2:gfs2_delete_inode+0x76/0x1b4
> > [2010-07-30 13:24:47] [<ffffffff887e3e01>] :gfs2:gfs2_delete_inode+0x3d/0x1b4
> > [2010-07-30 13:24:47] [<ffffffff8000d3ba>] dput+0x2c/0x114
> > [2010-07-30 13:24:48] [<ffffffff887e3dc4>] :gfs2:gfs2_delete_inode+0x0/0x1b4
> > [2010-07-30 13:24:48] [<ffffffff8002f35e>] generic_delete_inode+0xc6/0x143
> > [2010-07-30 13:24:48] [<ffffffff887d5a83>] :gfs2:delete_work_func+0x5a/0x65
> > [2010-07-30 13:24:48] [<ffffffff8004d8f0>] run_workqueue+0x94/0xe4
> > [2010-07-30 13:24:48] [<ffffffff8004a12b>] worker_thread+0x0/0x122
> > [2010-07-30 13:24:48] [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48] [<ffffffff8004a21b>] worker_thread+0xf0/0x122
> > [2010-07-30 13:24:48] [<ffffffff8008d087>] default_wake_function+0x0/0xe
> > [2010-07-30 13:24:48] [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48] [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48] [<ffffffff80032894>] kthread+0xfe/0x132
> > [2010-07-30 13:24:48] [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > [2010-07-30 13:24:48] [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
> > [2010-07-30 13:24:48] [<ffffffff80032796>] kthread+0x0/0x132
> > [2010-07-30 13:24:48] [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> >     * Various messages related to hung_task_timeouts repeated on each
> >       node (usually related to imap).
> >     * Within a minute or two, the cluster was completely hung.  Root
> >       could log into the console, but commands (like dmesg) would just
> >       hang.
> >
> > So, my major question: is there something wrong with my
> > configuration?  Have we done something really stupid?  The initial
> > response from RedHat was that we shouldn't run services on multiple
> > nodes that access gfs2, which seems a little confusing since we would
> > use ext3 or ext4 if we were going to node lock (or failover) the
> > partitions.  Have we missed something somewhere?

That doesn't sound quite right... our guidance is not to run NFS and Samba
together on the same GFS2 directory tree, or to combine either of them with
local applications on the same tree. Otherwise there shouldn't be any issue
with running multiple applications on the same GFS2 tree/mount.

Steve.

> > Thanks in advance for any help anyone can give.  We're getting pretty
> > desperate here since the downtime is starting to have a significant
> > impact on our credibility.
> >
> > -- scooter

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster