Dear All,

we are facing a problem in our computer room. We have 6 servers that act as bricks for GlusterFS, configured in the following way:

OS: CentOS 6.2 x86_64
Kernel: 2.6.32-220.4.2.el6.x86_64
Gluster RPM packages:
  glusterfs-core-3.2.5-2.el6.x86_64
  glusterfs-rdma-3.2.5-2.el6.x86_64
  glusterfs-geo-replication-3.2.5-2.el6.x86_64
  glusterfs-fuse-3.2.5-2.el6.x86_64

Each server contributes an XFS filesystem to the global volume; the transport mechanism is RDMA:

  gluster volume create HPC_data transport rdma pleiades01:/data pleiades02:/data pleiades03:/data pleiades04:/data pleiades05:/data pleiades06:/data

Each server mounts the volume, using the FUSE driver, on a dedicated mount point according to the following fstab entry:

  pleiades01:/HPC_data /HPCdata glusterfs defaults,_netdev 0 0

We are running MongoDB on top of the Gluster volume for performance testing, and throughput is definitely high. Unfortunately, shortly after we start a large mongoimport job, the GlusterFS volume hangs completely and becomes inaccessible from every node. After some time the following error is logged:

Mar 8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
Mar 8 08:16:03 pleiades03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 8 08:16:03 pleiades03 kernel: mongod D 0000000000000007 0 5508 1 0x00000000
Mar 8 08:16:03 pleiades03 kernel: ffff881709b95de8 0000000000000086 0000000000000000 0000000000000008
Mar 8 08:16:03 pleiades03 kernel: ffff881709b95d68 ffffffff81090a7f ffff8816b6974cc0 0000000000000000
Mar 8 08:16:03 pleiades03 kernel: ffff8817fdd81af8 ffff881709b95fd8 000000000000f4e8 ffff8817fdd81af8
Mar 8 08:16:03 pleiades03 kernel: Call Trace:
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090d7e>] ? prepare_to_wait+0x4e/0x80
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Any attempt to access the volume from any node is fruitless until the mongod process is killed; a session accessing the /HPCdata path freezes. In any case, a complete stop (force) and start of the volume is needed to bring it back into operation. The situation can be reproduced at will.

Is there anybody able to help us? Could we collect more information to help diagnose the problem?

Thanks a lot
Alessio
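P.S. In case it is useful, here is a rough sketch of the state we could collect on an affected node before killing mongod and attach to a follow-up mail. The output directory name is arbitrary, the log path assumes the default GlusterFS location, and the `command -v` guard is only there so the script degrades gracefully on a box without the gluster CLI:

```shell
#!/bin/sh
# Sketch: gather diagnostic state from one node while the volume is hung.
# Assumes default /var/log/glusterfs log location; adjust as needed.
OUT=/tmp/gluster-diag-$(date +%Y%m%d%H%M%S)
mkdir -p "$OUT"

# Client and brick logs often show the last operation before the hang.
cp /var/log/glusterfs/*.log "$OUT"/ 2>/dev/null

# Volume definition as the CLI sees it (guarded: gluster may be absent).
if command -v gluster >/dev/null 2>&1; then
    gluster volume info HPC_data > "$OUT/volume-info.txt" 2>&1
fi

# Kernel-side view of the blocked task (mirrors the hung-task message above).
dmesg 2>/dev/null | tail -n 100 > "$OUT/dmesg-tail.txt" || true

echo "collected into $OUT"
```

We would run this on the node where the hung-task message appears, and on at least one other brick for comparison.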