Is this a new setup, or did it work before? How are the CPU, memory, etc.?
Also, what do you see on the gluster nodes?

On Wed, Mar 14, 2012 at 7:33 PM, Alessio Checcucci
<alessio.checcucci at gmail.com> wrote:

> Dear All,
> we are facing a problem in our computer room. We have 6 servers that act
> as bricks for GlusterFS; the servers are configured in the following way:
>
> OS: CentOS 6.2 x86_64
> Kernel: 2.6.32-220.4.2.el6.x86_64
>
> Gluster RPM packages:
> glusterfs-core-3.2.5-2.el6.x86_64
> glusterfs-rdma-3.2.5-2.el6.x86_64
> glusterfs-geo-replication-3.2.5-2.el6.x86_64
> glusterfs-fuse-3.2.5-2.el6.x86_64
>
> Each one contributes an XFS filesystem to the global volume; the
> transport mechanism is RDMA:
>
> gluster volume create HPC_data transport rdma pleiades01:/data
> pleiades02:/data pleiades03:/data pleiades04:/data pleiades05:/data
> pleiades06:/data
>
> Each server mounts the volume, using the FUSE driver, on a dedicated
> mount point according to the following fstab entry:
>
> pleiades01:/HPC_data  /HPCdata  glusterfs  defaults,_netdev  0 0
>
> We are running MongoDB on top of the Gluster volume for performance
> testing, and speed is definitely high. Unfortunately, when we run a large
> mongoimport job, the GlusterFS volume hangs completely a short time after
> the job starts and becomes inaccessible from any node. The following error
> is logged after some time in /var/log/messages:
>
> Mar 8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
> Mar 8 08:16:03 pleiades03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar 8 08:16:03 pleiades03 kernel: mongod        D 0000000000000007     0  5508      1 0x00000000
> Mar 8 08:16:03 pleiades03 kernel: ffff881709b95de8 0000000000000086 0000000000000000 0000000000000008
> Mar 8 08:16:03 pleiades03 kernel: ffff881709b95d68 ffffffff81090a7f ffff8816b6974cc0 0000000000000000
> Mar 8 08:16:03 pleiades03 kernel: ffff8817fdd81af8 ffff881709b95fd8 000000000000f4e8 ffff8817fdd81af8
> Mar 8 08:16:03 pleiades03 kernel: Call Trace:
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090d7e>] ? prepare_to_wait+0x4e/0x80
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
> Mar 8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>
> Any attempt to access the volume from any node is fruitless until the
> mongodb process is killed; sessions accessing the /HPCdata path freeze on
> every node.
> A complete stop (force) and start of the volume is then needed to bring
> it back to an operational state.
> The situation can be reproduced at will.
> Is there anybody able to help us? Could we collect more pieces of
> information to help diagnose the problem?
>
> Thanks a lot
> Alessio
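
To gather that, something like the following on each node while you reproduce
the hang would help (a rough sketch; the log file names assume the default
GlusterFS 3.2 layout, where the client log is named after the mount point and
the brick log after the brick path, so adjust if your install differs):

# CPU and memory snapshot on each brick server
top -b -n 1 | head -20
free -m

# Gluster view from any node
gluster peer status
gluster volume info HPC_data

# Client (FUSE mount) and brick logs around the time of the hang
tail -n 200 /var/log/glusterfs/HPCdata.log
tail -n 200 /var/log/glusterfs/bricks/data.log

# Kernel-side messages, in case the RDMA transport is reporting errors
dmesg | tail -50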