RHCS/GFS2 support team,

I would like to inform you about a serious GFS2 problem we encountered last week. Please find a detailed description below. I have enclosed a tarfile containing detailed information about this problem.

Description

The two-node cluster is used as a test cluster without any load; only functionality is tested, no performance tests. The RHCS services that run on this cluster are rather standard services.

In a 2-day timeframe we had two occurrences of this problem, which were both very similar. On the 2nd node, a Perl script tried to write some info to a file on the GFS2 filesystem, and the process hung at that point. From the GFS2 lockdump info we saw one W-lock associated with an inode, and it turned out that the inode was a directory on GFS2. Every command executed on that file (e.g. ls -l) or on that directory (e.g. du <dirname>) resulted in a hang of the process. The hung processes were all in D-state (uninterruptible sleep). However, from the 1st node all files and directories were accessible without any problem; even an ls -lR executed on the 1st node from the top of the GFS2 filesystem traversed the full directory tree without problems.

We suspect that the offending directory holds a W-lock for which no lock owner exists anymore. So it does not look like a 'global' filesystem hang, but rather a local problem on the 2nd node: apart from the directory holding the lock, the rest of the GFS2 filesystem remains accessible from the 2nd node as well. Needless to say, this makes the application unavailable. We are unable to reproduce the problem.

1st occurrence

After collecting information, we rebooted the 2nd node, and after the reboot it joined the 1st node in the cluster without any problem.

2nd occurrence

This happened 2 days later in the same way on the same node. After collecting information, we this time also ran gfs2_fsck on the GFS2 filesystem before letting the node rejoin the cluster. No errors, orphans, or corruption were reported. After the fsck we started the cluster software on the 2nd node and it joined the cluster without any problem.

Additional information (gfs2_lockdump, gfs2_hangalyzer, sysrq-t info, etc.) was collected in a tarball (enov_additional_info.tar); a sketch of how this data was gathered follows the environment details below.

Additional information in enov_additional_info.tar

- enov_clusterinfo_app2.txt.gz, containing:
  - /etc/cluster/cluster.conf
  - gfs2_hangalyzer output from the 2nd node
  - cman_tool <version, status, services, -af nodes>
  - group_tool <-v, dump, dump fence, dump gfs2>
  - ccs_tool <lsnode, lsfence>
  - openais-cfgtool -s
  - clustat -fl
  - process status information of all processes
  - gfs2_tool gettune /gfsdata
- enov_sysrq-t_app2.txt.gz
- enov_glocks_app2.txt.gz
- enov_debugfs_dlm_app2.tar.gz: compressed tarball of the dlm directory from the debugfs filesystem on the 2nd node

Environment

2-node cluster running CentOS 5.7, with Red Hat Cluster Suite and GFS2. Latest updates for the OS and RHCS/GFS2 (as of Jan 8, 2012) are installed. Kernel version 2.6.18-274.12.1.el5PAE. One GFS2 filesystem (20G) on an HP/LeftHand Networks iSCSI SAN volume. iSCSI initiator version 6.2.0.872-10.el5.
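For reference, below is roughly how the lock and process information in the tarball was gathered on the 2nd node. This is a sketch, not the exact script we ran; the /tmp output paths are just examples, and the exact debugfs path depends on the cluster and filesystem name:

    # debugfs is normally mounted already; mount it if not.
    mount -t debugfs none /sys/kernel/debug 2>/dev/null

    # Glock state of the GFS2 filesystem. The stuck directory showed up
    # here as an inode glock with a W holder.
    cat /sys/kernel/debug/gfs2/*/glocks | gzip > /tmp/enov_glocks_app2.txt.gz

    # Processes in uninterruptible sleep (D-state), with their wait channel.
    ps -eo pid,stat,wchan:30,args | awk 'NR==1 || $2 ~ /^D/'

    # Kernel stacks of all tasks (sysrq-t), taken from the kernel log
    # (ring buffer size permitting).
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/enov_sysrq-t_app2.txt

    # DLM debug state from debugfs.
    tar czf /tmp/enov_debugfs_dlm_app2.tar.gz -C /sys/kernel/debug dlm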
Thanking you in advance for your cooperation. If you need additional information to help solve this problem, please let me know.

With kind regards,
G. Wieberdink
Sr. Engineer at E.Novation

Attachment: enov_additional_info.tar