RHCS/GFS2 support team,

I would like to inform you about a serious GFS2 problem we encountered last week. Please find a detailed description below. I have enclosed a tarfile containing detailed information about this problem.

Description

The two-node cluster is used as a test cluster without any load; only functionality is tested, no performance tests. The RHCS services that run on this cluster are rather standard services.

In a 2-day timeframe we had two occurrences of this problem, which were both very similar. On the 2nd node, a Perl script tried to write some info to a file on the GFS2 filesystem, and the process hung at that point. From the GFS2 lockdump info we saw one W-lock associated with an inode, and it turned out that the inode was a directory on GFS2. Every command executed on that file (e.g. ls -l) or on that directory (e.g. du <dirname>) resulted in a hang of the process. The hung processes were all in D-state (uninterruptible sleep). However, from the 1st node all files and directories were accessible without any problem; even an ls -lR executed on the 1st node from the top of the GFS2 filesystem traversed the full directory tree without problems.

We suspect that the offending directory holds a W-lock for which no lock owner exists anymore. So it does not look like a 'global' filesystem hang, but rather a local problem on the 2nd node: apart from the directory holding the lock, the rest of the GFS2 filesystem remains accessible from the 2nd node as well. Needless to say, this makes the application unavailable. We are unable to reproduce the problem.

1st occurrence

After collecting information, we rebooted the 2nd node, and after the reboot it joined the 1st node in the cluster without any problem.

2nd occurrence

This happened 2 days later in the same way on the same node. After collecting information, we this time also ran gfs2_fsck on the GFS2 filesystem before letting the node rejoin the cluster. No errors, orphans, or corruption were reported. After the fsck we started the cluster software on the 2nd node and it joined the cluster without any problem.

Additional information (gfs2_lockdump, gfs2_hangalyzer, sysrq-t info, etc.) was collected in a tarball (enov_additional_info.tar); a sketch of how this data was gathered follows the environment details below.

Additional information in enov_additional_info.tar

- enov_clusterinfo_app2.txt.gz, containing:
  - /etc/cluster/cluster.conf
  - gfs2_hangalyzer output from the 2nd node
  - cman_tool <version, status, services, -af nodes>
  - group_tool <-v, dump, dump fence, dump gfs2>
  - ccs_tool <lsnode, lsfence>
  - openais-cfgtool -s
  - clustat -fl
  - process status information of all processes
  - gfs2_tool gettune /gfsdata
- enov_sysrq-t_app2.txt.gz
- enov_glocks_app2.txt.gz
- enov_debugfs_dlm_app2.tar.gz: compressed tarball of the dlm directory from the debugfs filesystem on the 2nd node

Environment

2-node cluster running CentOS 5.7, with Red Hat Cluster Suite and GFS2. Latest updates for the OS and RHCS/GFS2 (as of Jan 8, 2012) are installed. Kernel version 2.6.18-274.12.1.el5PAE. One GFS2 filesystem (20G) on an HP/LeftHand Networks iSCSI SAN volume. iSCSI initiator version 6.2.0.872-10.el5.
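For reference, below is roughly how the lock and process information in the tarball was gathered on the 2nd node. This is a sketch, not the exact script we ran; the /tmp output paths are just examples, and the exact debugfs path depends on the cluster and filesystem name:

    # debugfs is normally mounted already; mount it if not.
    mount -t debugfs none /sys/kernel/debug 2>/dev/null

    # Glock state of the GFS2 filesystem. The stuck directory showed up
    # here as an inode glock with a W holder.
    cat /sys/kernel/debug/gfs2/*/glocks | gzip > /tmp/enov_glocks_app2.txt.gz

    # Processes in uninterruptible sleep (D-state), with their wait channel.
    ps -eo pid,stat,wchan:30,args | awk 'NR==1 || $2 ~ /^D/'

    # Kernel stacks of all tasks (sysrq-t), taken from the kernel log
    # (ring buffer size permitting).
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/enov_sysrq-t_app2.txt

    # DLM debug state from debugfs.
    tar czf /tmp/enov_debugfs_dlm_app2.tar.gz -C /sys/kernel/debug dlm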
Thanking you in advance for your cooperation. If you need additional information to help solve this problem, please let me know.

With kind regards,
G. Wieberdink
Sr. Engineer at E.Novation

Attachment: enov_additional_info.tar