Re: RHEL5 GFS2 - 2 node - node fenced when writing

Hi,

On Wed, 2007-06-27 at 18:35 -0400, nrbwpi@xxxxxxxxx wrote:
> Thanks for your reply.
> 
> I switched the hardware over to Fedora Core 6, brought the system up
> to date, and configured it the same as before with GFS2.  Uname
> returns the following kernel string: "Linux fu2 2.6.20-1.2952.fc6 #1
> SMP Wed May 16 18:18:22 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux".
> 
> The same fencing occurred after several hours of writing zeros to the
> volume with dd in 250MB files.  This time, however, I noticed a kernel
> panic on the fenced node.  The kernel output in /var/log/messages is
> below.  Could this be a hardware configuration issue, or a bug in the
> kernel?
> 
It's a kernel bug. We are currently working on a fix in the same area,
so it may be that you've tripped over the same thing, or something
related. There are also a few quite recent patches in the git tree
which haven't made it into FC-6 yet, so it might also be one of those
that fixes the problem. I'll try to get another set of update patches
done shortly - I'm out of the office at the moment, which makes such
things a bit slower than usual, I'm afraid.

If you are able to test the current GFS2 git tree kernel and you are
still having the problem, then please report it through the Red Hat
bugzilla.
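
If it helps, here is a rough sketch of building a test kernel from the
git tree (the tree URL below is my assumption - substitute whichever
GFS2 development tree is current):

  # Fetch the GFS2 development tree (URL assumed; adjust as needed)
  git clone git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw.git
  cd gfs2-2.6-nmw
  # Start from the running kernel's config, then build and install
  cp /boot/config-$(uname -r) .config
  make oldconfig
  make -j4 && make modules_install && make install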

Steve.

> 
> #####################################
> 
> Kernel panic
> 
> #####################################
> 
> Jun 26 10:00:41 fu2 kernel: ------------[ cut here ]------------
> Jun 26 10:00:41 fu2 kernel: kernel BUG at lib/list_debug.c:67!
> Jun 26 10:00:41 fu2 kernel: invalid opcode: 0000 [1] SMP
> Jun 26 10:00:41 fu2 kernel: last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/irq
> Jun 26 10:00:41 fu2 kernel: CPU 7
> Jun 26 10:00:41 fu2 kernel: Modules linked in: lock_dlm gfs2 dlm configfs ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp xfs rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath video sbs i2c_ec i2c_core dock button battery asus_acpi backlight ac parport_pc lp parport sg ata_piix libata pcspkr bnx2 ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc shpchp megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Jun 26 10:00:41 fu2 kernel: Pid: 4142, comm: gfs2_logd Not tainted 2.6.20-1.2952.fc6 #1
> Jun 26 10:00:41 fu2 kernel: RIP: 0010:[<ffffffff80341368>]  [<ffffffff80341368>] list_del+0x21/0x5b
> Jun 26 10:00:41 fu2 kernel: RSP: 0018:ffff81011e247d00  EFLAGS: 00010082
> Jun 26 10:00:41 fu2 kernel: RAX: 0000000000000058 RBX: ffff81011aa40000 RCX: ffffffff8057fc58
> Jun 26 10:00:41 fu2 kernel: RDX: ffffffff8057fc58 RSI: 0000000000000000 RDI: ffffffff8057fc40
> Jun 26 10:00:41 fu2 kernel: RBP: ffff81012da3f7c0 R08: ffffffff8057fc58 R09: 0000000000000001
> Jun 26 10:00:41 fu2 kernel: R10: 0000000000000000 R11: ffff81012fd9d0c0 R12: ffff81011aa40f70
> Jun 26 10:00:41 fu2 kernel: R13: ffff810123fb1a00 R14: ffff810123fb05d8 R15: 0000000000000036
> Jun 26 10:00:41 fu2 kernel: FS:  0000000000000000(0000) GS:ffff81012fdb47c0(0000) knlGS:0000000000000000
> Jun 26 10:00:41 fu2 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> Jun 26 10:00:41 fu2 kernel: CR2: 00002aaaadfbe008 CR3: 0000000042c20000 CR4: 00000000000006e0
> Jun 26 10:00:41 fu2 kernel: Process gfs2_logd (pid: 4142, threadinfo ffff81011e246000, task ffff810121d35800)
> Jun 26 10:00:41 fu2 kernel: Stack:  ffff810123fb1a00 ffffffff802cc6e7 0000003c00000000 ffff81012da3f7c0
> Jun 26 10:00:41 fu2 kernel:  000000000000003c ffff810123fb0400 0000000000000000 ffff810123fb1a00
> Jun 26 10:00:41 fu2 kernel:  ffff81012da3f800 ffffffff802cc8be ffff810123fb07e8 ffff810123fb0400
> Jun 26 10:00:41 fu2 kernel: Call Trace:
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc6e7>] free_block+0xb1/0x142
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc8be>] cache_flusharray+0x7d/0xb1
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff8020765f>] kmem_cache_free+0x1ef/0x20c
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff88445628>] :gfs2:databuf_lo_before_commit+0x576/0x5c6
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff88443acf>] :gfs2:gfs2_log_flush+0x11e/0x2d3
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438310>] :gfs2:gfs2_logd+0xab/0x15b
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438265>] :gfs2:gfs2_logd+0x0/0x15b
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>] keventd_create_kthread+0x0/0x6a
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802318bd>] kthread+0xd0/0xff
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aec8>] child_rip+0xa/0x12
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>] keventd_create_kthread+0x0/0x6a
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff802317ed>] kthread+0x0/0xff
> Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aebe>] child_rip+0x0/0x12
> Jun 26 10:00:41 fu2 kernel:
> Jun 26 10:00:41 fu2 kernel: Code: 0f 0b eb fe 48 8b 07 48 8b 50 08 48 39 fa 74 12 48 c7 c7 97
> Jun 26 10:00:41 fu2 kernel: RIP  [<ffffffff80341368>] list_del+0x21/0x5b
> Jun 26 10:00:41 fu2 kernel:  RSP <ffff81011e247d00>
> 
> 
> 
> On 6/7/07, Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
>         Hi,
>         
>         The version of GFS2 in RHEL5 is rather old. Please use Fedora
>         or the upstream kernel, or wait until RHEL 5.1 is out. This
>         should solve the problem that you are seeing.
>         
>         Steve.
>         
>         On Wed, 2007-06-06 at 19:27 -0400, nrbwpi@xxxxxxxxx wrote:
>         > Hello,
>         >
>         > Installed RHEL5 on a new two node cluster with shared FC
>         > storage.  The two shared storage boxes are each split into
>         > 6.9TB LUNs, for a total of four 6.9TB LUNs.  Each machine is
>         > connected via a single 100Mb connection to a switch and a
>         > single FC connection to an FC switch.
>         >
>         > The 4 LUNs have LVM on them with GFS2.  The file systems are
>         > mountable from each box.  When performing a scripted dd write
>         > of zeros in 250MB file sizes to the file systems from each
>         > box to different LUNs, one of the nodes in the cluster is
>         > fenced by the other one.  File size does not seem to matter.
>         >
>         > My first guess at the problem was the heartbeat timeout in
>         > openais.  In the cluster.conf below I added the totem line to
>         > raise the token timeout to 10 seconds.  This, however, did
>         > not resolve the problem.  Both boxes are running the latest
>         > updates as of 2 days ago from up2date.
>         >
>         > Below are the cluster.conf and what is seen in the logs.  Any
>         > suggestions would be greatly appreciated.
>         >
>         > Thanks!
>         >
>         > Neal
>         >
>         >
>         >
>         > ##########################################
>         >
>         > Cluster.conf
>         >
>         > ##########################################
>         >
>         >
>         > <?xml version="1.0"?>
>         > <cluster alias="storage1" config_version="4" name="storage1">
>         >         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>         >         <clusternodes>
>         >                 <clusternode name="fu1" nodeid="1" votes="1">
>         >                         <fence>
>         >                                 <method name="1">
>         >                                         <device name="apc4" port="1" switch="1"/>
>         >                                 </method>
>         >                         </fence>
>         >                         <multicast addr="224.10.10.10" interface="eth0"/>
>         >                 </clusternode>
>         >                 <clusternode name="fu2" nodeid="2" votes="1">
>         >                         <fence>
>         >                                 <method name="1">
>         >                                         <device name="apc4" port="2" switch="1"/>
>         >                                 </method>
>         >                         </fence>
>         >                         <multicast addr="224.10.10.10" interface="eth0"/>
>         >                 </clusternode>
>         >         </clusternodes>
>         >         <cman expected_votes="1" two_node="1">
>         >                 <multicast addr="224.10.10.10"/>
>         >                 <totem token="10000"/>
>         >         </cman>
>         >         <fencedevices>
>         >                 <fencedevice agent="fence_apc" ipaddr="192.168.14.193" login="apc" name="apc4" passwd="apc"/>
>         >         </fencedevices>
>         >         <rm>
>         >                 <failoverdomains/>
>         >                 <resources/>
>         >         </rm>
>         > </cluster>
>         >
>         > 
>         > #####################################################
>         >
>         > /var/log/messages
>         >
>         > #####################################################
>         >
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the OPERATIONAL state.
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
>         > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from 2.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from 0.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token because I am the rep.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high seq received 6e
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member 192.168.14.195:
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep 192.168.14.195
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e received flag 0
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate any messages in recovery.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for ring 14
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
>         > Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
>         > Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec post_fail_delay
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
>         > Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.197)
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
>         > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
>         > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
>         > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
>         > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message 192.168.14.195
>         > Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from node 1
>         > Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Trying to acquire journal lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Looking at journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Acquiring the transaction lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replaying journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replayed 0 of 0 blocks
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Found 0 revoke tags
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Journal replayed in 1s
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Done
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Acquiring the transaction lock...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replaying journal...
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replayed 0 of 0 blocks
>         > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Found 0 revoke tags
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Journal replayed in 1s
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Done
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Acquiring the transaction lock...
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Acquiring the transaction lock...
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replaying journal...
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replayed 222 of 223 blocks
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Found 1 revoke tags
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Journal replayed in 1s
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Done
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replaying journal...
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replayed 438 of 439 blocks
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Found 1 revoke tags
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Journal replayed in 1s
>         > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Done
>         >
>         >
>         > --
>         > Linux-cluster mailing list
>         > Linux-cluster@xxxxxxxxxx
>         > https://www.redhat.com/mailman/listinfo/linux-cluster
>         
>         --
>         Linux-cluster mailing list
>         Linux-cluster@xxxxxxxxxx
>         https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
