I switched the hardware over to Fedora Core 6, brought the system up to date with up2date, and configured it with GFS2 the same as before. uname returns the following kernel string: "Linux fu2 2.6.20-1.2952.fc6 #1 SMP Wed May 16 18:18:22 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux".
The same fencing occurred after several hours of writing zeros to the volume with dd in 250MB files. This time, however, I noticed a kernel panic on the fenced node. The kernel output from /var/log/messages is below. Could this be a hardware configuration issue, or a kernel bug?
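For reference, the write workload was roughly the following (a minimal sketch, not the original script; TARGET, COUNT, and the zeros.N naming are placeholders, and TARGET defaults to a throwaway temp directory so the sketch is safe to run anywhere other than a GFS2 mount):

```shell
#!/bin/sh
# Sketch of the dd write test described above. TARGET, COUNT, and the
# zeros.N file naming are assumptions; point TARGET at a GFS2 mount to
# reproduce, otherwise it writes to a temp directory.
TARGET="${TARGET:-$(mktemp -d)}"
COUNT="${COUNT:-2}"       # how many files to write
SIZE_MB="${SIZE_MB:-250}" # 250MB files, as in the report

i=1
while [ "$i" -le "$COUNT" ]; do
    # Each pass writes one SIZE_MB-sized file of zeros.
    dd if=/dev/zero of="$TARGET/zeros.$i" bs=1M count="$SIZE_MB" 2>/dev/null
    i=$((i + 1))
done
ls -l "$TARGET"
```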
#####################################
Kernel panic
#####################################
Jun 26 10:00:41 fu2 kernel: ------------[ cut here ]------------
Jun 26 10:00:41 fu2 kernel: kernel BUG at lib/list_debug.c:67!
Jun 26 10:00:41 fu2 kernel: invalid opcode: 0000 [1] SMP
Jun 26 10:00:41 fu2 kernel: last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/irq
Jun 26 10:00:41 fu2 kernel: CPU 7
Jun 26 10:00:41 fu2 kernel: Modules linked in: lock_dlm gfs2 dlm configfs ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp xfs rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath video sbs i2c_ec i2c_core dock button battery asus_acpi backlight ac parport_pc lp parport sg ata_piix libata pcspkr bnx2 ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc shpchp megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Jun 26 10:00:41 fu2 kernel: Pid: 4142, comm: gfs2_logd Not tainted 2.6.20-1.2952.fc6 #1
Jun 26 10:00:41 fu2 kernel: RIP: 0010:[<ffffffff80341368>] [<ffffffff80341368>] list_del+0x21/0x5b
Jun 26 10:00:41 fu2 kernel: RSP: 0018:ffff81011e247d00 EFLAGS: 00010082
Jun 26 10:00:41 fu2 kernel: RAX: 0000000000000058 RBX: ffff81011aa40000 RCX: ffffffff8057fc58
Jun 26 10:00:41 fu2 kernel: RDX: ffffffff8057fc58 RSI: 0000000000000000 RDI: ffffffff8057fc40
Jun 26 10:00:41 fu2 kernel: RBP: ffff81012da3f7c0 R08: ffffffff8057fc58 R09: 0000000000000001
Jun 26 10:00:41 fu2 kernel: R10: 0000000000000000 R11: ffff81012fd9d0c0 R12: ffff81011aa40f70
Jun 26 10:00:41 fu2 kernel: R13: ffff810123fb1a00 R14: ffff810123fb05d8 R15: 0000000000000036
Jun 26 10:00:41 fu2 kernel: FS: 0000000000000000(0000) GS:ffff81012fdb47c0(0000) knlGS:0000000000000000
Jun 26 10:00:41 fu2 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jun 26 10:00:41 fu2 kernel: CR2: 00002aaaadfbe008 CR3: 0000000042c20000 CR4: 00000000000006e0
Jun 26 10:00:41 fu2 kernel: Process gfs2_logd (pid: 4142, threadinfo ffff81011e246000, task ffff810121d35800)
Jun 26 10:00:41 fu2 kernel: Stack: ffff810123fb1a00 ffffffff802cc6e7 0000003c00000000 ffff81012da3f7c0
Jun 26 10:00:41 fu2 kernel: 000000000000003c ffff810123fb0400 0000000000000000 ffff810123fb1a00
Jun 26 10:00:41 fu2 kernel: ffff81012da3f800 ffffffff802cc8be ffff810123fb07e8 ffff810123fb0400
Jun 26 10:00:41 fu2 kernel: Call Trace:
Jun 26 10:00:41 fu2 kernel: [<ffffffff802cc6e7>] free_block+0xb1/0x142
Jun 26 10:00:41 fu2 kernel: [<ffffffff802cc8be>] cache_flusharray+0x7d/0xb1
Jun 26 10:00:41 fu2 kernel: [<ffffffff8020765f>] kmem_cache_free+0x1ef/0x20c
Jun 26 10:00:41 fu2 kernel: [<ffffffff88445628>] :gfs2:databuf_lo_before_commit+0x576/0x5c6
Jun 26 10:00:41 fu2 kernel: [<ffffffff88443acf>] :gfs2:gfs2_log_flush+0x11e/0x2d3
Jun 26 10:00:41 fu2 kernel: [<ffffffff88438310>] :gfs2:gfs2_logd+0xab/0x15b
Jun 26 10:00:41 fu2 kernel: [<ffffffff88438265>] :gfs2:gfs2_logd+0x0/0x15b
Jun 26 10:00:41 fu2 kernel: [<ffffffff80297a1e>] keventd_create_kthread+0x0/0x6a
Jun 26 10:00:41 fu2 kernel: [<ffffffff802318bd>] kthread+0xd0/0xff
Jun 26 10:00:41 fu2 kernel: [<ffffffff8025aec8>] child_rip+0xa/0x12
Jun 26 10:00:41 fu2 kernel: [<ffffffff80297a1e>] keventd_create_kthread+0x0/0x6a
Jun 26 10:00:41 fu2 kernel: [<ffffffff802317ed>] kthread+0x0/0xff
Jun 26 10:00:41 fu2 kernel: [<ffffffff8025aebe>] child_rip+0x0/0x12
Jun 26 10:00:41 fu2 kernel:
Jun 26 10:00:41 fu2 kernel:
Jun 26 10:00:41 fu2 kernel: Code: 0f 0b eb fe 48 8b 07 48 8b 50 08 48 39 fa 74 12 48 c7 c7 97
Jun 26 10:00:41 fu2 kernel: RIP [<ffffffff80341368>] list_del+0x21/0x5b
Jun 26 10:00:41 fu2 kernel: RSP <ffff81011e247d00>
Hi,
The version of GFS2 in RHEL5 is rather old. Please use Fedora, the
upstream kernel, or wait until RHEL 5.1 is out. This should solve the
problem that you are seeing.
Steve.
On Wed, 2007-06-06 at 19:27 -0400, nrbwpi@xxxxxxxxx wrote:
> Hello,
>
> Installed RHEL5 on a new two-node cluster with shared FC storage.
> The two shared storage boxes are each split into 6.9TB LUNs, for a
> total of four 6.9TB LUNs. Each machine has a single 100Mb connection
> to an Ethernet switch and a single FC connection to an FC switch.
>
> The 4 LUNs have LVM on them with GFS2, and the file systems are
> mountable from each box. When running a scripted dd write of zeros
> in 250MB files from each box to file systems on different LUNs, one
> node in the cluster is fenced by the other. File size does not seem
> to matter.
>
> My first guess at the problem was the heartbeat timeout in openais,
> so in the cluster.conf below I added the totem line to raise the
> token timeout to 10 seconds. This, however, did not resolve the
> problem.
> Both boxes are running the latest updates as of 2 days ago from
> up2date.
>
> Below is the cluster.conf and what is seen in the logs. Any
> suggestions would be greatly appreciated.
>
> Thanks!
>
> Neal
>
>
>
> ##########################################
>
> Cluster.conf
>
> ##########################################
>
>
> <?xml version="1.0"?>
> <cluster alias="storage1" config_version="4" name="storage1">
> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> <clusternodes>
> <clusternode name="fu1" nodeid="1" votes="1">
> <fence>
> <method name="1">
> <device name="apc4" port="1"
> switch="1"/>
> </method>
> </fence>
> <multicast addr="224.10.10.10"
> interface="eth0"/>
> </clusternode>
> <clusternode name="fu2" nodeid="2" votes="1">
> <fence>
> <method name="1">
> <device name="apc4" port="2"
> switch="1"/>
> </method>
> </fence>
> <multicast addr="224.10.10.10"
> interface="eth0"/>
> </clusternode>
> </clusternodes>
> <cman expected_votes="1" two_node="1">
> <multicast addr="224.10.10.10"/>
> <totem token="10000"/>
> </cman>
> <fencedevices>
> <fencedevice agent="fence_apc" ipaddr="192.168.14.193"
> login="apc" name="apc4" passwd="apc"/>
> </fencedevices>
> <rm>
> <failoverdomains/>
> <resources/>
> </rm>
> </cluster>
>
>
> #####################################################
>
> /var/log/messages
>
> #####################################################
>
> Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the
> OPERATIONAL state.
> Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket
> recv buffer size (262142 bytes).
> Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket
> send buffer size (262142 bytes).
> Jun 5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from
> 2.
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from
> 0.
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token
> because I am the rep.
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high
> seq received 6e
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member
> 192.168.14.195:
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep
> 192.168.14.195
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e
> received flag 0
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for
> ring 14
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] CLM CONFIGURATION CHANGE
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] New Configuration:
> Jun 5 20:19:34 fu1 kernel: dlm: closing connection to node 2
> Jun 5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec
> post_fail_delay
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0)
> ip(192.168.14.195)
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Left:
> Jun 5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0)
> ip(192.168.14.197)
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Joined:
> Jun 5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the
> primary component and will provide service.
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] CLM CONFIGURATION CHANGE
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] New Configuration:
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] r(0)
> ip(192.168.14.195)
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Left:
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] Members Joined:
> Jun 5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the
> primary component and will provide service.
> Jun 5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
> Jun 5 20:19:34 fu1 openais[5351]: [CLM ] got nodejoin message
> 192.168.14.195
> Jun 5 20:19:34 fu1 openais[5351]: [CPG ] got joinlist message from
> node 1
> Jun 5 20:19:36 fu1 fenced[5367]: fence "fu2" success
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Trying to acquire journal lock...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Trying to acquire journal lock...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Looking at journal...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Trying to acquire journal lock...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Trying to acquire journal lock...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Looking at journal...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Looking at journal...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Looking at journal...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Replaying journal...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Replayed 0 of 0 blocks
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Found 0 revoke tags
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Journal replayed in 1s
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1:
> Done
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Replaying journal...
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Replayed 0 of 0 blocks
> Jun 5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Found 0 revoke tags
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Journal replayed in 1s
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1:
> Done
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Acquiring the transaction lock...
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Replaying journal...
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Replayed 222 of 223 blocks
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Found 1 revoke tags
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Journal replayed in 1s
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1:
> Done
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Replaying journal...
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Replayed 438 of 439 blocks
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Found 1 revoke tags
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Journal replayed in 1s
> Jun 5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1:
> Done
>
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster