[Linux-cluster] (no subject)

Hi,

We have a cluster of two RH 2.6.7 SMP machines using GFS, and we experience
random stability issues.

Every 2 days or so, a lock_dlm error message is dumped to the log (see
below).
When this happens, either both machines are unable to access the GFS file
system (hanging on ls, df, ...), or a random process that was accessing a
file hangs on one of the machines (always a different process; it can be
tar, gzip, mv, ...) and cannot be terminated.
The only thing we can do then is reboot both nodes.

We haven't found a way to reproduce this problem; it seems to happen
randomly.
We have tried the following to eliminate the problem (without success or
improvement):

 - Shut down machine A and run all services on machine B
 - Shut down machine B and run all services on machine A
 - Disable heavy I/O on both machines (mainly full daily backups)

The error message is the following:

------

Sep 13 15:05:43 L1_OAS56_B kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000005
Sep 13 15:05:43 L1_OAS56_B kernel:  printing eip:
Sep 13 15:05:43 L1_OAS56_B kernel: c013a1f6
Sep 13 15:05:43 L1_OAS56_B kernel: *pde = 17aea001
Sep 13 15:05:43 L1_OAS56_B kernel: Oops: 0002 [#1]
Sep 13 15:05:43 L1_OAS56_B kernel: SMP
Sep 13 15:05:43 L1_OAS56_B kernel: Modules linked in: nfsd exportfs ipv6
autofs e1000 af_packet parport_pc parport ohci_hcd ehci_hcd lock_dlm dlm
cman gfs lock_harness dm_mod floppy uhci_hcd usbcore thermal processor fan
button battery asus_acpi ac ext3 jbd loop ide_cd cdrom qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod i2o_block i2o_core
Sep 13 15:05:43 L1_OAS56_B kernel: CPU:    2
Sep 13 15:05:43 L1_OAS56_B kernel: EIP:    0060:[<c013a1f6>]    Not tainted
Sep 13 15:05:43 L1_OAS56_B kernel: EFLAGS: 00010083   (2.6.7)
Sep 13 15:05:43 L1_OAS56_B kernel: EIP is at find_get_pages+0x41/0x5a
Sep 13 15:05:43 L1_OAS56_B kernel: eax: 00000001   ebx: d6d2de4c   ecx:
00000010   edx: 00000004
Sep 13 15:05:43 L1_OAS56_B kernel: esi: f274a724   edi: e00f2240   ebp:
d6d2ddfc   esp: d6d2dde4
Sep 13 15:05:43 L1_OAS56_B kernel: ds: 007b   es: 007b   ss: 0068
Sep 13 15:05:43 L1_OAS56_B kernel: Process lock_dlm (pid: 1575,
threadinfo=d6d2c000 task=f7b945c0)
Sep 13 15:05:43 L1_OAS56_B kernel: Stack: f274a728 d6d2de4c 00000000
00000010 d6d2de44 f274a724 d6d2de18 c01441ed
Sep 13 15:05:43 L1_OAS56_B kernel:        f274a724 00000000 00000010
d6d2de4c 00000000 d6d2dea0 c01444d0 d6d2de44
Sep 13 15:05:43 L1_OAS56_B kernel:        f274a724 00000000 00000010
c3207870 00000000 d6d2c000 00000000 00000000
Sep 13 15:05:43 L1_OAS56_B kernel: Call Trace:
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c0106c6b>] show_stack+0x80/0x96
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c0106e02>] show_registers+0x15f/0x1ae
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c0106f77>] die+0x8d/0xfb
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c0117e86>] do_page_fault+0x270/0x579
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c0106911>] error_code+0x2d/0x38
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c01441ed>] pagevec_lookup+0x2c/0x35
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c01444d0>]
truncate_inode_pages+0x71/0x29f
Sep 13 15:05:43 L1_OAS56_B kernel:  [<fa9bdc40>] gfs_inval_buf+0x45/0x88
[gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [<fa9cd06b>] inode_go_inval+0x45/0x4f
[gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [<fa9c9ec3>] drop_bh+0x15f/0x1d6 [gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [<fa9cb4bd>] gfs_glock_cb+0x167/0x1f4
[gfs]
Sep 13 15:05:43 L1_OAS56_B kernel:  [<fa928ace>]
process_complete+0x103/0x34c [lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel:  [<fa928ee2>] dlm_async+0x1cb/0x290
[lock_dlm]
Sep 13 15:05:43 L1_OAS56_B kernel:  [<c0104291>]
kernel_thread_helper+0x5/0xb
Sep 13 15:05:43 L1_OAS56_B kernel:
Sep 13 15:05:43 L1_OAS56_B kernel: Code: f0 ff 40 04 83 c2 01 39 ca 72 f2 c6
46 10 01 fb 83 c4 10 5b

------
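
For readers trying to interpret the "Oops: 0002" line above, here is a minimal
sketch that decodes the i386 page-fault error code. The function name
decode_oops_error_code is made up for illustration; the bit meanings are the
standard x86 ones (bit 0: protection fault vs. page not present, bit 1: write
vs. read, bit 2: user vs. kernel mode). Only the three low bits are handled.

```python
def decode_oops_error_code(code_hex):
    """Decode the low three bits of an x86 page-fault error code.

    code_hex is the hex string printed after "Oops:" (for example "0002").
    The function name is hypothetical; it is not part of any kernel tool.
    """
    code = int(code_hex, 16)
    return {
        # bit 0: 0 = no page found, 1 = protection fault
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


if __name__ == "__main__":
    print(decode_oops_error_code("0002"))
```

For the value in this log, that gives a kernel-mode write to a page that is
not present. This appears consistent with the rest of the dump: the first
Code bytes (f0 ff 40 04) look like a locked increment at offset 4 from eax,
and eax is 00000001, which would give the reported address 00000005. That
reading is an interpretation, not something confirmed by the GFS developers.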

Any idea what's wrong or what we should check next?
Is it possible to "unlock" the machines after such an error without reboot?

The release version is DEVEL.1090589850.

Thanks for your help,

Stéphane Messerli
Senior Support & Project Engineer, Technology Europe
smesserli@xxxxxxxxxxxxx

24/7 Real Media (NASDAQ: TFSM)
Route de la Pierre
1024 Ecublens
Switzerland

tel. +41 21 695 97 46
fax +41 21 695 97 01
	



