Hard lockups during file transfer to GNBD/GFS device

Here is our setup: 2 GNBD servers attached to a shared SCSI array. Each of the 9 nodes uses multipath to import the shared device from both servers, and we are running GFS on top of that for our shared storage.
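For reference, the export/import side looks roughly like this; the server hostnames, device path, and export name are placeholders for our real ones:

    # on each GNBD server: export the shared SCSI device
    gnbd_export -d /dev/sdb1 -e shared0

    # on each of the 9 client nodes: import from both servers
    gnbd_import -i gnbd-server1
    gnbd_import -i gnbd-server2

    # dm-multipath then coalesces the two /dev/gnbd imports into one
    # multipathed device, which holds the GFS filesystem
    multipath -ll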

What is happening is that I need to transfer a large number of files (about 1.5 million) from a node's local storage to the network storage, and I'm using rsync locally to move them all (roughly the command sketched below). Originally my problem was that the oom-killer would start running partway through the transfer, after which the machine was unusable (though still up enough that it wasn't fenced).
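To be concrete, the transfer is just a local rsync onto the GFS mount; the paths here are placeholders for our real ones:

    # copy ~1.5 million files from local disk onto the GFS mount
    # (-a = archive mode; /data/local and /mnt/gfs are placeholders)
    rsync -a /data/local/ /mnt/gfs/data/

Here is the log from when the oom-killer fired: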

Sep 27 12:21:43 db2 kernel: oom-killer: gfp_mask=0xd0
Sep 27 12:21:43 db2 kernel: Mem-info:
Sep 27 12:21:43 db2 kernel: DMA per-cpu:
Sep 27 12:21:43 db2 kernel: cpu 0 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 0 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 1 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 1 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 2 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 2 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 3 hot: low 2, high 6, batch 1
Sep 27 12:21:43 db2 kernel: cpu 3 cold: low 0, high 2, batch 1
Sep 27 12:21:43 db2 kernel: cpu 4 hot: low 2, high 6, batch 1
Sep 27 12:21:44 db2 kernel: cpu 4 cold: low 0, high 2, batch 1
Sep 27 12:21:53 db2 in[15473]: 1159374113||chericee@xxxxxxxxxxxxxx|2852|timeout|1
Sep 27 12:21:54 db2 kernel: cpu 5 hot: low 2, high 6, batch 1
Sep 27 12:21:54 db2 kernel: cpu 5 cold: low 0, high 2, batch 1
Sep 27 12:21:54 db2 kernel: cpu 6 hot: low 2, high 6, batch 1
Sep 27 12:21:54 db2 kernel: cpu 6 cold: low 0, high 2, batch 1
Sep 27 12:21:54 db2 kernel: cpu 7 hot: low 2, high 6, batch 1
Sep 27 12:21:54 db2 kernel: cpu 7 cold: low 0, high 2, batch 1
Sep 27 12:21:54 db2 kernel: Normal per-cpu:
Sep 27 12:21:54 db2 kernel: cpu 0 hot: low 32, high 96, batch 16
Sep 27 12:21:54 db2 kernel: cpu 0 cold: low 0, high 32, batch 16
Sep 27 12:21:54 db2 kernel: cpu 1 hot: low 32, high 96, batch 16
Sep 27 12:21:54 db2 kernel: cpu 1 cold: low 0, high 32, batch 16
Sep 27 12:21:54 db2 kernel: cpu 2 hot: low 32, high 96, batch 16
Sep 27 12:27:59 db2 syslogd 1.4.1: restart.
Sep 27 12:27:59 db2 syslog: syslogd startup succeeded
Sep 27 12:27:59 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 27 12:27:59 db2 kernel: Linux version 2.6.9-42.0.2.ELsmp (buildsvn@build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Wed Aug 23 00:17:26 CDT 2006



I found a few postings saying that using the hugemem kernel would solve the problem (they claimed it was a known SMP bug acknowledged by Red Hat), so all my systems are now running on that kernel (see the note below on how I switched them over). It did solve the out-of-memory problem, but it seems to have introduced some new ones.
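For reference, switching a node over was just a matter of installing the hugemem kernel package and making it the GRUB default; the exact package name below is an assumption based on the version in my logs:

    # install the hugemem variant on RHEL 4 / i386
    up2date kernel-hugemem
    # (or: rpm -ivh kernel-hugemem-2.6.9-42.0.2.EL.i686.rpm)

    # set default= in /boot/grub/grub.conf to the new entry, reboot, then:
    uname -r    # should report 2.6.9-42.0.2.ELhugemem

Here are the logs from the most recent crashes: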


Sep 28 11:15:05 db2 kernel: do_IRQ: stack overflow: 412
Sep 28 11:15:05 db2 kernel:  [<02107c6b>] do_IRQ+0x49/0x1ae<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
Sep 28 11:15:05 db2 kernel:  printing eip:
Sep 28 11:15:05 db2 kernel: 0212928c
Sep 28 11:15:05 db2 kernel: *pde = 00004001
Sep 28 11:15:05 db2 kernel: Oops: 0002 [#1]
Sep 28 11:15:05 db2 kernel: SMP
Sep 28 11:15:05 db2 kernel: Modules linked in: mptctl mptbase dell_rbu nfsd exportfs lockd nfs_acl parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dm_round_robin gnbd(U) dlm(U) cman(U) sunrpc ipmi_devintf ipmi_si ipmi_msghandler iptable_filter iptable_mangle iptable_nat ip_conntrack ip_tables md5 ipv6 dm_multipath joydev button battery ac uhci_hcd ehci_hcd hw_random e1000 bonding(U) floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
Sep 28 11:15:05 db2 kernel: CPU:    1548750336
Sep 28 11:15:05 db2 kernel: EIP:    0060:[<0212928c>]    Not tainted VLI
Sep 28 11:15:05 db2 kernel: EFLAGS: 00010002   (2.6.9-42.0.2.ELhugemem)
Sep 28 11:15:05 db2 kernel: EIP is at internal_add_timer+0x84/0x8c
Sep 28 11:15:05 db2 kernel: eax: 00000000   ebx: 023b7900   ecx: 023b8680   edx: 02447620
Sep 28 11:15:05 db2 kernel: esi: 00000000   edi: 023b7900   ebp: 02ee0c94   esp: 48552fb4
Sep 28 11:15:05 db2 kernel: ds: 007b   es: 007b   ss: 0068
Sep 28 11:15:05 db2 kernel: Process  (pid: 1, threadinfo=48552000 task=6d641a00)
Sep 28 11:17:54 db2 syslogd 1.4.1: restart.
Sep 28 11:17:54 db2 syslog: syslogd startup succeeded
Sep 28 11:17:54 db2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 28 11:17:54 db2 syslog: klogd startup succeeded
Sep 28 11:17:54 db2 kernel: Linux version 2.6.9-42.0.2.ELhugemem (buildsvn@build-i386) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Wed Aug 23 00:38:38 CDT 2006

The GNBD servers stay online and don't have any problems; it's just the client where all the trouble is coming from. Is this a bug, or is something not set up right?

If you need more info I'll be happy to provide it.
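In the meantime, here's roughly what I can pull from a client node if it helps (the GFS mount point below is just an example):

    uname -r                    # running kernel
    multipath -ll               # multipath topology over the GNBD imports
    cat /proc/cluster/status    # cman cluster state (RHEL4 cluster suite)
    gfs_tool df /mnt/gfs        # GFS filesystem details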

Thanks.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
