GFS experts,

I'm looking for help debugging this crash. I'm running GFS with a very small journal size. I understand this is somewhat uncharted territory for GFS, but we need to be really stingy with journal overhead.

GFS consistently crashes with the same stack trace, which appears related to the journal log being committed to disk. gfs_tool counters show log space used close to 100%, and iostat shows a huge burst of writes just before the crash. Probably the journal blocks being flushed to disk?
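For what it's worth, here is how I read the head/dump numbers in the panic message. This is a toy sketch of my understanding only, not the actual check_seg_usage logic; the function name and the exact condition are my own guesses:

```python
# Toy model (my guess, NOT the kernel code) of the condition that seems
# to trip the check_seg_usage assertion: the log head has wrapped one
# full lap and caught up to the last log dump at the same offset, so it
# would overwrite journal segments still needed for recovery.

def head_overruns_dump(head_off, head_wrap, dump_off, dump_wrap):
    """Return True when the head is a wrap ahead of the dump point and
    has reached (or passed) its offset."""
    return head_wrap > dump_wrap and head_off >= dump_off

# Numbers from the panic message below:
print(head_overruns_dump(61856, 8, 61856, 7))   # True
print(head_overruns_dump(61800, 8, 61856, 7))   # False: head still short of dump_off
```

If that reading is right, a small journal would make it much easier for the head to lap the last dump under heavy write load.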
I tried the workaround of reducing the "max_atomic_write" value listed in https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146672, but no luck.
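For reference, this is how I applied that workaround (the tunable name is taken from the bug report; the value below is just an example, and /mnt/gfs2 is our mount point):

```shell
# Check the current value of the tunable (name from the bug report):
gfs_tool gettune /mnt/gfs2 | grep max_atomic_write

# Lower it; this value is an example, not a recommendation:
gfs_tool settune /mnt/gfs2 max_atomic_write 1048576
```

This takes effect on the live mount, so it does not survive a remount.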
Could this be a GFS bug that is exercised only with small journal sizes?
The full stack trace, gfs_tool counters, and iostat data are listed below. I'd appreciate any help!
- Sridharan
Cisco Systems, Inc
---------------------
Logs:
Feb 14 11:39:33 cfs1 kernel: GFS: fsid=cisco:gfs2.0: jid=1: Looking at journal...
Feb 14 11:39:33 cfs1 kernel: GFS: fsid=cisco:gfs2.0: jid=1: Done
Feb 14 11:39:33 cfs1 kernel: GFS: fsid=cisco:gfs2.0: jid=2: Trying to acquire journal lock...
Feb 14 11:39:33 cfs1 kernel: GFS: fsid=cisco:gfs2.0: jid=2: Looking at journal...
Feb 14 11:39:33 cfs1 kernel: GFS: fsid=cisco:gfs2.0: jid=2: Done
Feb 14 11:41:50 cfs1 kernel: GFS: fsid=cisco:gfs2.0: head_off = 61856, head_wrap = 8
Feb 14 11:41:50 cfs1 kernel: GFS: fsid=cisco:gfs2.0: dump_off = 61856, dump_wrap = 7
Feb 14 11:41:50 cfs1 kernel: GFS: fsid=cisco:gfs2.0: assertion "FALSE" failed
Feb 14 11:41:50 cfs1 kernel: GFS: fsid=cisco:gfs2.0: function = check_seg_usage
Feb 14 11:41:50 cfs1 kernel: GFS: fsid=cisco:gfs2.0: file = /download/gfs/cluster.cvs-rhel4/gfs-kernel/src/gfs/log.c, line = 590
Feb 14 11:41:50 cfs1 kernel: GFS: fsid=cisco:gfs2.0: time = 1171482110
Feb 14 11:41:50 cfs1 kernel: ------------[ cut here ]------------
Feb 14 11:41:50 cfs1 kernel: kernel BUG at /download/gfs/cluster.cvs-rhel4/gfs-kernel/src/gfs/util.c:211!
Feb 14 11:41:50 cfs1 kernel: invalid operand: 0000 [#1]
Feb 14 11:41:50 cfs1 kernel: SMP
Feb 14 11:41:50 cfs1 kernel: Modules linked in: lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) nfsd exportfs nfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mirror dm_mod button battery ac uhci_hcd ehci_hcd hw_random e1000 e100 mii floppy ext3 jbd
Feb 14 11:41:50 cfs1 kernel: CPU: 0
Feb 14 11:41:50 cfs1 kernel: EIP: 0060:[<e10864f6>] Tainted: GF VLI
Feb 14 11:41:50 cfs1 kernel: EFLAGS: 00010246 (2.6.9-42.7.ELsmp)
Feb 14 11:41:50 cfs1 kernel: EIP is at gfs_assert_i+0x38/0x69 [gfs]
Feb 14 11:41:50 cfs1 kernel: eax: 000000f8 ebx: e0c9e000 ecx: c7cb3b7c edx: e108d0a2
Feb 14 11:41:50 cfs1 kernel: esi: e108702a edi: e108a021 ebp: 0000024e esp: c7cb3b78
Feb 14 11:41:50 cfs1 kernel: ds: 007b es: 007b ss: 0068
Feb 14 11:41:50 cfs1 kernel: Process bonnie++ (pid: 8091, threadinfo=c7cb3000 task=df3c43b0)
Feb 14 11:41:50 cfs1 kernel: Stack: e108d0a2 e0cc2734 e108a021 e0cc2734 e108702a e0cc2734 e1089e34 0000024e
Feb 14 11:41:50 cfs1 kernel: e0cc2734 45d365fe 00000000 0000f190 c48b9080 e0c9e000 e106e880 e1089e34
Feb 14 11:41:50 cfs1 kernel: e0cc2734 45d365fe 00000000 0000f190 c48b9080 e0c9e000 e106e880 e1089e34
Feb 14 11:41:50 cfs1 kernel: 0000024e 00000007 00000000 0000f1a0 00000000 0000f1c0 00000000 00000008
Feb 14 11:41:50 cfs1 kernel: 0000024e 00000007 00000000 0000f1a0 00000000 0000f1c0 00000000 00000008
Feb 14 11:41:50 cfs1 kernel: Call Trace:
Feb 14 11:41:50 cfs1 kernel: [<e106e880>] check_seg_usage+0x197/0x19f [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e106ea03>] sync_trans+0x143/0x1b1 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e1071641>] quota_trans_size+0x20/0x36 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e106ebdd>] disk_commit+0xec/0x264 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e106edf6>] log_refund+0x61/0x187 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e106f235>] log_flush_internal+0xec/0x19e [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e106e131>] gfs_log_reserve+0x19e/0x20e [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e1063f77>] glock_wait_internal+0x1e3/0x1ef [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e10642e1>] gfs_glock_nq+0xe3/0x116 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e1085334>] gfs_trans_begin_i+0xfd/0x15a [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e106842e>] inode_init_and_link+0x1fe/0x388 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e10648a3>] gfs_glock_nq_init+0x13/0x26 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e10648f4>] gfs_glock_nq_num+0x2e/0x71 [gfs]
Feb 14 11:41:50 cfs1 kernel: [<e1068767>] gfs_createi+0x1af/0x1f1 [gfs]
Feb 14 11:41:51 cfs1 kernel: [<e107a9d0>] gfs_create+0x68/0x16f [gfs]
Feb 14 11:41:51 cfs1 kernel: [<c0167bd0>] vfs_create+0xbc/0x103
Feb 14 11:41:51 cfs1 kernel: [<c0167fa8>] open_namei+0x177/0x579
Feb 14 11:41:51 cfs1 kernel: [<c015a442>] filp_open+0x45/0x70
Feb 14 11:41:51 cfs1 kernel: [<c02d2e66>] __cond_resched+0x14/0x39
Feb 14 11:41:51 cfs1 kernel: [<c01c3326>] direct_strncpy_from_user+0x3e/0x5d
Feb 14 11:41:51 cfs1 kernel: [<c015a79e>] sys_open+0x31/0x7d
Feb 14 11:41:51 cfs1 kernel: [<c015a7fc>] sys_creat+0x12/0x16
Feb 14 11:41:51 cfs1 kernel: [<c02d48bb>] syscall_call+0x7/0xb
Feb 14 11:41:51 cfs1 kernel: [<c02d007b>] packet_rcv+0x8e/0x307
Feb 14 11:41:51 cfs1 kernel: Code: c3 83 b8 30 02 00 00 00 74 2c ff 35 90 7d 43 c0 8d 80 34 47 02 00 50 55 ff 74 24 20 50 51 50 52 50 68 a2 d0 08 e1 e8 ac c3 09 df <0f> 0b d3 00 25 d1 08 e1 83 c4 28 e8 9d f8 07 df ff 35 90 7d 43
Feb 14 11:41:51 cfs1 kernel: <0>Fatal exception: panic in 5 seconds
[root@cfs1 ~]$ gfs_tool counters /mnt/gfs2
locks 11451
locks held 11450
freeze count 0
incore inodes 5713
metadata buffers 5838
unlinked inodes 1
quota IDs 2
incore log buffers 893
log space used 98.41%
meta header cache entries 53
glock dependencies 0
glocks on reclaim list 0
log wraps 7
outstanding LM calls 0
outstanding BIO calls 0
fh2dentry misses 0
glocks reclaimed 9
glock nq calls 131554
glock dq calls 125830
glock prefetch calls 0
lm_lock calls 11484
lm_unlock calls 12
lm callbacks 11496
address operations 0
dentry operations 4
export operations 0
file operations 17177
inode operations 45696
super operations 45690
vm operations 0
block I/O reads 52
block I/O writes 17269
[root@cfs1 ~]$ iostat 2 1000 | grep hda12
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda12 0.42 1.52 1.84 2242 2720
hda12 0.00 0.00 0.00 0 0
hda12 0.00 0.00 0.00 0 0
hda12 0.00 0.00 0.00 0 0
hda12 0.00 0.00 0.00 0 0
hda12 2.51 0.00 20.10 0 40
hda12 0.00 0.00 0.00 0 0
hda12 0.00 0.00 0.00 0 0
hda12 0.00 0.00 0.00 0 0
hda12 0.00 0.00 0.00 0 0
hda12 4467.68 0.00 35741.41 0 70768
hda12 0.00 0.00 0.00 0 0
hda12 0.00 0.00 0.00 0 0
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster