We're testing cmirror on 4.6 at the moment. Yesterday, one of our cluster nodes lost sight of an AoE disk. Unfortunately, cmirror wasn't able to recover:

Jan 14 12:26:14 node06 lvm[3391]: Mirror device, 253:9, has failed.
Jan 14 12:26:14 node06 lvm[3391]: Device failure in lvm_cluster1-desktop--home--1
Jan 14 12:26:14 node06 lvm[3391]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Jan 14 12:26:14 node06 lvm[3391]: WARNING: dev_open(/dev/ramdisk) called while suspended
Jan 14 12:26:14 node06 lvm[3391]: WARNING: dev_open(/dev/cdrom) called while suspended
[ Lots more similar WARNINGs ]
Jan 14 12:26:14 node06 lvm[3391]: Another thread is handling an event. Waiting...
Jan 14 12:26:16 node06 lvm[3391]: Failed to remove faulty devices in lvm_cluster1-desktop--home--1
Jan 14 12:26:16 node06 lvm[3391]: Mirror device, 253:14, has failed.
Jan 14 12:26:16 node06 lvm[3391]: Device failure in lvm_cluster1-var
Jan 14 12:26:16 node06 lvm[3391]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Jan 14 12:26:16 node06 lvm[3391]: Another thread is handling an event. Waiting...
Jan 14 12:26:16 node06 lvm[3391]: WARNING: dev_open(/dev/ramdisk) called while suspended
Jan 14 12:26:16 node06 lvm[3391]: WARNING: dev_open(/dev/cdrom) called while suspended
Jan 14 12:26:16 node06 lvm[3391]: WARNING: dev_open(/dev/loop0) called while suspended
[ Lots more similar WARNINGs ]
Jan 14 12:26:18 node06 lvm[3391]: Failed to remove faulty devices in lvm_cluster1-var
Jan 14 12:26:18 node06 lvm[3391]: Mirror device, 253:9, has failed.
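For reference, this is roughly what I've been running on the other nodes to check the state of the mirror. The health-character detail of the mirror status line is from memory, and the vgreduce/lvconvert lines are only what I'd expect the manual fallback to be, not something I've actually run yet:

    # Device-mapper's view of the mirror; for a mirror target the status line
    # should end with one health character per leg (A = alive, D = dead/failed).
    dmsetup status lvm_cluster1-var

    # LVM's view of the mirror images and the PVs underneath them.
    lvs -a -o +devices lvm_cluster1

    # Manual fallback I'd expect to need if dmeventd never cleans up itself:
    # drop the missing PV and convert the mirror down to a linear LV.
    #vgreduce --removemissing lvm_cluster1
    #lvconvert -m0 lvm_cluster1/var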
These lvm errors keep looping. A couple of minutes later, though:

Jan 14 12:28:12 node03 kernel: GFS: fsid=cluster1:var.3: fatal: filesystem consistency error
Jan 14 12:28:12 node03 kernel: GFS: fsid=cluster1:var.3: inode = 4391966/4391966
Jan 14 12:28:12 node03 kernel: GFS: fsid=cluster1:var.3: function = gfs_change_nlink
Jan 14 12:28:12 node03 kernel: GFS: fsid=cluster1:var.3: file = /builddir/build/BUILD/gfs-kernel-2.6.9-75/hugemem/src/gfs/inode.c, line = 845
Jan 14 12:28:12 node03 kernel: GFS: fsid=cluster1:var.3: time = 1200313692
Jan 14 12:28:12 node03 kernel: ------------[ cut here ]------------
Jan 14 12:28:12 node03 kernel: kernel BUG at fs/locks.c:1805!
Jan 14 12:28:12 node03 kernel: invalid operand: 0000 [#1]
Jan 14 12:28:12 node03 kernel: GFS: fsid=cluster1:var.3: about to withdraw from the cluster
Jan 14 12:28:12 node03 kernel: SMP GFS: fsid=cluster1:var.3: waiting for outstanding I/O
Jan 14 12:28:12 node03 kernel:
Jan 14 12:28:12 node03 kernel: GFS: fsid=cluster1:var.3: telling LM to withdraw
Jan 14 12:28:12 node03 kernel: Modules linked in: i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) ext3 jbd dm_cmirror(U) dm_mirror dlm(U) cman(U) bonding(U) md5 ipv6 aoe(U) dm_mod button battery ac uhci_hcd ehci_hcd e1000 sd_mod ata_piix libata scsi_mod
Jan 14 12:28:12 node03 kernel: CPU: 2
Jan 14 12:28:12 node03 kernel: EIP: 0060:[<0216eb06>] Not tainted VLI
Jan 14 12:28:12 node03 kernel: EFLAGS: 00010246 (2.6.9-67.ELhugemem)
Jan 14 12:28:12 node03 kernel: EIP is at locks_remove_flock+0xa1/0xe1
Jan 14 12:28:12 node03 kernel: eax: db70724c ebx: db707c0c ecx: daf4e184 edx: 00000001
Jan 14 12:28:12 node03 kernel: esi: 00000000 edi: e110c998 ebp: 5fd4f6c0 esp: d9178f18
Jan 14 12:28:12 node03 kernel: ds: 007b es: 007b ss: 0068
Jan 14 12:28:12 node03 kernel: Process rrdtool (pid: 19569, threadinfo=d9178000 task=5eff4d70)
Jan 14 12:28:12 node03 kernel: Stack: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Jan 14 12:28:12 node03 kernel: 00000000 00000000 00000000 00000000 00000202 00000000 00000000 00000000
Jan 14 12:28:12 node03 kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Jan 14 12:28:12 node03 kernel: Call Trace:
Jan 14 12:28:12 node03 kernel: [<0215ba4e>] __fput+0x41/0x100
Jan 14 12:28:12 node03 kernel: [<0215a763>] filp_close+0x59/0x5f
Jan 14 12:28:12 node03 kernel: [<0215a7c4>] sys_close+0x5b/0x92
Jan 14 12:28:12 node03 kernel: Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 c9 e0 ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 3c ee ff ff eb 0a <0f> 0b 0d 07 f5 54 2e 02 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Jan 14 12:28:12 node03 kernel: <0>Fatal exception: panic in 5 seconds
Jan 14 12:28:16 node03 kernel: lock_dlm: withdraw abandoned memory
Jan 14 12:28:16 node03 kernel: GFS: fsid=cluster1:var.3: withdrawn
Jan 14 12:28:16 node03 kernel: mh_magic = 0x01161970
Jan 14 12:28:16 node03 kernel: mh_type = 4
Jan 14 12:28:17 node03 kernel: mh_generation = 3008
Jan 14 12:28:17 node03 kernel: mh_format = 400
Jan 14 12:28:17 node03 kernel: mh_incarn = 1252
Jan 14 12:28:17 node03 kernel: no_formal_ino = 4391966
Jan 14 12:28:17 node03 kernel: no_addr = 4391966
Jan 14 12:28:17 node03 kernel: di_mode = 0644
Jan 14 12:28:17 node03 kernel: di_uid = 402
Jan 14 12:28:17 node03 kernel: di_gid = 402
Jan 14 12:28:17 node03 kernel: di_nlink = 0
Jan 14 12:28:17 node03 kernel: di_size = 3975
Jan 14 12:28:17 node03 kernel: di_blocks = 2
Jan 14 12:28:17 node03 kernel: di_atime = 1200313668
Jan 14 12:28:17 node03 kernel: di_mtime = 1200313668
Jan 14 12:28:17 node03 kernel: di_ctime = 1200313687
Jan 14 12:28:17 node03 kernel: di_major = 0
Jan 14 12:28:17 node03 kernel: di_minor = 0
Jan 14 12:28:17 node03 kernel: di_rgrp = 4390969
Jan 14 12:28:17 node03 kernel: di_goal_rgrp = 16384000
Jan 14 12:28:17 node03 kernel: di_goal_dblk = 5788
Jan 14 12:28:17 node03 kernel: di_goal_mblk = 0
Jan 14 12:28:17 node03 kernel: di_flags = 0x00000000
Jan 14 12:28:17 node03 kernel: di_payload_format = 0
Jan 14 12:28:17 node03 kernel: di_type = 1
Jan 14 12:28:17 node03 kernel: di_height = 1
Jan 14 12:28:17 node03 kernel: di_incarn = 0
Jan 14 12:28:17 node03 kernel: di_pad = 0
Jan 14 12:28:17 node03 kernel: di_depth = 0
Jan 14 12:28:17 node03 kernel: di_entries = 0
Jan 14 12:28:17 node03 kernel: no_formal_ino = 0
Jan 14 12:28:17 node03 kernel: no_addr = 0
Jan 14 12:28:17 node03 kernel: di_eattr = 0
Jan 14 12:28:17 node03 kernel: di_reserved =
Jan 14 12:28:17 node03 kernel: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Jan 14 12:28:17 node03 last message repeated 2 times
Jan 14 12:28:17 node03 kernel: 00 00 00 00 00 00 00 00

Only the one node had problems accessing the AoE disk, and all the other nodes that didn't touch the specific directory with the corruption continued to happily read and write to the filesystem, right up until I took the whole cluster down for an fsck (which recovered everything nicely).

I assume the kernel BUG is just a result of node03's less-than-graceful exit from the cluster while rrdtool was holding some locks, but I'm not sure why the mirror didn't degrade to linear, nor why we ended up with filesystem corruption at all. We're running kernel-hugemem-2.6.9-67.EL, so we don't have the fix for bz399661, but that bug affects I/O after a successful recovery, so I don't think it's the cause here.

Has anyone got any ideas where I can look next to track this down?

Thanks,

Robert
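P.S. To narrow down why the mirror never dropped the failed leg, I'm considering faking a leg failure on a scratch mirror rather than pulling the AoE shelf again, by swapping one mirror image for a dm error target. Something along these lines; this assumes a single-segment leg, and the LV name "scratch" with its _mimage_1 sub-LV are only illustrative:

    # Current table of one mirror leg (a single "linear" line if it's one segment).
    dmsetup table lvm_cluster1-scratch_mimage_1

    # Reload the leg with an error target of the same length and resume it,
    # so every read/write to that leg fails immediately.
    SECTORS=$(dmsetup table lvm_cluster1-scratch_mimage_1 | awk '{print $2}')
    dmsetup suspend lvm_cluster1-scratch_mimage_1
    echo "0 $SECTORS error" | dmsetup load lvm_cluster1-scratch_mimage_1
    dmsetup resume lvm_cluster1-scratch_mimage_1

If that reproduces the "Failed to remove faulty devices" loop without any real hardware involved, at least it would be easier to gather debugging from cmirror/dmeventd. Does that sound like a sensible approach?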