On Thu, May 4, 2017 at 7:27 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> +ceph-devel
>
> On Thu, May 4, 2017 at 12:51 AM, James Poole <james.poole@xxxxxxxxxxxxx> wrote:
>> Hello,
>>
>> We currently have a ceph cluster supporting an Openshift cluster using
>> cephfs and dynamic rbd provisioning. The client nodes appear to be
>> triggering a kernel bug and are rebooting unexpectedly with the same message
>> each time. Clients are running CentOS 7:
>>
>> KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.10.2.el7.x86_64/vmlinux
>> DUMPFILE: /var/crash/127.0.0.1-2017-05-02-09:06:17/vmcore [PARTIAL
>> DUMP]
>> CPUS: 16
>> DATE: Tue May 2 09:06:15 2017
>> UPTIME: 00:43:14
>> LOAD AVERAGE: 1.52, 1.40, 1.48
>> TASKS: 7408
>> NODENAME: [redacted]
>> RELEASE: 3.10.0-514.10.2.el7.x86_64
>> VERSION: #1 SMP Fri Mar 3 00:04:05 UTC 2017
>> MACHINE: x86_64 (1997 Mhz)
>> MEMORY: 32 GB
>> PANIC: "kernel BUG at fs/ceph/inode.c:1197!"
>> PID: 133
>> COMMAND: "kworker/1:1"
>> TASK: ffff8801399bde20 [THREAD_INFO: ffff880138d0c000]
>> CPU: 1
>> STATE: TASK_RUNNING (PANIC)
>>
>> [ 2596.061470] ------------[ cut here ]------------
>> [ 2596.061499] kernel BUG at fs/ceph/inode.c:1197!
>> [ 2596.061516] invalid opcode: 0000 [#1] SMP
>> [ 2596.061535] Modules linked in: cfg80211 rfkill binfmt_misc veth ext4
>> mbcache jbd2 rbd xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4
>> xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype br_netfilter bridge stp llc dm_thin_pool
>> dm_persistent_data dm_bio_prison dm_bufio loop fuse ceph libceph
>> dns_resolver vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 iptable_nat
>> nf_nat_ipv4 nf_nat xt_limit nf_log_ipv4 vmw_vsock_vmci_transport
>> nf_log_common xt_LOG vsock nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_multiport xt_conntrack nf_conntrack iptable_filter
>> intel_powerclamp coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel
>> aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev vmw_balloon pcspkr sg vmw_vmci shpchp i2c_piix4
>> parport_pc
>> [ 2596.061875] parport nfsd nfs_acl lockd auth_rpcgss grace sunrpc
>> ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic
>> ata_generic pata_acpi vmwgfx drm_kms_helper
>> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul
>> crct10dif_common mptspi crc32c_intel drm ata_piix scsi_transport_spi
>> serio_raw mptscsih libata mptbase vmxnet3 i2c_core fjes dm_mirror dm_region_hash dm_log dm_mod
>> [ 2596.062042] CPU: 1 PID: 133 Comm: kworker/1:1 Not tainted
>> 3.10.0-514.10.2.el7.x86_64 #1
>> [ 2596.062070] Hardware name: VMware, Inc.
VMware Virtual Platform/440BX
>> Desktop Reference Platform, BIOS 6.00 09/17/2015
>> [ 2596.062118] Workqueue: ceph-msgr ceph_con_workfn [libceph]
>> [ 2596.062140] task: ffff8801399bde20 ti: ffff880138d0c000 task.ti:
>> ffff880138d0c000
>> [ 2596.062166] RIP: 0010:[<ffffffffa05d96c3>] [<ffffffffa05d96c3>]
>> ceph_fill_trace+0x893/0xa00 [ceph]
>> [ 2596.062209] RSP: 0000:ffff880138d0fb80 EFLAGS: 00010287
>> [ 2596.062230] RAX: ffff88083b079680 RBX: ffff8801efe86760 RCX:
>> ffff880095e26c00
>> [ 2596.062257] RDX: ffff880003e8f2c0 RSI: ffff88053b4c0a08 RDI:
>> ffff88053b4c0a00
>> [ 2596.062288] RBP: ffff880138d0fbf8 R08: ffff880003e8f2c0 R09:
>> 0000000000000000
>> [ 2596.062320] R10: 0000000000000001 R11: ffff8804256f3ac0 R12:
>> ffff880121d15400
>> [ 2596.062351] R13: ffff880138dd4000 R14: ffff88007053f280 R15:
>> ffff8807ee10f2c0
>> [ 2596.062379] FS: 0000000000000000(0000) GS:ffff88013b840000(0000)
>> knlGS:0000000000000000
>> [ 2596.062413] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [ 2596.062436] CR2: 00007fe3bab2dcd0 CR3: 000000042ebe0000 CR4:
>> 00000000001407e0
>> [ 2596.062498] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 2596.062540] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> [ 2596.062567] Stack:
>> [ 2596.062578] ffff880121d15778 ffff880121d15718 ffff880138d0fc50
>> ffff880095e26e7a
>> [ 2596.062612] ffff880035c12400 ffff88053b4c7800 000000003b4c0800
>> ffff880138d0fbb8
>> [ 2596.062645] ffff880138d0fbb8 00000000a5446715 ffff88053b4c0800
>> ffff88008238ee10
>> [ 2596.062681] Call Trace:
>> [ 2596.062703] [<ffffffffa05f96a8>] handle_reply+0x3e8/0xc80 [ceph]
>> [ 2596.062736] [<ffffffffa05fbd39>] dispatch+0xd9/0xaf0 [ceph]
>> [ 2596.062762] [<ffffffff815559ca>] ? kernel_recvmsg+0x3a/0x50
>> [ 2596.062790] [<ffffffffa057ceff>] try_read+0x4bf/0x1220 [libceph]
>> [ 2596.062819] [<ffffffffa057b743>] ?
try_write+0xa13/0xe60 [libceph]
>> [ 2596.062851] [<ffffffffa057dd19>] ceph_con_workfn+0xb9/0x650 [libceph]
>> [ 2596.062878] [<ffffffff810a810b>] process_one_work+0x17b/0x470
>> [ 2596.062902] [<ffffffff810a8f46>] worker_thread+0x126/0x410
>> [ 2596.062925] [<ffffffff810a8e20>] ? rescuer_thread+0x460/0x460
>> [ 2596.062949] [<ffffffff810b06ff>] kthread+0xcf/0xe0
>> [ 2596.064014] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
>> [ 2596.065010] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
>> [ 2596.065955] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
>> [ 2596.066945] Code: e8 c3 2b d6 e0 e9 ca fa ff ff 4c 89 fa 48 c7 c6 07 d0
>> 60 a0 48 c7 c7 50 24 61 a0 31 c0 e8 a6 2b d6 e0 e9 cd fa ff ff 0f 0b 0f 0b
>> <0f> 0b 0f 0b 48 8b 83 c8 fc ff ff
>> 4c 8b 89 c8 fc ff ff 4c 89 fa
>> [ 2596.069127] RIP [<ffffffffa05d96c3>] ceph_fill_trace+0x893/0xa00 [ceph]
>> [ 2596.070120] RSP <ffff880138d0fb80>
>>

This issue is fixed by upstream commit 3dd69aabc "ceph: add a new flag to
indicate whether parent is locked". But we haven't backported it to the RHEL
kernel.
Can you use a 4.10 kernel instead?

Regards
Yan, Zheng

>>
>> Just before the above there are lots of messages similar to this from all
>> ceph node ips:
>> [ 933.282441] [IPTABLES:INPUT] dropped IN=eno33557248 OUT=
>> MAC=00:50:56:0f:9a:47:00:50:56:35:28:f1:08:00 SRC=192.168.5.6
>> DST=192.168.3.2 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=20778 DF PROTO=TCP SPT=6816 DPT=47140 WINDOW=2406 RES=0x00 ACK FIN URGP=0
>> [ 933.922440] [IPTABLES:INPUT] dropped IN=eno33557248 OUT=
>> MAC=00:50:56:0f:9a:47:00:50:56:35:28:f1:08:00 SRC=192.168.5.6
>> DST=192.168.3.2 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=1440 DF PROTO=TCP SPT=6800 DPT=56290 WINDOW=2889 RES=0x00 ACK FIN URGP=0
>> [ 934.031555] [IPTABLES:INPUT] dropped IN=eno33557248 OUT=
>> MAC=00:50:56:0f:9a:47:00:50:56:26:f3:39:08:00 SRC=192.168.5.7
>> DST=192.168.3.2 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=58232 DF PROTO=TCP SPT=6812 DPT=59564 WINDOW=8433 RES=0x00 ACK FIN URGP=0
>> [ 934.031579] [IPTABLES:INPUT] dropped IN=eno33557248 OUT=
>> MAC=00:50:56:0f:9a:47:00:50:56:26:f3:39:08:00 SRC=192.168.5.7
>> DST=192.168.3.2 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=20084 DF PROTO=TCP SPT=6816 DPT=55574 WINDOW=2925 RES=0x00 ACK FIN URGP=0
>> [ 934.105440] [IPTABLES:INPUT] dropped IN=eno33557248 OUT=
>> MAC=00:50:56:0f:9a:47:00:50:56:37:f8:4c:08:00 SRC=192.168.5.4
>> DST=192.168.3.2 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=48428 DF PROTO=TCP SPT=6804 DPT=59156 WINDOW=6422 RES=0x00 ACK FIN URGP=0
>> [ 935.133060] [IPTABLES:INPUT] dropped IN=eno33557248 OUT=
>> MAC=00:50:56:0f:9a:47:00:50:56:0d:13:27:08:00 SRC=192.168.5.3
>> DST=192.168.3.2 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=35384 DF PROTO=TCP SPT=6817 DPT=52674 WINDOW=24576 RES=0x00 ACK FIN URGP=0
>>
>> Many thanks
>>
>> James
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Cheers,
> Brad
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html