I think Ilya recently looked into a bug that can occur when
CONFIG_HARDENED_USERCOPY is enabled and the IO's TCP message goes
through the loopback interface (i.e. co-located OSDs and krbd).
Assuming that you have the same setup, you might be hitting the same
bug.

On Thu, Jan 10, 2019 at 6:46 PM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> On Fri, Jan 11, 2019 at 12:20 AM Rom Freiman <rom@xxxxxxxxxxxxxxx> wrote:
> >
> > Hey,
> > After upgrading to centos7.6, I started encountering the following kernel panic
> >
> > [17845.147263] XFS (rbd4): Unmounting Filesystem
> > [17846.860221] rbd: rbd4: capacity 3221225472 features 0x1
> > [17847.109887] XFS (rbd4): Mounting V5 Filesystem
> > [17847.191646] XFS (rbd4): Ending clean mount
> > [17861.663757] rbd: rbd5: capacity 3221225472 features 0x1
> > [17862.930418] usercopy: kernel memory exposure attempt detected from ffff9d54d26d8800 (kmalloc-512) (1024 bytes)
> > [17862.941698] ------------[ cut here ]------------
> > [17862.946854] kernel BUG at mm/usercopy.c:72!
> > [17862.951524] invalid opcode: 0000 [#1] SMP
> > [17862.956123] Modules linked in: vhost_net vhost macvtap macvlan tun xt_REDIRECT nf_nat_redirect ip6table_mangle xt_nat xt_mark xt_connmark xt_CHECKSUM ip6table_raw xt_physdev iptable_mangle veth iptable_raw rbd libceph dns_resolver ebtable_filter ebtables ip6table_filter ip6_tables xt_comment mlx4_en(OE) mlx4_core(OE) xt_multiport ipt_REJECT nf_reject_ipv4 nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc xfs openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack mlx5_core(OE) mlxfw(OE) iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass pcspkr joydev sg mei_me lpc_ich i2c_i801 mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler
> > [17863.036328] dm_multipath ip_tables ext4 mbcache jbd2 dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel mgag200 igb aesni_intel isci lrw gf128mul glue_helper ablk_helper ahci drm_kms_helper cryptd libsas dca syscopyarea sysfillrect sysimgblt fb_sys_fops ttm libahci scsi_transport_sas ptp drm libata pps_core mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit devlink wmi scsi_transport_iscsi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mlx4_core]
> > [17863.094372] CPU: 3 PID: 71755 Comm: msgr-worker-1 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.1.3.el7.x86_64 #1
> > [17863.107673] Hardware name: Intel Corporation S2600JF/S2600JF, BIOS SE5C600.86B.02.06.0006.032420170950 03/24/2017
> > [17863.119134] task: ffff9d4e8e33e180 ti: ffff9d53dbaf8000 task.ti: ffff9d53dbaf8000
> > [17863.127489] RIP: 0010:[<ffffffffa5e3e167>] [<ffffffffa5e3e167>] __check_object_size+0x87/0x250
> > [17863.137217] RSP: 0018:ffff9d53dbafbb98 EFLAGS: 00010246
> > [17863.143140] RAX: 0000000000000062 RBX: ffff9d54d26d8800 RCX: 0000000000000000
> > [17863.151106] RDX: 0000000000000000 RSI: ffff9d557bad3898 RDI: ffff9d557bad3898
> > [17863.159072] RBP: ffff9d53dbafbbb8 R08: 0000000000000000 R09: 0000000000000000
> > [17863.167038] R10: 0000000000000d0f R11: ffff9d53dbafb896 R12: 0000000000000400
> > [17863.175001] R13: 0000000000000001 R14: ffff9d54d26d8c00 R15: 0000000000000400
> > [17863.182968] FS: 00007f531fa98700(0000) GS:ffff9d557bac0000(0000) knlGS:0000000000000000
> > [17863.192001] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [17863.198414] CR2: 00007f4438516930 CR3: 0000000f19236000 CR4: 00000000001627e0
> > [17863.206379] Call Trace:
> > [17863.209114] [<ffffffffa5f8c0dd>] memcpy_toiovec+0x4d/0xb0
> > [17863.215240] [<ffffffffa622a858>] skb_copy_datagram_iovec+0x128/0x280
> > [17863.222434] [<ffffffffa629172a>] tcp_recvmsg+0x22a/0xb30
> > [17863.228463] [<ffffffffa62c00e0>] inet_recvmsg+0x80/0xb0
> > [17863.234395] [<ffffffffa62186ec>] sock_aio_read.part.9+0x14c/0x170
> > [17863.241297] [<ffffffffa5cd676b>] ? wake_up_q+0x5b/0x80
> > [17863.247129] [<ffffffffa6218731>] sock_aio_read+0x21/0x30
> > [17863.253157] [<ffffffffa5e40743>] do_sync_read+0x93/0xe0
> > [17863.259087] [<ffffffffa5e41225>] vfs_read+0x145/0x170
> > [17863.264823] [<ffffffffa5e4203f>] SyS_read+0x7f/0xf0
> > [17863.270366] [<ffffffffa6374ddb>] system_call_fastpath+0x22/0x27
> > [17863.277061] Code: 45 d1 48 c7 c6 d4 b6 67 a6 48 c7 c1 e0 4b 68 a6 48 0f 45 f1 49 89 c0 4d 89 e1 48 89 d9 48 c7 c7 d0 1a 68 a6 31 c0 e8 20 d5 51 00 <0f> 0b 0f 1f 80 00 00 00 00 48 c7 c0 00 00 c0 a5 4c 39 f0 73 0d
> > [17863.298802] RIP [<ffffffffa5e3e167>] __check_object_size+0x87/0x250
> > [17863.305912] RSP <ffff9d53dbafbb98>
> >
> > It seems to be related to rbd operations but I cannot pinpoint directly the reason.
>
> To me this seems to be an issue in the networking subsystem and there
> is nothing, at this stage, that implicates the ceph modules.
>
> If the Mellanox modules are involved in any way I would start looking
> there (not because I am biased against them, but because experience
> tells me that is the place to start) and then move on to the other
> networking modules and the kernel more generally. This looks like some
> sort of memory accounting error in the networking subsystem. I could
> be wrong, of course, but there would need to be further data to tell
> either way. I'd suggest capturing a vmcore and getting someone to
> analyse it for you would be a good next step.
>
> >
> > Versions:
> > CentOS Linux release 7.6.1810 (Core)
> > Linux stratonode1.node.strato 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > librbd1-12.2.8-0.el7.x86_64
> >
> >
> > [root@stratonode1 ~]# modinfo libceph
> > filename: /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/net/ceph/libceph.ko.xz
> > license: GPL
> > description: Ceph core library
> > author: Patience Warnick <patience@xxxxxxxxxxxx>
> > author: Yehuda Sadeh <yehuda@xxxxxxxxxxxxxxx>
> > author: Sage Weil <sage@xxxxxxxxxxxx>
> > retpoline: Y
> > rhelversion: 7.6
> > srcversion: 4F8CE6AEFA99B11C267981D
> > depends: libcrc32c,dns_resolver
> > intree: Y
> > vermagic: 3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
> > signer: CentOS Linux kernel signing key
> > sig_key: E7:CE:F3:61:3A:9B:8B:D0:12:FA:E7:49:82:72:15:9B:B1:87:9C:65
> > sig_hashalgo: sha256
> > [root@stratonode1 ~]# modinfo rbd
> > filename: /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/block/rbd.ko.xz
> > license: GPL
> > description: RADOS Block Device (RBD) driver
> > author: Jeff Garzik <jeff@xxxxxxxxxx>
> > author: Yehuda Sadeh <yehuda@xxxxxxxxxxxxxxx>
> > author: Sage Weil <sage@xxxxxxxxxxxx>
> > author: Alex Elder <elder@xxxxxxxxxxx>
> > retpoline: Y
> > rhelversion: 7.6
> > srcversion: 5386BBBD00C262C66CB81F5
> > depends: libceph
> > intree: Y
> > vermagic: 3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
> > signer: CentOS Linux kernel signing key
> > sig_key: E7:CE:F3:61:3A:9B:8B:D0:12:FA:E7:49:82:72:15:9B:B1:87:9C:65
> > sig_hashalgo: sha256
> > parm: single_major:Use a single major number for all rbd devices (default: true) (bool)
> >
> > I reported the issue here as well:
> > https://bugs.centos.org/view.php?id=15681
> >
> >
> > Help will be appreciated.
> >
> > Thanks,
> > Rom
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> --
> Cheers,
> Brad

--
Jason
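For anyone reading the trace above, the usercopy report line can be sanity-checked with a few lines of Python. This is only a rough sketch and not kernel code: the function names (parse_usercopy_report, kmalloc_object_size, copy_fits) are made up for illustration, it assumes the "kmalloc-N" cache name gives the object size, and it assumes the best case where the pointer sits at the start of the slab object.

```python
import re

# Matches the hardened-usercopy report format seen in the quoted log.
USERCOPY_RE = re.compile(
    r"usercopy: kernel memory (exposure|overwrite) attempt detected "
    r"(from|to) ([0-9a-f]+) \(([^)]+)\) \((\d+) bytes\)"
)


def parse_usercopy_report(line):
    """Extract address, slab cache name and copy length from a report line.

    Returns None if the line is not a usercopy report.
    """
    m = USERCOPY_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m.group(1),
        "addr": int(m.group(3), 16),
        "cache": m.group(4),
        "length": int(m.group(5)),
    }


def kmalloc_object_size(cache):
    """Object size implied by a 'kmalloc-N' cache name (an assumption), else None."""
    m = re.fullmatch(r"kmalloc-(\d+)", cache)
    return int(m.group(1)) if m else None


def copy_fits(length, object_size, offset=0):
    """Simplified bounds check: does [offset, offset + length) stay inside the object?"""
    return 0 <= offset <= object_size and length <= object_size - offset


line = ("[17862.930418] usercopy: kernel memory exposure attempt detected "
        "from ffff9d54d26d8800 (kmalloc-512) (1024 bytes)")
report = parse_usercopy_report(line)
size = kmalloc_object_size(report["cache"])

print(hex(report["addr"] + report["length"]))  # end address of the attempted copy
print(copy_fits(report["length"], size))       # False: 1024 bytes > 512-byte object
```

The computed end address, 0xffff9d54d26d8c00, is the value in R14 in the register dump, and R12/R15 hold 0x400 (1024), so the numbers in the report line and the registers are at least consistent with each other. That only shows why the check fired; it says nothing about which subsystem set up the oversized copy.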