Re: centos 7.6 kernel panic caused by osd

Jason Dillaman <jdillama@xxxxxxxxxx> · Thu, 10 Jan 2019 18:57:22 -0500

I think Ilya recently looked into a bug that can occur when
CONFIG_HARDENED_USERCOPY is enabled and the IO's TCP message goes
through the loopback interface (i.e. co-located OSDs and krbd).
Assuming that you have the same setup, you might be hitting the same
bug.
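
A quick way to check the first half of that condition is to look at the running kernel's build config. The snippet below is only a rough sketch: it assumes the config is installed as /boot/config-<release> (the usual location on CentOS/RHEL), and the function names are made up for illustration. It does not tell you whether OSD traffic actually goes over loopback.

```python
import platform
from pathlib import Path


def usercopy_hardening_enabled(config_text: str) -> bool:
    """Return True if the kernel config text sets CONFIG_HARDENED_USERCOPY=y."""
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options show up as "# CONFIG_FOO is not set" comments.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Exact key match, so CONFIG_HARDENED_USERCOPY_* variants are ignored.
        if key == "CONFIG_HARDENED_USERCOPY":
            return value == "y"
    return False


def read_running_kernel_config() -> str:
    # Assumption: the distro ships the config as /boot/config-<release>.
    path = Path("/boot") / f"config-{platform.release()}"
    return path.read_text()


if __name__ == "__main__":
    try:
        enabled = usercopy_hardening_enabled(read_running_kernel_config())
        print("CONFIG_HARDENED_USERCOPY enabled:", enabled)
    except OSError as exc:
        print("could not read kernel config:", exc)
```

If this reports True and the OSDs run on the same host that maps the rbd devices, the setup matches the one described above.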

On Thu, Jan 10, 2019 at 6:46 PM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>
> On Fri, Jan 11, 2019 at 12:20 AM Rom Freiman <rom@xxxxxxxxxxxxxxx> wrote:
> >
> > Hey,
> > After upgrading to CentOS 7.6, I started encountering the following kernel panic:
> >
> > [17845.147263] XFS (rbd4): Unmounting Filesystem
> > [17846.860221] rbd: rbd4: capacity 3221225472 features 0x1
> > [17847.109887] XFS (rbd4): Mounting V5 Filesystem
> > [17847.191646] XFS (rbd4): Ending clean mount
> > [17861.663757] rbd: rbd5: capacity 3221225472 features 0x1
> > [17862.930418] usercopy: kernel memory exposure attempt detected from ffff9d54d26d8800 (kmalloc-512) (1024 bytes)
> > [17862.941698] ------------[ cut here ]------------
> > [17862.946854] kernel BUG at mm/usercopy.c:72!
> > [17862.951524] invalid opcode: 0000 [#1] SMP
> > [17862.956123] Modules linked in: vhost_net vhost macvtap macvlan tun xt_REDIRECT nf_nat_redirect ip6table_mangle xt_nat xt_mark xt_connmark xt_CHECKSUM ip6table_raw xt_physdev iptable_mangle veth iptable_raw rbd libceph dns_resolver ebtable_filter ebtables ip6table_filter ip6_tables xt_comment mlx4_en(OE) mlx4_core(OE) xt_multiport ipt_REJECT nf_reject_ipv4 nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc xfs openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack mlx5_core(OE) mlxfw(OE) iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass pcspkr joydev sg mei_me lpc_ich i2c_i801 mei ioatdma ipmi_si ipmi_devintf ipmi_msghandler
> > [17863.036328]  dm_multipath ip_tables ext4 mbcache jbd2 dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel mgag200 igb aesni_intel isci lrw gf128mul glue_helper ablk_helper ahci drm_kms_helper cryptd libsas dca syscopyarea sysfillrect sysimgblt fb_sys_fops ttm libahci scsi_transport_sas ptp drm libata pps_core mlx_compat(OE) drm_panel_orientation_quirks i2c_algo_bit devlink wmi scsi_transport_iscsi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mlx4_core]
> > [17863.094372] CPU: 3 PID: 71755 Comm: msgr-worker-1 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.1.3.el7.x86_64 #1
> > [17863.107673] Hardware name: Intel Corporation S2600JF/S2600JF, BIOS SE5C600.86B.02.06.0006.032420170950 03/24/2017
> > [17863.119134] task: ffff9d4e8e33e180 ti: ffff9d53dbaf8000 task.ti: ffff9d53dbaf8000
> > [17863.127489] RIP: 0010:[<ffffffffa5e3e167>]  [<ffffffffa5e3e167>] __check_object_size+0x87/0x250
> > [17863.137217] RSP: 0018:ffff9d53dbafbb98  EFLAGS: 00010246
> > [17863.143140] RAX: 0000000000000062 RBX: ffff9d54d26d8800 RCX: 0000000000000000
> > [17863.151106] RDX: 0000000000000000 RSI: ffff9d557bad3898 RDI: ffff9d557bad3898
> > [17863.159072] RBP: ffff9d53dbafbbb8 R08: 0000000000000000 R09: 0000000000000000
> > [17863.167038] R10: 0000000000000d0f R11: ffff9d53dbafb896 R12: 0000000000000400
> > [17863.175001] R13: 0000000000000001 R14: ffff9d54d26d8c00 R15: 0000000000000400
> > [17863.182968] FS:  00007f531fa98700(0000) GS:ffff9d557bac0000(0000) knlGS:0000000000000000
> > [17863.192001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [17863.198414] CR2: 00007f4438516930 CR3: 0000000f19236000 CR4: 00000000001627e0
> > [17863.206379] Call Trace:
> > [17863.209114]  [<ffffffffa5f8c0dd>] memcpy_toiovec+0x4d/0xb0
> > [17863.215240]  [<ffffffffa622a858>] skb_copy_datagram_iovec+0x128/0x280
> > [17863.222434]  [<ffffffffa629172a>] tcp_recvmsg+0x22a/0xb30
> > [17863.228463]  [<ffffffffa62c00e0>] inet_recvmsg+0x80/0xb0
> > [17863.234395]  [<ffffffffa62186ec>] sock_aio_read.part.9+0x14c/0x170
> > [17863.241297]  [<ffffffffa5cd676b>] ? wake_up_q+0x5b/0x80
> > [17863.247129]  [<ffffffffa6218731>] sock_aio_read+0x21/0x30
> > [17863.253157]  [<ffffffffa5e40743>] do_sync_read+0x93/0xe0
> > [17863.259087]  [<ffffffffa5e41225>] vfs_read+0x145/0x170
> > [17863.264823]  [<ffffffffa5e4203f>] SyS_read+0x7f/0xf0
> > [17863.270366]  [<ffffffffa6374ddb>] system_call_fastpath+0x22/0x27
> > [17863.277061] Code: 45 d1 48 c7 c6 d4 b6 67 a6 48 c7 c1 e0 4b 68 a6 48 0f 45 f1 49 89 c0 4d 89 e1 48 89 d9 48 c7 c7 d0 1a 68 a6 31 c0 e8 20 d5 51 00 <0f> 0b 0f 1f 80 00 00 00 00 48 c7 c0 00 00 c0 a5 4c 39 f0 73 0d
> > [17863.298802] RIP  [<ffffffffa5e3e167>] __check_object_size+0x87/0x250
> > [17863.305912]  RSP <ffff9d53dbafbb98>
> >
> > It seems to be related to rbd operations, but I cannot pinpoint the exact cause.
>
> To me this seems to be an issue in the networking subsystem and there
> is nothing, at this stage, that implicates the ceph modules.
>
> If the Mellanox modules are involved in any way I would start looking
> there (not because I am biased against them, but because experience
> tells me that is the place to start) and then move on to the other
> networking modules and the kernel more generally. This looks like some
> sort of memory accounting error in the networking subsystem. I could
> be wrong, of course, but there would need to be further data to tell
> either way. Capturing a vmcore and getting someone to analyse it for
> you would be a good next step.
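
For reference, the usercopy line in the trace already carries the key numbers: the copy was 1024 bytes long but started in a kmalloc-512 object, and that span crossing is what the hardened usercopy check rejects. The sketch below pulls those numbers out of such a line. The regular expression is derived only from the single line in this report, so treat the pattern (and the helper name) as an assumption rather than a complete parser for every kernel's message format.

```python
import re

# Pattern modelled on the one line quoted in this thread.
USERCOPY_RE = re.compile(
    r"usercopy: kernel memory exposure attempt detected from "
    r"(?P<addr>[0-9a-f]+) \((?P<cache>[^)]+)\) \((?P<size>\d+) bytes\)"
)


def parse_usercopy_report(line: str):
    """Extract address, slab cache and copy size from a usercopy report line."""
    match = USERCOPY_RE.search(line)
    if match is None:
        return None
    cache = match.group("cache")
    size = int(match.group("size"))
    # For kmalloc-N caches the object size is N bytes; other caches are
    # left as unknown here.
    kmalloc = re.fullmatch(r"kmalloc-(\d+)", cache)
    object_size = int(kmalloc.group(1)) if kmalloc else None
    overrun = size - object_size if object_size is not None else None
    return {
        "addr": int(match.group("addr"), 16),
        "cache": cache,
        "copy_size": size,
        "object_size": object_size,
        "overrun": overrun,
    }


line = (
    "[17862.930418] usercopy: kernel memory exposure attempt detected "
    "from ffff9d54d26d8800 (kmalloc-512) (1024 bytes)"
)
report = parse_usercopy_report(line)
print(report["cache"], report["copy_size"], report["overrun"])
```

For the line above, the copy runs 512 bytes past the end of the slab object it starts in.
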
>
> >
> > Versions:
> > CentOS Linux release 7.6.1810 (Core)
> > Linux stratonode1.node.strato 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > librbd1-12.2.8-0.el7.x86_64
> >
> >
> > [root@stratonode1 ~]# modinfo libceph
> > filename: /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/net/ceph/libceph.ko.xz
> > license: GPL
> > description: Ceph core library
> > author: Patience Warnick <patience@xxxxxxxxxxxx>
> > author: Yehuda Sadeh <yehuda@xxxxxxxxxxxxxxx>
> > author: Sage Weil <sage@xxxxxxxxxxxx>
> > retpoline: Y
> > rhelversion: 7.6
> > srcversion: 4F8CE6AEFA99B11C267981D
> > depends: libcrc32c,dns_resolver
> > intree: Y
> > vermagic: 3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
> > signer: CentOS Linux kernel signing key
> > sig_key: E7:CE:F3:61:3A:9B:8B:D0:12:FA:E7:49:82:72:15:9B:B1:87:9C:65
> > sig_hashalgo: sha256
> > [root@stratonode1 ~]# modinfo rbd
> > filename: /lib/modules/3.10.0-957.1.3.el7.x86_64/kernel/drivers/block/rbd.ko.xz
> > license: GPL
> > description: RADOS Block Device (RBD) driver
> > author: Jeff Garzik <jeff@xxxxxxxxxx>
> > author: Yehuda Sadeh <yehuda@xxxxxxxxxxxxxxx>
> > author: Sage Weil <sage@xxxxxxxxxxxx>
> > author: Alex Elder <elder@xxxxxxxxxxx>
> > retpoline: Y
> > rhelversion: 7.6
> > srcversion: 5386BBBD00C262C66CB81F5
> > depends: libceph
> > intree: Y
> > vermagic: 3.10.0-957.1.3.el7.x86_64 SMP mod_unload modversions
> > signer: CentOS Linux kernel signing key
> > sig_key: E7:CE:F3:61:3A:9B:8B:D0:12:FA:E7:49:82:72:15:9B:B1:87:9C:65
> > sig_hashalgo: sha256
> > parm: single_major:Use a single major number for all rbd devices (default: true) (bool)
> >
> > I reported the issue here as well:
> > https://bugs.centos.org/view.php?id=15681
> >
> >
> > Any help would be appreciated.
> >
> > Thanks,
> > Rom
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad

--
Jason