Hi,

the attachments got removed from my previous message, so here are the pastebins:

client vmcore-dmesg: https://pastebin.com/AFZgkpaK
mds.log: https://pastebin.com/FUU6hyya
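For what it's worth, my reading of the Oops below (a hedged guess, not a verified analysis): the faulting bytes in the "Code:" line, <f0> ff 4b 10, decode to "lock decl 0x10(%rbx)", and the register dump shows RBX = 0 with CR2 = 0x10, so ceph_put_snap_realm() appears to have been called from check_quota_exceeded() with a NULL realm pointer and faulted while dropping its refcount. A minimal C sketch of that pattern (a simplified illustration only, not the actual fs/ceph code; the struct name and field offsets are assumptions inferred from the disassembly):

/*
 * Simplified illustration only -- not the real fs/ceph source.
 * Field offsets are assumptions inferred from the disassembly:
 * the crash is a write to offset 0x10 of a NULL struct pointer.
 */
struct snap_realm_sketch {
        unsigned long long ino;   /* assumed at offset 0x00 */
        void *inode;              /* assumed at offset 0x08 */
        int nref;                 /* offset 0x10: target of "lock decl 0x10(%rbx)" */
};

/*
 * The quota check walks up the snap realm hierarchy and drops a
 * reference on each realm it leaves behind.  If it ever hands a NULL
 * realm to the put function, the very first memory access is the
 * refcount at offset 0x10 -- matching CR2 = 0x10 and RBX = 0 above.
 */
static void put_snap_realm_sketch(struct snap_realm_sketch *realm)
{
        if (__sync_sub_and_fetch(&realm->nref, 1) == 0) {
                /* last reference gone: the realm would be freed here */
        }
}

If that reading is right, a write that triggers the quota check while the client's view of the snap realm hierarchy is incomplete (e.g. after the mds hang) could hit this; but again, this is only a guess.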
Best
Dietmar

On 2020-02-13 05:00, Dietmar Rieder wrote:
> Hi,
>
> now we got a kernel crash (Oops), probably related to my issue, since
> it all seems to start with a hung mds (see the attached dmesg from the
> crashed client and the mds log from the mds server):
>
> [281202.923064] Oops: 0002 [#1] SMP
> [281202.924952] Modules linked in: fuse xt_multiport squashfs loop overlay(T) xt_CHECKSUM iptable_mangle tun bridge devlink ebtable_filter ebtables rpcsec_gss_krb5 nfsv4 nfs fscache ceph libceph dns_resolver 8021q garp mrp stp llc bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ip6_tables ipt_REJECT nf_reject_ipv4 ib_srp xt_conntrack scsi_transport_srp scsi_tgt iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 ib_ipoib nf_nat_ipv4 nf_nat nf_conntrack rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support vfat fat sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel
> [281202.937437] lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich hpilo hpwdt sg ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel ixgbe drm tg3 hpsa mdio dca ptp drm_panel_orientation_quirks scsi_transport_sas pps_core
> [281202.949214] CPU: 41 PID: 17638 Comm: sh Kdump: loaded Tainted: G W ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
> [281202.951583] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 11/08/2017
> [281202.953972] task: ffff8c0d71afb150 ti: ffff8b0e63404000 task.ti: ffff8b0e63404000
> [281202.956360] RIP: 0010:[<ffffffffc0cf65b1>] [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
> [281202.958870] RSP: 0018:ffff8b0e63407be8 EFLAGS: 00010246
> [281202.961256] RAX: 0000000000000050 RBX: 0000000000000000 RCX: 0000000000000000
> [281202.963694] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff89b59b37bc00
> [281202.966102] RBP: ffff8b0e63407c00 R08: 000000000000000a R09: 0000000000000000
> [281202.968460] R10: 0000000000001e00 R11: ffff8b0e6340790e R12: ffff89b59b37bc00
> [281202.970831] R13: 0000000000000001 R14: 00000000000000c6 R15: 0000000000000000
> [281202.973168] FS: 00007f074d5e8740(0000) GS:ffff8a9e7fc40000(0000) knlGS:0000000000000000
> [281202.975502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [281202.977814] CR2: 0000000000000010 CR3: 0000016a50f3a000 CR4: 00000000003607e0
> [281202.980144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [281202.982474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [281202.984773] Call Trace:
> [281202.987156] [<ffffffffc0cfa40b>] check_quota_exceeded+0x1bb/0x270 [ceph]
> [281202.989508] [<ffffffffc0cfa7d4>] ceph_quota_is_max_bytes_exceeded+0x44/0x60 [ceph]
> [281202.991883] [<ffffffffc0ce2ef2>] ceph_aio_write+0x1e2/0xde0 [ceph]
> [281202.994258] [<ffffffff95c56b13>] ? lookup_fast+0xb3/0x230
> [281202.996607] [<ffffffff95b5938d>] ? call_rcu_sched+0x1d/0x20
> [281202.998947] [<ffffffff95c4d166>] ? put_filp+0x46/0x50
> [281203.001236] [<ffffffff95c49d83>] do_sync_write+0x93/0xe0
> [281203.003566] [<ffffffff95c4a870>] vfs_write+0xc0/0x1f0
> [281203.005884] [<ffffffff95c4b68f>] SyS_write+0x7f/0xf0
> [281203.008152] [<ffffffff9618dede>] system_call_fastpath+0x25/0x2a
> [281203.010368] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 f6 05 3a 8b 02 00 04 48 89 f3 0f 85 89 00 00 00 <f0> ff 4b 10 0f 94 c0 84 c0 75 0c 5b 41 5c 41 5d 5d c3 0f 1f 44
> [281203.015129] RIP [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
> [281203.017510] RSP <ffff8b0e63407be8>
> [281203.019743] CR2: 0000000000000010
>
>
> # uname -a
> Linux zeus.icbi.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>
> A longer dmesg extract is attached.
>
> Should I file a bug report?
>
> Dietmar
>
> On 2020-02-12 13:32, Dietmar Rieder wrote:
>> Hi,
>>
>> we sometimes lose access to our cephfs mount and get "permission denied"
>> when we try to cd into it. This apparently happens only on some of our
>> HPC cephfs client nodes (fs mounted via the kernel client) when they are
>> busy with computation and I/O.
>>
>> When we then manually force-unmount the fs and remount it, everything
>> works again.
>>
>> This is the dmesg output of the affected client node:
>> <https://pastebin.com/z5wxUgYS>
>>
>> All HPC clients and ceph servers are running CentOS 7.7 with the same
>> kernel:
>>
>> $ uname -a
>> Linux apollo-08.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and all are running ceph version 14.2.7:
>>
>> $ ceph -v
>> ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
>>
>> Maybe someone has an idea what goes wrong and how we can fix/avoid this.
>>
>> Thanks
>> Dietmar
>>
>

--
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Email: dietmar.rieder@xxxxxxxxxxx
Web: http://www.icbi.at
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx