Hi,

now we also got a kernel crash (Oops), probably related to my issue, since it
all seems to start with a hung MDS (see attached dmesg from the crashed client
and the MDS log from the MDS server):

[281202.923064] Oops: 0002 [#1] SMP
[281202.924952] Modules linked in: fuse xt_multiport squashfs loop overlay(T) xt_CHECKSUM iptable_mangle tun bridge devlink ebtable_filter ebtables rpcsec_gss_krb5 nfsv4 nfs fscache ceph libceph dns_resolver 8021q garp mrp stp llc bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ip6_tables ipt_REJECT nf_reject_ipv4 ib_srp xt_conntrack scsi_transport_srp scsi_tgt iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 ib_ipoib nf_nat_ipv4 nf_nat nf_conntrack rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support vfat fat sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel
[281202.937437] lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich hpilo hpwdt sg ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel ixgbe drm tg3 hpsa mdio dca ptp drm_panel_orientation_quirks scsi_transport_sas pps_core
[281202.949214] CPU: 41 PID: 17638 Comm: sh Kdump: loaded Tainted: G W ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
[281202.951583] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 11/08/2017
[281202.953972] task: ffff8c0d71afb150 ti: ffff8b0e63404000 task.ti: ffff8b0e63404000
[281202.956360] RIP: 0010:[<ffffffffc0cf65b1>] [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
[281202.958870] RSP: 0018:ffff8b0e63407be8 EFLAGS: 00010246
[281202.961256] RAX: 0000000000000050 RBX: 0000000000000000 RCX: 0000000000000000
[281202.963694] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff89b59b37bc00
[281202.966102] RBP: ffff8b0e63407c00 R08: 000000000000000a R09: 0000000000000000
[281202.968460] R10: 0000000000001e00 R11: ffff8b0e6340790e R12: ffff89b59b37bc00
[281202.970831] R13: 0000000000000001 R14: 00000000000000c6 R15: 0000000000000000
[281202.973168] FS: 00007f074d5e8740(0000) GS:ffff8a9e7fc40000(0000) knlGS:0000000000000000
[281202.975502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[281202.977814] CR2: 0000000000000010 CR3: 0000016a50f3a000 CR4: 00000000003607e0
[281202.980144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[281202.982474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[281202.984773] Call Trace:
[281202.987156] [<ffffffffc0cfa40b>] check_quota_exceeded+0x1bb/0x270 [ceph]
[281202.989508] [<ffffffffc0cfa7d4>] ceph_quota_is_max_bytes_exceeded+0x44/0x60 [ceph]
[281202.991883] [<ffffffffc0ce2ef2>] ceph_aio_write+0x1e2/0xde0 [ceph]
[281202.994258] [<ffffffff95c56b13>] ? lookup_fast+0xb3/0x230
[281202.996607] [<ffffffff95b5938d>] ? call_rcu_sched+0x1d/0x20
[281202.998947] [<ffffffff95c4d166>] ? put_filp+0x46/0x50
[281203.001236] [<ffffffff95c49d83>] do_sync_write+0x93/0xe0
[281203.003566] [<ffffffff95c4a870>] vfs_write+0xc0/0x1f0
[281203.005884] [<ffffffff95c4b68f>] SyS_write+0x7f/0xf0
[281203.008152] [<ffffffff9618dede>] system_call_fastpath+0x25/0x2a
[281203.010368] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 f6 05 3a 8b 02 00 04 48 89 f3 0f 85 89 00 00 00 <f0> ff 4b 10 0f 94 c0 84 c0 75 0c 5b 41 5c 41 5d 5d c3 0f 1f 44
[281203.015129] RIP [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
[281203.017510] RSP <ffff8b0e63407be8>
[281203.019743] CR2: 0000000000000010

# uname -a
Linux zeus.icbi.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

A longer dmesg extract is attached. Should I file a bug report?

Dietmar

On 2020-02-12 13:32, Dietmar Rieder wrote:
> Hi,
>
> we sometimes lose access to our cephfs mount and get "permission denied"
> when we try to cd into it. This apparently happens only on some of our HPC
> cephfs client nodes (fs mounted via the kernel client) when they are busy
> with calculation and I/O.
>
> When we then manually force-unmount the fs and remount it, everything
> works again.
>
> This is the dmesg output of the affected client node:
> <https://pastebin.com/z5wxUgYS>
>
> All HPC clients and ceph servers are running CentOS 7.7 with the same
> kernel:
>
> $ uname -a
> Linux apollo-08.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4
> 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>
> and all are running ceph version 14.2.7:
>
> $ ceph -v
> ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
> (stable)
>
> Maybe someone has an idea what goes wrong and how we can fix/avoid this.
>
> Thanks
> Dietmar
>

--
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Email: dietmar.rieder@xxxxxxxxxxx
Web: http://www.icbi.at
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
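[Editor's note: the quoted message describes the manual recovery as "force unmount the fs and remount it". A minimal shell sketch of that workaround follows; the mount point, monitor addresses, and credential paths are illustrative placeholders, not values from this cluster, and the function is only defined here, not invoked.]

```shell
# Sketch of the manual cephfs recovery described above, under assumed
# paths and monitor addresses (placeholders, not the real site values).
MOUNTPOINT=/mnt/cephfs
MON_ADDRS="mon1:6789,mon2:6789,mon3:6789"

remount_cephfs() {
    # Try a forced unmount first; fall back to a lazy unmount if the
    # mount is wedged on in-flight I/O.
    umount -f "$1" || umount -l "$1"
    # Remount via the kernel client (mount.ceph from ceph-common).
    mount -t ceph "$MON_ADDRS:/" "$1" \
        -o name=admin,secretfile=/etc/ceph/admin.secret
}

# Example invocation (requires root and an actual cephfs mount):
# remount_cephfs "$MOUNTPOINT"
```

Note that a lazy unmount only detaches the mount point; processes still blocked in the hung mount keep their handles until they exit.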