Hi,

now we also got a kernel crash (Oops), probably related to my issue, since it
all seems to start with a hung MDS (see attached dmesg from the crashed client
and the MDS log from the MDS server):

[281202.923064] Oops: 0002 [#1] SMP
[281202.924952] Modules linked in: fuse xt_multiport squashfs loop overlay(T) xt_CHECKSUM iptable_mangle tun bridge devlink ebtable_filter ebtables rpcsec_gss_krb5 nfsv4 nfs fscache ceph libceph dns_resolver 8021q garp mrp stp llc bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ip6_tables ipt_REJECT nf_reject_ipv4 ib_srp xt_conntrack scsi_transport_srp scsi_tgt iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 ib_ipoib nf_nat_ipv4 nf_nat nf_conntrack rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support vfat fat sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel
[281202.937437] lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich hpilo hpwdt sg ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel ixgbe drm tg3 hpsa mdio dca ptp drm_panel_orientation_quirks scsi_transport_sas pps_core
[281202.949214] CPU: 41 PID: 17638 Comm: sh Kdump: loaded Tainted: G W ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
[281202.951583] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 11/08/2017
[281202.953972] task: ffff8c0d71afb150 ti: ffff8b0e63404000 task.ti: ffff8b0e63404000
[281202.956360] RIP: 0010:[<ffffffffc0cf65b1>] [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
[281202.958870] RSP: 0018:ffff8b0e63407be8 EFLAGS: 00010246
[281202.961256] RAX: 0000000000000050 RBX: 0000000000000000 RCX: 0000000000000000
[281202.963694] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff89b59b37bc00
[281202.966102] RBP: ffff8b0e63407c00 R08: 000000000000000a R09: 0000000000000000
[281202.968460] R10: 0000000000001e00 R11: ffff8b0e6340790e R12: ffff89b59b37bc00
[281202.970831] R13: 0000000000000001 R14: 00000000000000c6 R15: 0000000000000000
[281202.973168] FS: 00007f074d5e8740(0000) GS:ffff8a9e7fc40000(0000) knlGS:0000000000000000
[281202.975502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[281202.977814] CR2: 0000000000000010 CR3: 0000016a50f3a000 CR4: 00000000003607e0
[281202.980144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[281202.982474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[281202.984773] Call Trace:
[281202.987156] [<ffffffffc0cfa40b>] check_quota_exceeded+0x1bb/0x270 [ceph]
[281202.989508] [<ffffffffc0cfa7d4>] ceph_quota_is_max_bytes_exceeded+0x44/0x60 [ceph]
[281202.991883] [<ffffffffc0ce2ef2>] ceph_aio_write+0x1e2/0xde0 [ceph]
[281202.994258] [<ffffffff95c56b13>] ? lookup_fast+0xb3/0x230
[281202.996607] [<ffffffff95b5938d>] ? call_rcu_sched+0x1d/0x20
[281202.998947] [<ffffffff95c4d166>] ? put_filp+0x46/0x50
[281203.001236] [<ffffffff95c49d83>] do_sync_write+0x93/0xe0
[281203.003566] [<ffffffff95c4a870>] vfs_write+0xc0/0x1f0
[281203.005884] [<ffffffff95c4b68f>] SyS_write+0x7f/0xf0
[281203.008152] [<ffffffff9618dede>] system_call_fastpath+0x25/0x2a
[281203.010368] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 f6 05 3a 8b 02 00 04 48 89 f3 0f 85 89 00 00 00 <f0> ff 4b 10 0f 94 c0 84 c0 75 0c 5b 41 5c 41 5d 5d c3 0f 1f 44
[281203.015129] RIP [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
[281203.017510] RSP <ffff8b0e63407be8>
[281203.019743] CR2: 0000000000000010

# uname -a
Linux zeus.icbi.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

A longer dmesg extract is attached. Should I file a bug report?

Dietmar

On 2020-02-12 13:32, Dietmar Rieder wrote:
> Hi,
>
> we sometimes lose access to our cephfs mount and get "permission denied"
> when we try to cd into it. This apparently happens only on some of our HPC
> cephfs client nodes (fs mounted via the kernel client) when they are busy
> with calculation and I/O.
>
> When we then manually force-unmount the fs and remount it, everything
> works again.
>
> This is the dmesg output of the affected client node:
> <https://pastebin.com/z5wxUgYS>
>
> All HPC clients and ceph servers are running CentOS 7.7 with the same
> kernel:
>
> $ uname -a
> Linux apollo-08.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4
> 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>
> and all are running ceph version 14.2.7:
>
> $ ceph -v
> ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
> (stable)
>
> Maybe someone has an idea what goes wrong and how we can fix/avoid this.
>
> Thanks
> Dietmar
>

--
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Email: dietmar.rieder@xxxxxxxxxxx
Web: http://www.icbi.at
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
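[Editor's note: the quoted message describes the manual recovery as "force unmount the fs and remount it". A minimal shell sketch of that workaround follows; the mount point, monitor addresses, and credential paths are illustrative placeholders, not values from this cluster, and the function is only defined here, not invoked.]

```shell
# Sketch of the manual cephfs recovery described above, under assumed
# paths and monitor addresses (placeholders, not the real site values).
MOUNTPOINT=/mnt/cephfs
MON_ADDRS="mon1:6789,mon2:6789,mon3:6789"

remount_cephfs() {
    # Try a forced unmount first; fall back to a lazy unmount if the
    # mount is wedged on in-flight I/O.
    umount -f "$1" || umount -l "$1"
    # Remount via the kernel client (mount.ceph from ceph-common).
    mount -t ceph "$MON_ADDRS:/" "$1" \
        -o name=admin,secretfile=/etc/ceph/admin.secret
}

# Example invocation (requires root and an actual cephfs mount):
# remount_cephfs "$MOUNTPOINT"
```

Note that a lazy unmount only detaches the mount point; processes still blocked in the hung mount keep their handles until they exit.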