Hi,

the attachments got removed from my previous message, so here are the pastebins:

client vmcore-dmesg: https://pastebin.com/AFZgkpaK
mds.log: https://pastebin.com/FUU6hyya
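For what it's worth, my reading of the Oops below (a hedged guess, not a verified analysis): the faulting bytes in the "Code:" line, <f0> ff 4b 10, decode to "lock decl 0x10(%rbx)", and the register dump shows RBX = 0 with CR2 = 0x10, so ceph_put_snap_realm() appears to have been called from check_quota_exceeded() with a NULL realm pointer and faulted while dropping its refcount. A minimal C sketch of that pattern (a simplified illustration only, not the actual fs/ceph code; the struct name and field offsets are assumptions inferred from the disassembly):

/*
 * Simplified illustration only -- not the real fs/ceph source.
 * Field offsets are assumptions inferred from the disassembly:
 * the crash is a write to offset 0x10 of a NULL struct pointer.
 */
struct snap_realm_sketch {
        unsigned long long ino;   /* assumed at offset 0x00 */
        void *inode;              /* assumed at offset 0x08 */
        int nref;                 /* offset 0x10: target of "lock decl 0x10(%rbx)" */
};

/*
 * The quota check walks up the snap realm hierarchy and drops a
 * reference on each realm it leaves behind.  If it ever hands a NULL
 * realm to the put function, the very first memory access is the
 * refcount at offset 0x10 -- matching CR2 = 0x10 and RBX = 0 above.
 */
static void put_snap_realm_sketch(struct snap_realm_sketch *realm)
{
        if (__sync_sub_and_fetch(&realm->nref, 1) == 0) {
                /* last reference gone: the realm would be freed here */
        }
}

If that reading is right, a write that triggers the quota check while the client's view of the snap realm hierarchy is incomplete (e.g. after the mds hang) could hit this; but again, this is only a guess.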
Best
Dietmar

On 2020-02-13 05:00, Dietmar Rieder wrote:
> Hi,
>
> now we got a kernel crash (Oops), probably related to my issue, since
> it all seems to start with a hung mds (see the attached dmesg from the
> crashed client and the mds log from the mds server):
>
> [281202.923064] Oops: 0002 [#1] SMP
> [281202.924952] Modules linked in: fuse xt_multiport squashfs loop overlay(T) xt_CHECKSUM iptable_mangle tun bridge devlink ebtable_filter ebtables rpcsec_gss_krb5 nfsv4 nfs fscache ceph libceph dns_resolver 8021q garp mrp stp llc bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ip6_tables ipt_REJECT nf_reject_ipv4 ib_srp xt_conntrack scsi_transport_srp scsi_tgt iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 ib_ipoib nf_nat_ipv4 nf_nat nf_conntrack rdma_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support vfat fat sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel
> [281202.937437] lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev lpc_ich hpilo hpwdt sg ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel ixgbe drm tg3 hpsa mdio dca ptp drm_panel_orientation_quirks scsi_transport_sas pps_core
> [281202.949214] CPU: 41 PID: 17638 Comm: sh Kdump: loaded Tainted: G W ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
> [281202.951583] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 11/08/2017
> [281202.953972] task: ffff8c0d71afb150 ti: ffff8b0e63404000 task.ti: ffff8b0e63404000
> [281202.956360] RIP: 0010:[<ffffffffc0cf65b1>] [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
> [281202.958870] RSP: 0018:ffff8b0e63407be8 EFLAGS: 00010246
> [281202.961256] RAX: 0000000000000050 RBX: 0000000000000000 RCX: 0000000000000000
> [281202.963694] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff89b59b37bc00
> [281202.966102] RBP: ffff8b0e63407c00 R08: 000000000000000a R09: 0000000000000000
> [281202.968460] R10: 0000000000001e00 R11: ffff8b0e6340790e R12: ffff89b59b37bc00
> [281202.970831] R13: 0000000000000001 R14: 00000000000000c6 R15: 0000000000000000
> [281202.973168] FS: 00007f074d5e8740(0000) GS:ffff8a9e7fc40000(0000) knlGS:0000000000000000
> [281202.975502] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [281202.977814] CR2: 0000000000000010 CR3: 0000016a50f3a000 CR4: 00000000003607e0
> [281202.980144] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [281202.982474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [281202.984773] Call Trace:
> [281202.987156] [<ffffffffc0cfa40b>] check_quota_exceeded+0x1bb/0x270 [ceph]
> [281202.989508] [<ffffffffc0cfa7d4>] ceph_quota_is_max_bytes_exceeded+0x44/0x60 [ceph]
> [281202.991883] [<ffffffffc0ce2ef2>] ceph_aio_write+0x1e2/0xde0 [ceph]
> [281202.994258] [<ffffffff95c56b13>] ? lookup_fast+0xb3/0x230
> [281202.996607] [<ffffffff95b5938d>] ? call_rcu_sched+0x1d/0x20
> [281202.998947] [<ffffffff95c4d166>] ? put_filp+0x46/0x50
> [281203.001236] [<ffffffff95c49d83>] do_sync_write+0x93/0xe0
> [281203.003566] [<ffffffff95c4a870>] vfs_write+0xc0/0x1f0
> [281203.005884] [<ffffffff95c4b68f>] SyS_write+0x7f/0xf0
> [281203.008152] [<ffffffff9618dede>] system_call_fastpath+0x25/0x2a
> [281203.010368] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 f6 05 3a 8b 02 00 04 48 89 f3 0f 85 89 00 00 00 <f0> ff 4b 10 0f 94 c0 84 c0 75 0c 5b 41 5c 41 5d 5d c3 0f 1f 44
> [281203.015129] RIP [<ffffffffc0cf65b1>] ceph_put_snap_realm+0x21/0xe0 [ceph]
> [281203.017510] RSP <ffff8b0e63407be8>
> [281203.019743] CR2: 0000000000000010
>
>
> # uname -a
> Linux zeus.icbi.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>
> A longer dmesg extract is attached.
>
> Should I file a bug report?
>
> Dietmar
>
> On 2020-02-12 13:32, Dietmar Rieder wrote:
>> Hi,
>>
>> we sometimes lose access to our cephfs mount and get "permission denied"
>> when we try to cd into it. This apparently happens only on some of our
>> HPC cephfs client nodes (fs mounted via the kernel client) when they are
>> busy with computation and I/O.
>>
>> When we then manually force-unmount the fs and remount it, everything
>> works again.
>>
>> This is the dmesg output of the affected client node:
>> <https://pastebin.com/z5wxUgYS>
>>
>> All HPC clients and ceph servers are running CentOS 7.7 with the same
>> kernel:
>>
>> $ uname -a
>> Linux apollo-08.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
>>
>> and all are running ceph version 14.2.7:
>>
>> $ ceph -v
>> ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
>>
>> Maybe someone has an idea what goes wrong and how we can fix/avoid this.
>>
>> Thanks
>> Dietmar
>>
>

--
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Email: dietmar.rieder@xxxxxxxxxxx
Web: http://www.icbi.at
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx