We are running Ceph 10.2.7, and after adding a new multi-threaded writer application we are seeing hangs when accessing metadata from kernel-mounted CephFS clients. I have a "du -ah /cephfs" process that has been stuck for over 12 hours on one CephFS client system.

We started seeing hung "du -ah" processes two days ago, so yesterday we upgraded the whole cluster from v10.2.5 to v10.2.7, but the problem occurred again last night. Rebooting the client fixes the problem. "ceph -s" reports HEALTH_OK.

We have four CephFS clients, each kernel-mounting our single Ceph file system at /cephfs. The "du -ah /cephfs" runs hourly from a cron-controlled test script. If it does not complete within an hour, emails are sent to the admin group as part of our monitoring (a minimal sketch of that check is appended at the end of this mail). The command normally takes less than a minute to run, and we have just over 3.6M files in this file system. The du -ah hangs while accessing sub-directories that the new multi-threaded writer application is writing into.

About the application: on one Ceph client we download external data over the network and write it as files into the Ceph file system with a Python program. The Python script can write up to 100 files in parallel (a rough sketch of the write pattern is also appended below). The metadata hangs can occur on one or more client systems, but right now only one system is hung, and it is not the node writing the data.

System info:

ceph -s
    cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
     health HEALTH_OK
     monmap e1: 3 mons at {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
            election epoch 138, quorum 0,1,2 mon01,mon02,mon03
      fsmap e3210: 1/1/1 up {0=mds02=up:active}, 2 up:standby
     osdmap e33046: 85 osds: 85 up, 85 in
            flags sortbitwise,require_jewel_osds
      pgmap v27679236: 16192 pgs, 12 pools, 7655 GB data, 6591 kobjects
            24345 GB used, 217 TB / 241 TB avail
               16188 active+clean
                   3 active+clean+scrubbing
                   1 active+clean+scrubbing+deep
  client io 0 B/s rd, 15341 kB/s wr, 0 op/s rd, 21 op/s wr

On the hung client node, we see one entry in mdsc:

cat /sys/kernel/debug/ceph/*/mdsc
163925513   mds0    readdir  #100003be2b1 kplr009658474_dr25_window.fits

I am not seeing this on the other three client nodes.

On the active metadata server, I ran

    ceph daemon mds.mds02 dump_ops_in_flight

every 2 seconds, as the output kept changing. Part of the output is at:
https://paste.fedoraproject.org/paste/OizCowo3oGzZo-cJWV5R~Q

Other info about the systems:

OS: Ubuntu Trusty
CephFS snapshots are turned on and being created hourly.

Ceph version:
ceph -v
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)

Kernel, Ceph servers:
uname -a
Linux mon01 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Kernel, CephFS clients:
uname -a
Linux dfgw02 4.9.21-040921-generic #201704080434 SMP Sat Apr 8 08:35:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Let me know if I should write up a ticket on this.

Thanks
Eric
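
Since the actual downloader script isn't included in this mail, here is a rough sketch of the kind of parallel write pattern it uses. The use of concurrent.futures, the target directory, the URL list, and the function names are illustrative only, not our real code; the one relevant detail is up to 100 writers running in parallel against the kernel-mounted /cephfs.

# Rough sketch only: the real downloader isn't shown in this mail.
# It illustrates up to 100 parallel writers into /cephfs, matching the
# access pattern described above. Paths and URLs are placeholders.
import concurrent.futures
import os
import urllib.request

CEPHFS_DIR = "/cephfs/incoming"           # illustrative target directory
URLS = ["http://example.com/file1.fits"]  # placeholder source list

def fetch_and_write(url):
    # Download one object and write it as a single file into CephFS.
    name = os.path.basename(url)
    data = urllib.request.urlopen(url).read()
    with open(os.path.join(CEPHFS_DIR, name), "wb") as f:
        f.write(data)
    return name

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
    for done in pool.map(fetch_and_write, URLS):
        print("wrote", done)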
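
And a minimal sketch of the hourly monitoring check mentioned above, run from cron on each client. The timeout, the mail addresses, and the use of smtplib are placeholders standing in for our actual script; the point is just that a hung "du -ah /cephfs" triggers a mail after an hour.

# Rough sketch of the hourly cron check: run "du -ah /cephfs" and mail
# the admin group if it does not finish within an hour.
import smtplib
import subprocess
from email.message import EmailMessage

try:
    subprocess.run(["du", "-ah", "/cephfs"],
                   stdout=subprocess.DEVNULL,
                   timeout=3600, check=True)
except subprocess.TimeoutExpired:
    msg = EmailMessage()
    msg["Subject"] = "du -ah /cephfs did not complete within an hour"
    msg["From"] = "cephfs-check@example.com"   # placeholder address
    msg["To"] = "admins@example.com"           # placeholder address
    msg.set_content("du -ah /cephfs appears hung on this client.")
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)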
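
For completeness, the dump_ops_in_flight output on the MDS was captured simply by re-running the command every 2 seconds; something like the following loop (run on the MDS host with access to the admin socket) would do the same thing.

# Sketch: capture the MDS in-flight ops every 2 seconds, since the list
# keeps changing while the client is hung.
import subprocess
import time

while True:
    out = subprocess.run(
        ["ceph", "daemon", "mds.mds02", "dump_ops_in_flight"],
        capture_output=True, text=True).stdout
    print(time.strftime("%Y-%m-%d %H:%M:%S"), out)
    time.sleep(2)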