I have created a ticket on this issue:
http://tracker.ceph.com/issues/20329

On Thu, Jun 15, 2017 at 12:14 PM, Eric Eastman
<eric.eastman@xxxxxxxxxxxxxx> wrote:
> On Thu, Jun 15, 2017 at 11:45 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
>> Have you compared performance to mounting cephfs using ceph-fuse instead
>> of the kernel client?
>
> We have tested both, and with our applications the kernel mounted file
> systems have been much faster than the fuse mounted tests.
>
>> A very interesting thing that ceph-fuse does is that an ls -lhd of a
>> directory shows the directory structure's size. It's a drastically
>> faster response than a du for the size of a folder.
>
> The "du -ah" is run to scan for hangs. We only look at the output
> when there is a problem. A while ago we had a 4.9 kernel issue that
> was causing hangs, so we put in the du -ah to walk the file system
> hourly and report if it was hung, and we left it in after we installed
> the 4.9.21 kernel that had the fix. Until we started running the new
> application, the system had been very stable.
>
>> If you're deleting snapshots each hour as well, that might be a place
>> to look for odd cluster happenings as well.
>
> Currently the file system is only 10% full, so we are not deleting any
> snapshots.
>
> Even if our application is not properly architected for a shared file
> system, the file system should not hang.
>
> Thanks,
> Eric
>
>> On Thu, Jun 15, 2017 at 12:39 PM Eric Eastman <eric.eastman@xxxxxxxxxxxxxx>
>> wrote:
>>>
>>> We are running Ceph 10.2.7, and after adding a new multi-threaded
>>> writer application we are seeing hangs accessing metadata from ceph
>>> file system kernel mounted clients. I have a "du -ah /cephfs" process
>>> that has been stuck for over 12 hours on one cephfs client system. We
>>> started seeing hung "du -ah" processes two days ago, so yesterday we
>>> upgraded the whole cluster from v10.2.5 to v10.2.7, but the problem
>>> occurred again last night. Rebooting the client fixes the problem.
>>> The ceph -s command is showing HEALTH_OK.
>>>
>>> We have four ceph file system clients, each kernel mounting our one
>>> ceph file system at /cephfs. The "du -ah /cephfs" runs hourly within
>>> a test script that is cron controlled. If the du -ah /cephfs does not
>>> complete within an hour, emails are sent to the admin group as part of
>>> our monitoring process. This command normally takes less than a minute
>>> to run, and we have just over 3.6M files in this file system. The du
>>> -ah is hanging while accessing sub-directories where the new
>>> multi-threaded writer application is writing.
>>>
>>> About the application: On one ceph client we are downloading external
>>> data via the network and writing the data as files with a python
>>> program into the ceph file system. The python script can write up to
>>> 100 files in parallel. The metadata hangs we are seeing can occur on
>>> one or more client systems, but right now it is only hung on one
>>> system, which is not the node writing the data.
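
[Editor's note: for illustration, a minimal sketch of what a parallel writer along the lines described above might look like. The worker count of 100 matches the post; fetch_object(), the object names, and the /cephfs/incoming path are hypothetical placeholders, not the poster's actual script.]

import os
from concurrent.futures import ThreadPoolExecutor

DEST_DIR = "/cephfs/incoming"   # hypothetical subdirectory on the kernel-mounted cephfs

def fetch_object(name):
    """Placeholder for the network download step described in the post."""
    return b"..." * 1024

def download_and_write(name):
    data = fetch_object(name)
    path = os.path.join(DEST_DIR, name)
    # Write to a temporary name first, then rename, so readers never see
    # partially written files. (The post does not say whether the real
    # script does this; it is just one reasonable choice.)
    tmp = path + ".part"
    with open(tmp, "wb") as f:
        f.write(data)
    os.rename(tmp, path)
    return path

if __name__ == "__main__":
    names = ["object_%06d.fits" % i for i in range(1000)]  # hypothetical work list
    with ThreadPoolExecutor(max_workers=100) as pool:
        for written in pool.map(download_and_write, names):
            pass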
>>>
>>> System info:
>>>
>>> ceph -s
>>>     cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
>>>      health HEALTH_OK
>>>      monmap e1: 3 mons at
>>> {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>>>             election epoch 138, quorum 0,1,2 mon01,mon02,mon03
>>>       fsmap e3210: 1/1/1 up {0=mds02=up:active}, 2 up:standby
>>>      osdmap e33046: 85 osds: 85 up, 85 in
>>>             flags sortbitwise,require_jewel_osds
>>>       pgmap v27679236: 16192 pgs, 12 pools, 7655 GB data, 6591 kobjects
>>>             24345 GB used, 217 TB / 241 TB avail
>>>                16188 active+clean
>>>                    3 active+clean+scrubbing
>>>                    1 active+clean+scrubbing+deep
>>>   client io 0 B/s rd, 15341 kB/s wr, 0 op/s rd, 21 op/s wr
>>>
>>> On the hung client node, we are seeing an entry in mdsc:
>>> cat /sys/kernel/debug/ceph/*/mdsc
>>> 163925513  mds0  readdir  #100003be2b1 kplr009658474_dr25_window.fits
>>>
>>> I am not seeing this on the other 3 client nodes.
>>>
>>> On the active metadata server, I ran:
>>>
>>> ceph daemon mds.mds02 dump_ops_in_flight
>>>
>>> every 2 seconds, as it kept changing. Part of the output is at:
>>> https://paste.fedoraproject.org/paste/OizCowo3oGzZo-cJWV5R~Q
>>>
>>> Info about the system:
>>>
>>> OS: Ubuntu Trusty
>>>
>>> Cephfs snapshots are turned on and being created hourly.
>>>
>>> Ceph version:
>>> ceph -v
>>> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>>
>>> Kernel, Ceph servers:
>>> uname -a
>>> Linux mon01 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22
>>> 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Kernel, Cephfs clients:
>>> uname -a
>>> Linux dfgw02 4.9.21-040921-generic #201704080434 SMP Sat Apr 8
>>> 08:35:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Let me know if I should write up a ticket on this.
>>>
>>> Thanks
>>>
>>> Eric
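
[Editor's note: the repeated "ceph daemon mds.mds02 dump_ops_in_flight" capture described above can be scripted. A minimal sketch follows, assuming it runs on the host where the mds02 admin socket lives; the 2-second interval and daemon name come from the post, while the log file name is a hypothetical choice.]

import json
import subprocess
import time

def dump_ops_in_flight(mds_name="mds02"):
    # Query the MDS admin socket via the ceph CLI; output is JSON.
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.%s" % mds_name, "dump_ops_in_flight"])
    return json.loads(out)

if __name__ == "__main__":
    with open("mds_ops_in_flight.log", "a") as log:
        while True:
            ops = dump_ops_in_flight()
            # Record a timestamp and the op count, then the full dump.
            log.write("%s %d ops\n" % (time.strftime("%F %T"), ops.get("num_ops", 0)))
            json.dump(ops, log)
            log.write("\n")
            log.flush()
            time.sleep(2)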