On Wed, Sep 21, 2016 at 6:30 AM, Heller, Chris <cheller@xxxxxxxxxx> wrote:
> I'm running a production 0.94.7 Ceph cluster, and have been seeing a
> periodic issue wherein all of my MDS clients become stuck, and the fix so
> far has been to restart the active MDS (sometimes I need to restart the
> subsequent active MDS as well).
>
> These clients are using the cephfs-hadoop API, so there is no kernel
> client or FUSE API involved. When I see clients get stuck, messages like
> the following are printed to stderr:
>
> 2016-09-21 10:31:12.285030 7fea4c7fb700  0 -- 192.168.1.241:0/1606648601 >>
> 192.168.1.195:6801/1674 pipe(0x7feaa0a1e0f0 sd=206 :0 s=1 pgs=0 cs=0 l=0
> c=0x7feaa0a0c500).fault
>
> I'm at somewhat of a loss on where to begin debugging this issue, and
> wanted to ping the list for ideas.

What's the full output of "ceph -s" when this happens? Have you looked at
the MDS' admin socket's ops-in-flight, and that of the clients? (Example
invocations are at the end of this message.)
http://docs.ceph.com/docs/master/cephfs/troubleshooting/ may help some as
well.

> I managed to dump the mds cache during one of the stalled moments, which
> hopefully is a useful starting point:
>
> e51bed37327a676e9974d740a13e173f11d1a11fdba5fbcf963b62023b06d7e8
> mdscachedump.txt.gz (https://filetea.me/t1sz3XPHxEVThOk8tvVTK5Bsg)
>
> -Chris

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
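
For reference, here are rough examples of the admin-socket queries mentioned
above. The daemon IDs, socket paths, and PIDs are placeholders for your
deployment, and the client-side commands assume the cephfs-hadoop (libcephfs)
client has "admin socket = ..." set in its ceph.conf section; without that,
there is no client socket to query:

    # On the host running the active MDS:
    ceph daemon mds.<id> dump_ops_in_flight
    ceph daemon mds.<id> session ls

    # On a stuck client host, against its admin socket
    # (socket path and name are placeholders):
    ceph --admin-daemon /var/run/ceph/ceph-client.<name>.<pid>.asok mds_requests
    ceph --admin-daemon /var/run/ceph/ceph-client.<name>.<pid>.asok mds_sessions

If requests show up as stuck in the client's mds_requests output but never
appear in the MDS ops-in-flight (or vice versa), that helps narrow down
which side is wedged.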